Question on performance of fix_nve.cpp

Hello,

I am not sure if this is the appropriate place to ask this, but I have a few questions on two specific functions in fix_nve.cpp (while using the gpu package). When doing some profiling of LAMMPS using a testing dataset, one of the things I noticed was that a significant portion of execution time (>20%) was spent in FixNVE::initial_integrate and FixNVE::final_integrate. Specifically, the time is spent in the loops performing the calculation (such as this one from initial_integrate):

    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit) {
        dtfm = dtf / mass[type[i]];
        v[i][0] += dtfm * f[i][0];
        v[i][1] += dtfm * f[i][1];
        v[i][2] += dtfm * f[i][2];
        x[i][0] += dtv * v[i][0];
        x[i][1] += dtv * v[i][1];
        x[i][2] += dtv * v[i][2];
      }

My questions really surround why this calculation is performed on the CPU instead of on the GPU. My initial assumption as to why this is a CPU-bound calculation was that V, X, and F are allocated in a non-contiguous manner (i.e. V[0][0] is not contiguous up to V[nlocal-1][2]). However, for my specific test, it appears that these values are always allocated in a contiguous manner. A manual conversion of both initial_integrate and final_integrate to the GPU actually improved performance/reduced execution time by about 17%.

My questions boil down to the following:

  1. What cases (if any) are V, F, and X possibly not laid out contiguously? From the source-code level, it appears that they are guaranteed to be contiguous; however, I am only looking at a (very) small portion of the LAMMPS code base (that was identified using a prototype performance tool), so this assumption could easily be wrong.

  2. Has there been any thought about converting these functions (or are there conversions that already exist) to the GPU?

Thanks,
Ben

Hello,

I am not sure if this is the appropriate place to ask this, but I have a
few questions on two specific functions in fix_nve.cpp (while using the gpu
package). When doing some profiling of LAMMPS using a testing dataset, one
of the things I noticed was that a significant portion of execution time
(>20%) was spent in FixNVE::initial_integrate and FixNVE::final_integrate.
Specifically, the time is spent in the loops performing the calculation
(such as this one from initial_integrate):

    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit) {
        dtfm = dtf / mass[type[i]];
        v[i][0] += dtfm * f[i][0];
        v[i][1] += dtfm * f[i][1];
        v[i][2] += dtfm * f[i][2];
        x[i][0] += dtv * v[i][0];
        x[i][1] += dtv * v[i][1];
        x[i][2] += dtv * v[i][2];
      }

My questions really surround why this calculation is performed on the CPU
instead of on the GPU.

the reason is simplicity of data management, and the speedup gained here
is smaller. that is even more true when running with MPI.
the GPU package supports attaching multiple MPI ranks to the same GPU,
and then you can speed up the non-GPU-accelerated parts with MPI (for as
long as you have enough CPU cores) and even OpenMP on top of that (if you
have even more CPU cores per GPU). this also applies to the force
computation, where only pair and kspace are ported to the GPU, and in
multi-MPI-per-GPU setups it is often even faster to run kspace on the CPU,
concurrently to the GPU. bond/angle/dihedral/improper are always run on
the CPU.

when keeping all the data on the GPU, you have to make certain that *all*
fixes, computes, and so on are running on the GPU. otherwise, you have to
frequently synchronize data between the GPU and the host and your
performance is limited. the KOKKOS package follows this keep-all-data-on-
the-GPU approach when compiled with CUDA support. the downside is that
you cannot attach multiple MPI tasks to the same GPU.

My initial assumption as to why this is a CPU-bound calculation was that
V, X, and F are allocated in a non-contiguous manner (i.e. V[0][0] is not
contiguous up to V[nlocal-1][2]). However, for my specific test, it appears
that these values are always allocated in a contiguous manner. A manual
conversion of both initial_integrate and final_integrate to the GPU
actually improved performance/reduced execution time by about 17%.

My questions boil down to the following:

1. What cases (if any) are V, F, and X possibly not laid out contiguously?
From the source-code level, it appears that they are guaranteed to be
contiguous; however, I am only looking at a (very) small portion of the
LAMMPS code base (that was identified using a prototype performance tool),
so this assumption could easily be wrong.

data is contiguous, but the layout and data structures on the host are not
good for vectorization. coordinates are stored in
x0,y0,z0,x1,y1,z1,x2,y2,z2,... order. same for forces and velocities.

2. Has there been any thought about converting these functions (or are
there conversions that already exist) to the GPU?

there was a package (now discontinued) that did this. in some cases
(typically very large systems), this approach was better; in other cases,
typically a smaller number of atoms per MPI/GPU task, the current approach
of the GPU package was more efficient.

before you start coding, i would recommend you look into how the GPU
package performs with oversubscribed GPUs, and also explore the GPU proxy
daemon (if your setup supports this), which will allow even more host
parallelism and higher levels of GPU oversubscription.

axel.

The design of the GPU package is to only put the pair style (and KSpace)
onto the GPU. By contrast, the KOKKOS package will try to put everything
it can (including fixes, like fix nve) onto the GPU.

Depending on what problem size, potential, etc. you run, sometimes KOKKOS
is faster than the GPU package, sometimes vice versa.

Steve

By contrast, the KOKKOS package will try to put everything it can
(including fixes, like fix nve) onto the GPU.

Right, you should try out Kokkos, it runs all the fixes it can on the GPU.

the downside is that you cannot attach multiple MPI tasks to the same GPU.

Actually Kokkos now supports this too, but it may not be faster, especially without CUDA MPS.

Stan

Thanks for the quick (and extremely informative) replies, Steve, Stan, and Axel.

I should have mentioned this in my original e-mail, but I am a GPU performance tools researcher and LAMMPS is one of the many applications which I use for experiments. This code region was flagged by one of the experimental techniques that I am currently exploring. At a high level, this specific technique takes into account runtime memory layout and access patterns to determine the suitability of a CPU code region for the GPU. My reason for reaching out to developers in general is really twofold: (1) to get more information about the possible performance issues that are found so that I can refine my techniques to give more actionable information, and (2) more importantly, to illuminate possible issues that the developers may be unaware of.

data is contiguous, but the layout and data structures on the host are not good for vectorization. coordinates are stored in x0,y0,z0,x1,y1,z1,x2,y2,z2,... order. same for forces and velocities.

If I am reading this correctly, the layout in memory of the values "x0,y0,z0,x1,y1,z1,x2,y2,z2,..." would be contiguous (i.e. values are stored in a contiguous memory range starting at x0 and ending at z2) and this holds for all three variables (V, F, X). If my understanding is correct, this example can be rewritten as something similar to the following:

    x = &(X[0][0]);
    v = &(V[0][0]);
    f = &(F[0][0]);
    for (int i = 0; i < nlocal * 3; i++)
      if (mask[i/3] & groupbit) {
        dtfm = dtf / mass[type[i/3]];
        v[i] += dtfm * f[i];
        x[i] += dtv * v[i];
      }

Which is not horrible for GPU vectorization, though it likely would only be faster than the CPU under specific conditions. My current testing setup for LAMMPS is far from expansive in terms of coverage, testing only a small number of application conditions (which I am in the process of expanding), so I cannot give a complete picture as to under what conditions it would be faster. Also, as mentioned, this only works at all if the underlying data stored within these three variables is always contiguous (it is in my testing so far, but that may not be the case in all setups).

Right, you should try out Kokkos, it runs all the fixes it can on the GPU.

I will definitely try the Kokkos package. The machine I am running experiments on (ORNL Titan) supports CUDA MPS (well, Cray's implementation of CUDA MPS at least). The fact that the fixes are already on the GPU may serve as a very good test and refinement suite for this specific technique, especially given how rare it is to see two (well-)maintained implementations with (relatively) the same functionality for both the CPU and GPU.

Thanks again for the quick (and very informative) responses on this.

Ben

Ben, I’m happy to help with any Kokkos compiling, runtime, or performance issues, let us know.

Stan