Hello,
I am not sure if this is the appropriate place to ask, but I have a few questions about two specific functions in fix_nve.cpp (while using the GPU package). When profiling LAMMPS with a test dataset, one thing I noticed was that a significant portion of execution time (>20%) was spent in FixNVE::initial_integrate and FixNVE::final_integrate. Specifically, the time is spent in the loops performing the calculation, such as this one from initial_integrate:
for (int i = 0; i < nlocal; i++)
  if (mask[i] & groupbit) {
    dtfm = dtf / mass[type[i]];
    v[i][0] += dtfm * f[i][0];
    v[i][1] += dtfm * f[i][1];
    v[i][2] += dtfm * f[i][2];
    x[i][0] += dtv * v[i][0];
    x[i][1] += dtv * v[i][1];
    x[i][2] += dtv * v[i][2];
  }
My questions really surround why this calculation is performed on the CPU instead of on the GPU. My initial assumption was that it is CPU-bound because v, x, and f are allocated in a non-contiguous manner (i.e. v[0][0] is not contiguous up to v[nlocal-1][2]). However, for my specific test, these values always appear to be allocated contiguously. A manual conversion of both initial_integrate and final_integrate to the GPU actually reduced execution time by about 17%.
My questions boil down to the following:

In what cases (if any) might v, f, and x not be laid out contiguously? At the source-code level they appear to be guaranteed contiguous, but I am only looking at a (very) small portion of the LAMMPS code base (identified using a prototype performance tool), so this assumption could easily be wrong.

Has there been any thought about porting these functions to the GPU (or do such ports already exist)?
Thanks,
Ben