Hello,
I am not sure if this is the appropriate place to ask, but I have a few questions about two specific functions in fix_nve.cpp (while using the gpu package). When profiling LAMMPS with a test dataset, one of the things I noticed was that a significant portion of execution time (>20%) was spent in FixNVE::initial_integrate and FixNVE::final_integrate. Specifically, the time is spent in the loops performing the calculation, such as this one from initial_integrate:
  for (int i = 0; i < nlocal; i++)
    if (mask[i] & groupbit) {
      dtfm = dtf / mass[type[i]];
      v[i][0] += dtfm * f[i][0];
      v[i][1] += dtfm * f[i][1];
      v[i][2] += dtfm * f[i][2];
      x[i][0] += dtv * v[i][0];
      x[i][1] += dtv * v[i][1];
      x[i][2] += dtv * v[i][2];
    }
My questions really surround why this calculation is performed on the CPU instead of on the GPU. My initial assumption as to why this is a CPU-bound calculation was that v, x, and f are allocated in a non-contiguous manner (i.e. v[0][0] is not contiguous up to v[nlocal-1][2]). However, for my specific test, these values appear to always be allocated contiguously. A manual conversion of both initial_integrate and final_integrate to the GPU actually improved performance/reduced execution time by about 17%.
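For what it's worth, the contiguity assumption is exactly what made my conversion straightforward: if v, x, and f are each backed by a single contiguous block of 3*nlocal doubles, the doubly indexed loop collapses into a flat 1D loop, which maps directly onto a GPU kernel (or vectorizes well on the CPU). A minimal CPU-side sketch of that flattened form, with the group-mask check dropped for clarity since my test group contained all atoms (the function name is mine, not from fix_nve.cpp):

```cpp
#include <cstddef>

// Flattened form of the NVE velocity/position update, assuming v, x, and f
// are each one contiguous block of 3*nlocal doubles. The real loop hoists
// dtfm once per atom; here it is recomputed per component for simplicity
// (k / 3 recovers the atom index from the flat component index).
void nve_initial_integrate_flat(double *v, double *x, const double *f,
                                const double *mass, const int *type,
                                double dtf, double dtv, int nlocal) {
  for (int k = 0; k < 3 * nlocal; k++) {
    const double dtfm = dtf / mass[type[k / 3]];
    v[k] += dtfm * f[k];
    x[k] += dtv * v[k];
  }
}
```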
My questions boil down to the following:
- In what cases (if any) are v, f, and x possibly not laid out contiguously? At the source code level they appear to be guaranteed contiguous, but I am only looking at a (very) small portion of the LAMMPS code base (identified using a prototype performance tool), so this assumption could easily be wrong.
- Has there been any thought about converting these functions to the GPU, or do such conversions already exist?
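Regarding the first question, my (possibly wrong) reading of the allocation code is that LAMMPS-style 2D per-atom arrays are built as one contiguous data block plus a separate array of row pointers, along these lines (a simplified sketch of the pattern, not the actual LAMMPS code):

```cpp
#include <cstdlib>

// Simplified sketch of a LAMMPS-style 2D array: one contiguous block of
// n*m doubles plus an array of row pointers into it, so array[0][0]
// through array[n-1][m-1] are contiguous in memory even though the array
// is accessed with two subscripts.
double **create_2d(int n, int m) {
  double *data = (double *) std::malloc((size_t) n * m * sizeof(double));
  double **array = (double **) std::malloc((size_t) n * sizeof(double *));
  for (int i = 0; i < n; i++) array[i] = &data[(size_t) i * m];
  return array;
}

void destroy_2d(double **array) {
  std::free(array[0]);  // array[0] points at the contiguous data block
  std::free(array);
}
```

If that is the only allocation path for x, v, and f, then treating them as flat 3*nlocal buffers for a GPU upload would always be safe, which is what my second question is really asking.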
Thanks,
Ben