there are many factors that affect GPU acceleration:
- hardware choices, e.g. balance between GPU and CPU, mainboard design, choice of CPU and memory speed, GPU layout.
- implementation (e.g. KOKKOS vs. GPU package), compilation settings, precision selection (single vs. mixed vs. double precision)
I tried this one this morning. It only works with DOUBLE_DOUBLE. If I use SINGLE_SINGLE or SINGLE_DOUBLE, LAMMPS (17Nov16) complains and crashes (K80):
Initializing Device and compiling on process 0…Done.
Initializing Devices 0-1 on core 0…Done.
Initializing Devices 0-1 on core 1…Done.
Initializing Devices 0-1 on core 2…Done.
Initializing Devices 0-1 on core 3…Done.
Initializing Devices 0-1 on core 4…Done.
Initializing Devices 0-1 on core 5…Done.
Initializing Devices 0-1 on core 6…Done.
Initializing Devices 0-1 on core 7…Done.
Initializing Devices 0-1 on core 8…Done.
Initializing Devices 0-1 on core 9…Done.
Initializing Devices 0-1 on core 10…Done.
Initializing Devices 0-1 on core 11…Done.
Neighbor list info …
0 neighbor list requests
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 7.3
ghost atom cutoff = 7.3
binsize = 3.65, bins = 51 51 51
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Cuda driver error 700 in call at file 'geryon/nvd_timer.h' in line 76.
[cli_12]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 12
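for reference, with the GPU package the precision is fixed when the GPU support library is compiled, not at run time. a minimal sketch of the relevant lines, assuming the usual lib/gpu Makefile conventions (Makefile.linux is an example name, pick the one for your machine):

```make
# in lib/gpu/Makefile.linux -- exactly one precision macro should be active:
CUDA_PRECISION = -D_DOUBLE_DOUBLE    # all double precision
#CUDA_PRECISION = -D_SINGLE_DOUBLE   # mixed precision
#CUDA_PRECISION = -D_SINGLE_SINGLE   # all single precision
CUDA_ARCH      = -arch=sm_37         # sm_37 matches a Kepler K80
```

after changing the setting, the GPU library and then LAMMPS have to be recompiled. a crash like the one above with single/mixed precision is still worth reporting, since it may be a bug rather than a configuration issue.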
- run-time settings, the balance between tasks that can run on the GPU and those that must run on the CPU, and the amount of data that has to be moved between host and GPU
- nature of the problem at hand
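to illustrate the run-time settings point: with the GPU package, accelerator use is typically selected via command-line switches. a sketch, assuming an MPI-parallel binary named lmp built with the GPU package and a hypothetical input file in.script:

```sh
# 24 MPI ranks sharing 2 GPUs; -sf gpu appends the /gpu suffix to styles
# that have a GPU version, -pk gpu 2 sets the package options (2 GPUs/node)
mpirun -np 24 lmp -sf gpu -pk gpu 2 -in in.script
```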
for all of this the “weakest link” principle applies, i.e. if all of these factors fit well, you’ll see good GPU acceleration; if one or more doesn’t fit so well, you may see quite significant degradation of the GPU acceleration. also, one has to scale one’s expectations of acceleration: when running a K80 with 24 CPU cores, you have 12 cores/GPU, and thus a speedup of 3x actually means 3x for each of those CPU cores. OTOH, with the price of a single K80 being similar to (if not larger than) that of a reasonably equipped 24-core node, it is not a great deal to get 3x instead of 2x speed, considering that GPU acceleration can depend a lot on the system at hand and the choice of input settings. this is the context in which my comment about “moderate GPU acceleration” has to be seen. there are scenarios where the GPU acceleration can be much larger.
e.g. for the 24-core CPU + 2x GPU layout and using the GPU package, it would be a better choice to run the long-range electrostatics on the CPU while computing the real-space coulomb part on the GPU, and one can then tweak the real-space coulomb cutoff until it reaches the optimal balance. this often results in different optimal settings for CPU and GPU runs.
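a sketch of what that could look like in an input file (the cutoff and the PPPM tolerance are placeholder values to be tuned, not recommendations):

```
# 2 GPUs per node; split 1.0 = compute all pair forces on the GPU
package      gpu 2 split 1.0
# explicit /gpu suffix: real-space LJ + coulomb runs on the GPU
pair_style   lj/cut/coul/long/gpu 8.0
# plain pppm (no /gpu suffix): long-range part stays on the CPU
kspace_style pppm 1.0e-4
```

using the explicit /gpu suffix on the pair style only (instead of a global -sf gpu switch) is what keeps pppm on the CPU; shrinking or growing the 8.0 cutoff then shifts work between the real-space (GPU) and k-space (CPU) parts.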
also, one has to factor in the cost of non-accelerated operations, e.g. all kinds of computes and fixes. for the KOKKOS package, one has to keep in mind that it only supports double precision, and that the extra transfer of data between host and GPU for fixes/computes not ported to KOKKOS can make it more costly than the GPU package, where ideally only position data is sent to the GPU and force data is returned.
Actually, when I use KOKKOS package, I get the following warning:
WARNING: Fixes cannot send data in Kokkos communication, switching to classic communication (…/comm_kokkos.cpp:365)
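for comparison, a typical KOKKOS GPU run is selected entirely from the command line. a sketch, again assuming a KOKKOS-enabled binary named lmp and a hypothetical input file:

```sh
# -k on g 2 : enable KOKKOS with 2 GPUs per node
# -sf kk    : append the /kk suffix to styles that have a KOKKOS version
# -pk kokkos neigh full : example package option (full neighbor lists)
mpirun -np 12 lmp -k on g 2 -sf kk -pk kokkos neigh full -in in.script
```

the warning quoted above is harmless in itself, but it signals exactly the kind of extra host/GPU communication that can eat into KOKKOS performance.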
finding optimal choices here is non-trivial and requires spending some time understanding what all the many options mean, and then experimenting extensively with specific settings and choices to learn where a particular simulation input can benefit from acceleration the most. there are no simple “do this, not that” rules here.
Clearly, it depends on many factors.
Christophe