Poor performance with KOKKOS + GPU

Dear all,

A few days ago Axel told me that the speedup factor I get using the GPU package was moderate. Maybe my problem simply does not fit the GPU well, I do not know. For example, for a 20 keV PKA (primary knock-on atom) in Fe I get the following times:

  • 24 procs: 8m18s
  • 24 procs + 1 K80: 2m52s

That is not even a factor of 3 (8m18s / 2m52s ≈ 2.9). I do not know if I am doing something wrong or whether it could be improved. I tried playing with the parameters of the package command but did not observe any improvement.

Therefore, I tried the KOKKOS package with the GPU to see if I could get better performance. However, the performance was very bad: for the same simulation as above it took even longer, 10m50s. Since this was the first time I used the KOKKOS package, perhaps I did not install it correctly or did something wrong. This is what I did to install it:

  • Edit Makefile.kokkos_cuda_mpich:
    KOKKOS_DEVICES = Cuda, OpenMP
    KOKKOS_ARCH = Kepler37

  • make yes-kokkos

  • make clean-all

  • make kokkos_cuda_mpich
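
A quick way to double-check the package step before compiling is the build system's own package report (this assumes a standard make-based source build):

make package-status

KOKKOS should be listed as installed; if it is not, the make yes-kokkos step did not take effect.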

To run the simulation with the KOKKOS package, 24 cores, and one K80, I did the following:

mpirun -np 1 lammps_kokkos_cuda_mpich -k on t 24 g 2 -sf kk -in in.script

or

mpirun -np 1 -ppn 24 lammps_kokkos_cuda_mpich -k on g 2 -sf kk -in in.script

In both cases it took 10m50s, i.e., even slower than with only 24 cores and no GPU.
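
For reference, the KOKKOS package convention is one MPI task per GPU, so with a K80 exposed as two devices a sketch of the mapping would look like this (only the rank count differs from the commands above; whether it helps here is untested):

mpirun -np 2 lammps_kokkos_cuda_mpich -k on g 2 -sf kk -in in.script

With -np 2 and g 2, each MPI rank drives one half of the K80; the remaining CPU cores sit idle unless host threads are enabled as well.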

Could someone tell me if I did something wrong in the installation or in how I run the simulation?

Many thanks in advance and best regards,
Christophe

If you only have one GPU, I don’t think it makes sense to set g to 2? Unless the K80 is seen as two GPUs instead of one, I do not know.

For a simple LJ liquid I get about a factor of 13 speedup with a GTX 1070 over a single core, but this typically only holds for pretty large systems (in my case 64000 atoms). How many atoms do you have?

Furthermore, you can do some tweaking with the package command. I am sure other people (Stan and/or Axel) will have more useful things to say.
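
For example, with the GPU package one could try letting LAMMPS balance the force computation between host and device dynamically (a sketch, not tuned for this problem; split < 0 enables dynamic load balancing):

package gpu 2 neigh yes split -1.0

or equivalently -pk gpu 2 neigh yes split -1.0 on the command line.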

there are many factors that affect GPU acceleration:

  • hardware choices, e.g. balance between GPU and CPU, mainboard design, choice of CPU and memory speed, GPU layout.
  • implementation (e.g. KOKKOS vs. GPU package), compilation settings, precision selection (single vs. mixed vs. double precision)
  • run time settings, balance between tasks that can run on the GPU and those that must run on the CPU, amount of data to be moved between host and GPU
  • nature of the problem at hand

for all of this the “weakest link” principle applies, i.e. if all of them fit well, you’ll see good GPU acceleration; if one or more doesn’t fit so well, you may see quite significant degradation of the GPU acceleration. also, one has to scale one’s expectations of the acceleration: when running a K80 with 24 CPU cores, you have 12 cores/GPU, so a speedup of 3 actually means 3x for each CPU core. OTOH, with the price of a single K80 being similar to (if not larger than) that of a reasonably equipped 24-core node, it is not a great deal whether you get 3x vs 2x speed, considering that GPU acceleration can depend strongly on the system at hand and the choice of input settings. this is the context in which my comment about “moderate GPU acceleration” has to be seen. there are scenarios where the GPU acceleration can be much larger.

e.g. for the 24-core CPU + 2x GPU layout and using the GPU package, it would be a better choice to run the long-range electrostatics on the CPU while computing the real-space coulomb on the GPU, and one can tweak the real-space coulomb cutoff so that it reaches the optimal balance. this can often result in different optimal choices for the CPU and the GPU.
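
as an illustration of that split (a generic charged-system sketch, not the Fe input from this thread; the pair/kspace styles and the 10.0 cutoff are placeholders to tune):

package gpu 2
pair_style lj/cut/coul/long/gpu 10.0
kspace_style pppm 1.0e-4

using the /gpu pair style together with the plain (non-suffixed) pppm is what keeps the long-range part on the CPU while the real-space coulomb runs on the GPU.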

also, one has to factor in the cost of non-accelerated operations, e.g. all kinds of computes and fixes. for the KOKKOS package one has to keep in mind that it only supports double precision, and that the extra transfer of data between host and GPU for fixes/computes not ported to KOKKOS can make it more costly than the GPU package, where ideally only position data is sent to the GPU and force data returned.

finding optimal choices here is non-trivial and requires spending some time understanding what the many options mean and experimenting extensively with specific settings to learn where a particular simulation input can benefit most from acceleration. there are no simple “do this, not that” rules here.

axel.

If you only have one GPU, I don’t think it makes sense to set g to 2? Unless the K80 is seen as two GPUs instead of one, I do not know.

Actually, the K80 is made of 2 K40s. I have been using the GPU package for ~2 years, and I must set g 2 if I want both to be exploited. It is seen as 2 GPUs.
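
You can confirm this from the driver side: nvidia-smi -L lists each half of the board as a separate device, e.g. (output abbreviated):

nvidia-smi -L
GPU 0: Tesla K80 (UUID: ...)
GPU 1: Tesla K80 (UUID: ...)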

For a simple LJ liquid I get about a factor of 13 speedup with a GTX 1070 over a single core, but this typically only holds for pretty large systems (in my case 64000 atoms). How many atoms do you have?

My system has 524288 atoms.

Furthermore, you can do some tweaking with the package command. I am sure other people (Stan and/or Axel) will have more useful things to say.

I was digging in the mailing list and tried some of the suggestions about tweaking the package command, but nothing helped.

Btw, I just read one of your posts in the mailing list and saw the best-practice file you attached. According to the guide you posted, when you build LAMMPS with KOKKOS, you do:

make yes-gpu
make yes-kokkos
make yes-USER-CUDA
make yes-USER-CG-CMM
make kokkos_cuda

Is it necessary to do make yes-gpu when you build LAMMPS with the KOKKOS package? I did not.

Christophe

there are many factors that affect GPU acceleration:

  • hardware choices, e.g. balance between GPU and CPU, mainboard design, choice of CPU and memory speed, GPU layout.
  • implementation (e.g. KOKKOS vs. GPU package), compilation settings, precision selection (single vs. mixed vs. double precision)

I tried this one this morning. It only works with DOUBLE_DOUBLE. If I use SINGLE_SINGLE or SINGLE_DOUBLE, LAMMPS (17Nov16) complains and crashes (K80):

Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.
Initializing Devices 0-1 on core 1...Done.
Initializing Devices 0-1 on core 2...Done.
Initializing Devices 0-1 on core 3...Done.
Initializing Devices 0-1 on core 4...Done.
Initializing Devices 0-1 on core 5...Done.
Initializing Devices 0-1 on core 6...Done.
Initializing Devices 0-1 on core 7...Done.
Initializing Devices 0-1 on core 8...Done.
Initializing Devices 0-1 on core 9...Done.
Initializing Devices 0-1 on core 10...Done.
Initializing Devices 0-1 on core 11...Done.

Neighbor list info ...
  0 neighbor list requests
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 7.3
  ghost atom cutoff = 7.3
  binsize = 3.65, bins = 51 51 51
Setting up Verlet run ...
  Unit style : metal
  Current step : 0
  Time step : 0.001
Cuda driver error 700 in call at file 'geryon/nvd_timer.h' in line 76.
[cli_12]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 12
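
For context: with the GPU package the precision is fixed when lib/gpu is compiled, so each of these tests is a separate library build. A sketch, assuming the stock lib/gpu Makefile variable:

cd lammps/lib/gpu
# in the machine Makefile, e.g. Makefile.linux:
#   CUDA_PRECISION = -D_SINGLE_DOUBLE   # or -D_SINGLE_SINGLE / -D_DOUBLE_DOUBLE
make -f Makefile.linux

followed by relinking LAMMPS itself.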

  • run time settings, balance between tasks that can run on the GPU and those that must run on the CPU, amount of data to be moved between host and GPU
  • nature of the problem at hand

for all of this the “weakest link” principle applies, i.e. if all of them fit well, you’ll see good GPU acceleration; if one or more doesn’t fit so well, you may see quite significant degradation of the GPU acceleration. also, one has to scale one’s expectations of the acceleration: when running a K80 with 24 CPU cores, you have 12 cores/GPU, so a speedup of 3 actually means 3x for each CPU core. OTOH, with the price of a single K80 being similar to (if not larger than) that of a reasonably equipped 24-core node, it is not a great deal whether you get 3x vs 2x speed, considering that GPU acceleration can depend strongly on the system at hand and the choice of input settings. this is the context in which my comment about “moderate GPU acceleration” has to be seen. there are scenarios where the GPU acceleration can be much larger.

e.g. for the 24-core CPU + 2x GPU layout and using the GPU package, it would be a better choice to run the long-range electrostatics on the CPU while computing the real-space coulomb on the GPU, and one can tweak the real-space coulomb cutoff so that it reaches the optimal balance. this can often result in different optimal choices for the CPU and the GPU.

also, one has to factor in the cost of non-accelerated operations, e.g. all kinds of computes and fixes. for the KOKKOS package one has to keep in mind that it only supports double precision, and that the extra transfer of data between host and GPU for fixes/computes not ported to KOKKOS can make it more costly than the GPU package, where ideally only position data is sent to the GPU and force data returned.

Actually, when I use the KOKKOS package, I get the following warning:

WARNING: Fixes cannot send data in Kokkos communication, switching to classic communication (…/comm_kokkos.cpp:365)
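
A crude way to check which fixes have KOKKOS versions, given a source checkout, is to look at what exists in the package directory:

ls src/KOKKOS/fix_*_kokkos.h

Any fix used in the input that is not on that list is a likely trigger for this fallback.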

finding optimal choices here is non-trivial and requires spending some time understanding what the many options mean and experimenting extensively with specific settings to learn where a particular simulation input can benefit most from acceleration. there are no simple “do this, not that” rules here.

Clearly, it depends on many factors.
Christophe

I tried this one this morning. It only works with DOUBLE_DOUBLE. If I use SINGLE_SINGLE or SINGLE_DOUBLE, LAMMPS (17Nov16) complains and crashes (K80):

Cuda driver error 700 in call at file 'geryon/nvd_timer.h' in line 76.

using single or mixed precision requires a simulation where the forces remain representable in single precision math. it is quite likely that you have an overflow here: single precision tops out around 3.4e38 and carries only ~7 significant digits, limits that the extreme close encounters in a high-energy cascade can exceed locally. single precision calculations are much more sensitive to such extreme local situations.

Actually, when I use the KOKKOS package, I get the following warning:

WARNING: Fixes cannot send data in Kokkos communication, switching to classic communication (../comm_kokkos.cpp:365)

so you are likely using some kind of fix that hasn't been, or cannot be, properly ported to KOKKOS.

Clearly, it depends on many factors.

...and in particular, you must not depend on LAMMPS making the best choice for you. with the many degrees of freedom on the hardware side, and the complexity of the calculations that LAMMPS supports, this is essentially impossible to do automatically.

bottom line: if you have the optimal settings, scientific problem, and hardware, GPUs can run much faster than CPUs; if there are issues with any of these, things will not work so well.

axel.

--

Dr. Axel Kohlmeyer [email protected] http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.