Best performance of LAMMPS with DPD using OMP and GPU packages

Dear LAMMPS users,

I was trying to achieve the best performance of LAMMPS with the dpd
pair style using the OMP and GPU packages. Below are my results and
how I obtained them. I would be very thankful if you could provide
your comments/suggestions on how optimal my simulations are and how
they can be improved further.

Machine: Cray XC30 system, 8-core 64-bit Intel Sandy Bridge CPU (Intel®
Xeon® E5-2670), NVIDIA Tesla K20X with 6 GB GDDR5 memory.

I ran a dpd fluid for 50000 timesteps (see the input scripts at the
end of this message).

The best performance with the OMP package on one node was achieved
with 8 MPI tasks and 2 OpenMP threads each (hyperthreading): 33m32s.
The best time with the GPU package was 18m21s. It looks suspicious
that the speedup for the GPU was only 1.8x.

The way we call aprun for OMP:
export OMP_NUM_THREADS=2
time aprun -n 8 -N 8 -d $OMP_NUM_THREADS -j 2 ./lmp-omp < in.water-cpu
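
As an aside, an equivalent run can be set up without the explicit
/omp suffixes in the input script by using the command-line suffix
switch (a sketch, assuming the binary was built with the USER-OMP
package; the thread count then comes from OMP_NUM_THREADS or a
package omp command):

time aprun -n 8 -N 8 -d $OMP_NUM_THREADS -j 2 ./lmp-omp -sf omp < in.water-cpu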

Compiler optimizations:
For nvcc: -O3 -code=sm_35 -Xptxas --use_fast_math
For gcc: -O3 -mavx -mtune=native (-fopenmp)
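
For completeness, the GPU library precision is fixed at build time in
lib/gpu/Makefile.*; a sketch of the relevant settings for this card
(variable names as in the stock LAMMPS makefiles, single precision
assumed since that is what is benchmarked here):

CUDA_ARCH = -arch=sm_35            # Kepler K20X
CUDA_PRECISION = -D_SINGLE_SINGLE  # all single precision; _SINGLE_DOUBLE (mixed) or _DOUBLE_DOUBLE are the alternatives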

OMP lammps script:
package omp 2
boundary p p p

units lj
atom_style atomic

lattice custom 3.0 a1 1.0 0.0 0.0 a2 0.0 1.0 0.0 a3 0.0 0.0 1.0 &
     basis 0.5 0.0 0.0 basis 0.0 0.5 0.0 basis 0.0 0.0 0.5

region box block -24.0 24.0 -24.0 24.0 -24.0 24.0

create_box 1 box
create_atoms 1 random 442368 1234 box
mass 1 1.0

neighbor 0.3 bin
neigh_modify delay 0 every 4 check yes

comm_style brick
comm_modify vel yes

pair_style dpd/omp 0.0945 1.0 34387
pair_coeff 1 1 100.0 45.0 1.0

thermo 10000
timestep 0.001
fix 1 all nve/omp
run 50000

GPU lammps script:
package gpu 1 device kepler
boundary p p p

units lj
atom_style atomic

lattice custom 3.0 a1 1.0 0.0 0.0 a2 0.0 1.0 0.0 a3 0.0 0.0 1.0 &
     basis 0.5 0.0 0.0 basis 0.0 0.5 0.0 basis 0.0 0.0 0.5

region box block -24.0 24.0 -24.0 24.0 -24.0 24.0

create_box 1 box
create_atoms 1 random 442368 1234 box
mass 1 1.0

neighbor 0.3 bin
neigh_modify delay 0 every 4 check yes

comm_style brick
comm_modify vel yes

pair_style dpd/gpu 0.0945 1.0 34387

pair_coeff 1 1 100.0 45.0 1.0
thermo 10000
timestep 0.001

fix 1 all nve
run 50000

> The best time with the GPU package was 18m21s. It looks suspicious
> that the speedup for the GPU was only 1.8x.

What is suspicious about that? What kind of speedup did you expect?

You are sharing your GPU with 8(!) CPU cores, so the single GPU is
giving you 8 times the 1.8x speedup.
You have a cutoff of 1 sigma, so there are not many members in the
neighbor list, i.e. you don't give the GPU a lot to work with.
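
A rough estimate, assuming the density implied by the input script
(442368 atoms in a box that comes out to 48^3 in LJ units, i.e.
rho = 4):

N_neigh ≈ (4/3) * pi * rc^3 * rho = (4/3) * pi * 1.0^3 * 4.0 ≈ 17

so only about 17 neighbors per atom within the cutoff, which is
little work per atom for a GPU.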

axel.

Thank you for the reply. Maybe I confused you, since I didn't mention
that for the GPU run I used only one CPU core:
aprun -n 1 -N 1 ../lmp_gpu < in.water-gpu

I expected a higher speedup for the following reason. The theoretical
peak floating-point performance per node (8 cores) for single
precision is 332 Gflops, and for the GPU it is 3900 Gflops. From
this I would expect the upper limit for the speedup to be ~11.7. At
the same time, the memory bandwidth for the CPU is ~40 GB/s and for
the GPU ~155 GB/s, so the lower limit should be around 3.9.
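
Spelled out, using these numbers:

upper bound (flops ratio):     3900 / 332 ≈ 11.7
lower bound (bandwidth ratio):  155 / 40  ≈ 3.9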

How would you suggest changing the LAMMPS script to load the GPU more
(and thus observe a higher relative speedup)? Do you suggest
increasing rc from 1.0 to, let's say, 1.5, or are there other ways as
well?
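
For reference, the change I have in mind would be (just a sketch;
1.5 is illustrative, and a larger cutoff changes the DPD model
itself, so the coefficients would normally be re-parametrized):

pair_style dpd/gpu 0.0945 1.5 34387
pair_coeff 1 1 100.0 45.0 1.5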

How many atoms are you simulating? The input script in your first email generates 440K atoms. You could try increasing the number of MPI tasks per GPU, up to 8 for your node.

The optimal number of MPI tasks sharing the GPU depends on the number of atoms per node for a given model (dpd in this case).
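
For example, on the XC30 that might look like the following (a
sketch; sharing one GPU among several MPI ranks on Cray typically
requires the CUDA proxy/MPS server to be enabled, and the exact
variable name depends on the software stack):

export CRAY_CUDA_MPS=1
aprun -n 4 -N 4 ../lmp_gpu < in.water-gpu   # try 2, 4, 8 ranks per GPU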

-Trung

Thank you Trung: it turned out that the best performance is achieved
if I use 4 MPI processes with one GPU. The speedup is exactly 2x.
Yes, I use 440K atoms, as written in the script.
Am I right that this speedup is due to better utilisation of the
CPUs, or are there GPU-related reasons?

Hi Kirill,

yes, you can benefit from using multiple MPI tasks with the GPU package for that number of atoms per GPU. If you decrease the number of atoms per GPU, you will see a change in the optimal number of MPI tasks sharing the GPU.

Are you using mixed precision or double precision for the GPU package? You can also play with the parameter tpa (GPU threads per atom) (see package gpu’s doc page) to see if it can give you some more speedup.
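
For example (assuming the current package-command syntax; 8 threads
per atom is just a starting value to scan around):

package gpu 1 tpa 8 device kepler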

-Trung

I don't understand why sharing the GPU between MPI processes gives a
speedup, since it should only increase the communication. Just out of
curiosity, why does it happen?
I tried a different number of threads per atom; it doesn't give a
speedup, 8 threads is optimal. I use single precision. Summarising, I
think ~9m is the best performance for this system. That is a speedup
of around 3.66 compared with OMP LAMMPS, which is approximately the
ratio between the bandwidths of this CPU and GPU.

Hi Kirill,

when you increase the number of MPI processes, the communication overhead between the MPI processes increases, and the computation time is reduced due to spatial decomposition. As long as the reduction in computation time is greater than the increase in the communication overhead, you will get some speedup from using multiple MPI processes versus using a single process. As you already saw, using more than 4 MPI processes is slower than using 4 MPI processes. Or as I already suggested, you can decrease the number of atoms, keeping the same number of MPI processes, i.e. decreasing the number of atoms per MPI process, to see when the crossover happens.

Now, when you use the GPU package, there is additional overhead due to communication between the GPU and the MPI processes in every time step. This overhead sets the lower bound on the number of atoms per MPI process needed to expect some speedup. Using more than one MPI process to share the GPU can incur yet another overhead, depending on the GPU hardware, driver and toolkit. We had a discussion on this topic a while ago (http://lammps.sandia.gov/threads/msg47572.html).

The speedup you observe is specific to the given atom count per node (440K). I am not sure it would be the same for different atom counts per node.

Best,
-Trung