Device time info with KOKKOS-GPU and performance issue with KOKKOS

OK, I will do that. I am sending the message again to the mailing list.

In both cases (GPU and Kokkos-GPU) I am using 4 GPUs and 24 MPI processes per node, with OMP_NUM_THREADS=1.

For the Kokkos package, it is usually best to run only 1 MPI rank per GPU. (For K80, this means 1 MPI rank per logical GPU, i.e. 2 MPI ranks per physical K80 card.) This is because Kokkos tries to run everything on the GPU, including the integrator and other fixes/computes. In contrast, the GPU package only runs the pair style and a few other calculations on the GPU, while the integrator and other fixes and computes are done on the host CPU, so it is normally better to run multiple MPI ranks per GPU with the GPU package. Are you using CUDA MPS (Multi-Process Service)? If you are seeing a speedup with Kokkos using more than 1 MPI rank per GPU on a simple LJ test case, then you are probably doing something wrong, i.e. not using the GPUs at all.
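
As a rough sketch of what I mean (the executable and input names below are placeholders patterned on the commands later in this thread; adjust for your system):

# optional: start the CUDA Multi-Process Service daemon; only useful when
# running more than 1 MPI rank per GPU, e.g. with the GPU package
nvidia-cuda-mps-control -d

# Kokkos package, 1 MPI rank per logical GPU: 4 ranks driving the 4 logical GPUs of two K80 cards
mpiexec -np 4 --bind-to core ./lmp_kokkos_cuda_mpi -in in.test -k on g 4 -sf kk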

Unfortunately the Kokkos package doesn’t yet have device timing output like the GPU package does. As Axel mentioned, you are using mixed precision for the GPU package and double precision for Kokkos, so that will explain some of the performance difference. I’m happy to look into this further; can you post the timing breakdown?
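
For context, and only as a sketch of how that difference typically arises (the build flags below are an assumption about a CMake build, not taken from your setup): the GPU package precision is fixed at compile time, while Kokkos pair styles always compute in double precision.

# GPU package built with mixed precision (source directory path is a placeholder)
cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_PREC=mixed ../cmake

# Kokkos/CUDA build; its pair styles compute in double precision
cmake -D PKG_KOKKOS=on -D Kokkos_ENABLE_CUDA=on ../cmake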

I ran your test case on K80+Power8. Using newton off and cuda/aware on gives the best performance for Kokkos. It looks like most of the difference is due to mixed vs. double precision in the pair style:


mpiexec -np 4 --bind-to core ~/lammps_master/src/lmp_kokkos_cuda_mpi -in in.test -k on g 4 -sf kk -pk kokkos newton on neigh full comm device cuda/aware off

Loop time of 76.6201 on 4 procs for 500 steps with 10976000 atoms

mpiexec -np 4 --bind-to core ~/lammps_master/src/lmp_kokkos_cuda_mpi -in in.test -k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware off

Loop time of 62.1825 on 4 procs for 500 steps with 10976000 atoms

mpiexec -np 4 --bind-to core ~/lammps_master/src/lmp_kokkos_cuda_mpi -in in.test -k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware on

Loop time of 46.7088 on 4 procs for 500 steps with 10976000 atoms

MPI task timing breakdown:

Section | min time | avg time | max time |%varavg| %total

Hi Stan,
Thanks for your reply and sorry for not answering earlier. I was away for a week.

This is alarming to me:

If you are seeing a speedup with Kokkos using more than 1 MPI rank per GPU on a simple LJ test case, then you are probably doing something wrong, i.e. not using the GPUs at all.

YES, I am getting a speedup when using multiple processes per GPU.

This LJ problem takes 228 s on a CPU-only node and 67 s with the GPU package (double precision). I tried various options for Kokkos-GPU.
The fastest was 4 GPUs/24 processes with neigh full, newton off, comm device, cuda/aware off (75 s); roughly the launch line I used is sketched below.
The timing for 4 GPUs/4 processes with the same settings (neigh full, newton off, comm device, cuda/aware off) was 228 s, i.e. no acceleration at all!
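
For reference, the fastest run corresponds roughly to a launch line like this (the binary path is a placeholder for the one in my job script):

mpiexec -np 24 --bind-to core ./lmp_kokkos_cuda_mpi -in in.lj -k on g 4 -sf kk -pk kokkos neigh full newton off comm device cuda/aware off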

I can send you my inputs and outputs later, since the cluster is currently down for maintenance.

Anyway, how do I know whether the GPUs are actually being used?
In the output, I can see the following line:

If you’re running LAMMPS with a GPU on a Linux workstation, you can use the “nvidia-smi” command in conjunction with the “watch” command to see whether LAMMPS is running on the GPU. Maybe you could talk with your cluster’s administrator about this?
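
For example, something like:

# refresh the nvidia-smi utilization/process listing every 2 seconds
watch -n 2 nvidia-smi

If the GPU utilization stays near zero and no LAMMPS process shows up in the listing while the job runs, the GPUs are not being used.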

Will

Hi Stan,
I am replying to this thread after a long time, having revisited the issue very recently.

I want to comment on a few observations as follows:

To my surprise, I got better timing

This is surprising to me as well. Do you know what part of the computation got faster, e.g. pair_style, comm, etc.?

Could this gain with 24 MPI tasks be due to Kokkos OpenMP+GPU mixing somehow?

No, in this case you were still only using 1 OpenMP thread. See the section “Using OpenMP threading and CUDA together” here: https://lammps.sandia.gov/doc/Speed_kokkos.html for more info.
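
For reference, actually combining OpenMP threads with CUDA in Kokkos requires requesting the threads explicitly, along the lines of the sketch below (4 ranks with 6 threads each is only an illustrative choice, not what your runs did):

# request 6 OpenMP host threads per MPI rank in addition to the 4 GPUs
export OMP_NUM_THREADS=6
mpiexec -np 4 --bind-to socket ./lmp_kokkos_cuda_mpi -in in.lj -k on g 4 t 6 -sf kk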

Stan

To my surprise, I got better timing

Can you run with “export CUDA_LAUNCH_BLOCKING=1” (assuming bash, or however you set an environment variable in your shell)? That will give an accurate timing breakdown on the GPU.
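
For example, in a bash job script it would look something like this (the launch line itself is just a placeholder):

# make CUDA kernel launches synchronous so host-side timers attribute
# GPU time to the correct sections of the breakdown
export CUDA_LAUNCH_BLOCKING=1
mpiexec -np 4 --bind-to core ./lmp_kokkos_cuda_mpi -in in.lj -k on g 4 -sf kk -pk kokkos neigh full newton off comm device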

Thanks,

Stan

Hi Stan,
Thanks for coming back to me.
I will send you the output with your suggested modifications, but that will only be on 15 June.
Right now access to the Jureca HPC system is restricted for security reasons.
Thanks,
Prithwish

Hi Stan,
As you advised, I ran those jobs again with “export CUDA_LAUNCH_BLOCKING=1”.

There is a change in the timing breakdown table.
The 4 GPU/24 proc run is still faster than the 4 GPU/4 proc run for Kokkos-GPU.

The output files (logs), the input, and the job submission script are attached.
Any insight is highly appreciated.

Best regards,
Prithwish

4gpu24proc.log.lammps (4.03 KB)

in.lj (587 Bytes)

4gpu4proc.log.lammps (4.04 KB)

run.sh (544 Bytes)