Cores and GPUs with Kokkos and CUDA

Am I correct in my reading of the LAMMPS documentation that if I build LAMMPS with KOKKOS device=CUDA on my 36 core, 2 GPU system I can only run 2 MPI processes with my 2 GPUs, i.e. one core per GPU? The other 34 cores are not allowed to access the GPUs for that LAMMPS run?

Thanks.

Jim

Yes, it is required to use only one CPU core per GPU with the Kokkos package.

No, you can definitely use more than 1 CPU core per GPU with Kokkos (I tried it today), and I did get a speedup when a fix is running in non-threaded mode on the host CPU. But if everything is running with Kokkos on the GPU, it will probably slow down your simulation.

Stan

Stan,

np=2, g 2: wall time = 3:39

np=20, g 2: wall time = 44:39

using in.lj with 100,000 steps

mpirun -np 20 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj
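The np=2 case was the same command with only the rank count changed, i.e. something like:

mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj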

Jim

Jim, if you are serious about this, you would also need to use CUDA MPS: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf. However, I really don’t think it will help you with ReaxFF unless you have some non-Kokkos fixes that take up a significant portion of the simulation time. I also don’t think running part of your ReaxFF simulation on the GPU and part on the CPU using OpenMP threading makes sense; you’d probably be better off just using all GPU or all CPU.

However, I have been experimenting with a different approach that uses 1 MPI per GPU but also allows you to add additional MPI tasks without GPUs to the same domain decomposition. It requires precise load balancing to get the right domain sizes for each MPI task and that still needs additional development in LAMMPS. I will keep you posted on what I find.

Stan

Stan,

Thanks for the pointer to MPS. Given that the np=2, g 2 case reaches 95% utilization of both GPUs (per nvidia-smi), I cannot see how MPS would help. How would I invoke MPS with LAMMPS to give it a try?

Also, in the np=20, g 2 case it was interesting to see the 20 processes spawned on the GPUs, 10 per GPU. I assume the dramatic increase in wall time was due to the overhead of transferring data back and forth between the CPUs and GPUs. I didn’t do any thread locking, but I don’t think that would have made much difference, do you?

Note that with lmp_kokkos_mpi_only I get:

np=20: wall = 4:00

np=36: wall = 2:34
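i.e. the CPU-only runs were along the lines of:

mpirun -np 36 lmp_kokkos_mpi_only -in in.lj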

However, I have been experimenting with a different approach that uses 1 MPI per GPU … I will keep you posted on what I find.

That would be great. Thanks!

Jim

Jim, to follow up on an old thread:

How would I invoke MPS with LAMMPS to give it a try?

CUDA MPS is a daemon that needs to be started and running on the compute nodes. I’m not sure how to do this myself, but someone else is looking into it on one of our GPU testbeds. I’ll let you know how much of a difference it makes once we get it running.
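From the NVIDIA document linked earlier, the basic recipe appears to be roughly the following (I haven’t verified this myself):

# start the MPS control daemon on the node, once per node, before launching LAMMPS
export CUDA_VISIBLE_DEVICES=0,1
nvidia-cuda-mps-control -d

# then run LAMMPS as usual, e.g.
mpirun -np 20 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

# shut the daemon down when finished
echo quit | nvidia-cuda-mps-control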

However, I have been experimenting with a different approach that uses 1 MPI per GPU … I will keep you posted on what I find.

I tried using larger domains for the GPU tasks and smaller domains for the regular (CPU-only) MPI tasks. However, I ran into load-balancing issues, since the optimal domain sizes needed for the pair style and for fix qeq were different.
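If you want to experiment yourself, the weighted load balancing already in LAMMPS is the closest existing knob; the weights below are purely illustrative, not something I have tuned:

# rebalance so that ranks that finish faster (the GPU ranks) get larger subdomains,
# using per-processor timing data as the weight
balance 1.1 shift xyz 20 1.05 weight time 1.0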

Would you please give me an example command line for [using Kokkos OpenMP and CUDA at the same time]? Also, how would I modify an example, e.g. in.lj, to implement what you suggested?

This feature will be documented soon. Here are the basics. The suffix “/kk” is equivalent to “/kk/device”, and for Kokkos CUDA, using “-sf kk” on the command line gives you the default CUDA version of every style. However, if you explicitly add the “/kk/host” suffix to a specific style in your input script, you will instead get the Kokkos OpenMP CPU version of that style. Conversely, if you use “-sf kk/host” on the command line and then add the “/kk” or “/kk/device” suffix to a specific style in your input script, that style will run on the GPU while everything else runs on the CPU in OpenMP mode. I’ve attached an example that runs fix qeq with OpenMP on the host and pair reax/c/kk and everything else with CUDA on the GPU. The command to run with 1 GPU and 8 OpenMP threads is then:

mpiexec -np 1 --bind-to core ~/lammps_master/src/lmp_kokkos_cuda_openmpi -in in.reaxc.tatb -k on g 1 t 8 -sf kk -pk kokkos neigh half newton on

I don’t guarantee that this will be faster than running everything on the GPU, but it does allow you to use more CPU cores.
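The key lines of the attached input look roughly like this (a sketch based on the standard examples/reax tatb case, not a verbatim copy of the attachment, so the file names and qeq parameters are only indicative):

# ReaxFF TATB example, with fix qeq pinned to the OpenMP host path
units           real
atom_style      charge
read_data       data.tatb

pair_style      reax/c NULL              # -sf kk turns this into reax/c/kk on the GPU
pair_coeff      * * ffield.reax C H O N

fix             1 all nve                # also picks up the /kk (GPU) suffix
# explicit /kk/host suffix overrides -sf kk for this one style,
# so the charge equilibration runs with OpenMP on the host
fix             2 all qeq/reax/kk/host 1 0.0 10.0 1.0e-6 reax/c

timestep        0.0625
run             100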

Stan

in.reaxc.tatb (1.37 KB)

Stan,

Thank you.

Jim