I would like to ask for your recommendations on achieving a good performance gain with LAMMPS on NVIDIA A100 GPUs. Currently, I am limited to 4 GPUs per node. I have tried both the latest LAMMPS release (LAMMPS/02Aug2023) and a pre-installed older version (LAMMPS/23Jun2022 with CUDA-11.4.1) on a cluster I have access to. I observe no significant performance gain with 2 or 4 GPUs compared to just one. I am benchmarking the 3d LJ melt from the LAMMPS examples folder (file name: in.melt), with box sizes ranging from a few thousand atoms to 63 million atoms. Could you tell me whether the commands I use for running LAMMPS with the GPU/KOKKOS packages are correct, or whether they are the reason I am not getting better performance with more GPUs?
srun lmp -in in.melt -sf gpu -pk gpu 4 neigh no newton off split -1.0 # for GPU package
mpirun -np 4 --oversubscribe --use-hwthread-cpus --map-by hwthread lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off gpu/aware on # for KOKKOS package
The GPU package requires at least one MPI process per GPU. So when running in serial - as you do in this command line - there is no benefit to telling LAMMPS that you have 4 GPUs; it will still use only 1 GPU. Thus you need at least 4 MPI processes to use 4 GPUs when you have requested just one node. You can also attach 8 MPI processes to the 4 GPUs (two per GPU), as this gives you MPI parallelization for the non-accelerated parts of the code and increases GPU utilization. There is a point, however, where adding more MPI processes stops helping. Also, why do you use "neigh no"? For pair style lj/cut, "neigh yes" should be faster, and CPU/GPU load balancing (the "split" option) rarely leads to an improvement.
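To make this concrete, a corrected GPU-package launch could look like the following sketch (the srun flags for allocating the GPUs themselves are cluster-specific and not shown here; these lines only illustrate the rank counts and package options discussed above):
srun -n 4 lmp -in in.melt -sf gpu -pk gpu 4 neigh yes # 4 MPI ranks, one per GPU
srun -n 8 lmp -in in.melt -sf gpu -pk gpu 4 neigh yes # 8 MPI ranks, two per GPU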
You should definitely get a speedup with Kokkos and multiple A100 GPUs at 63 million atoms. Can you post log files for 1 vs 4 GPUs? I would also bind to core rather than hwthread, something like: mpiexec -np 4 --bind-to core
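For example, a full Kokkos command line along those lines could look like this sketch (assuming one MPI rank per GPU; on GPUs the KOKKOS package already defaults to full neighbor lists, newton off, and GPU-aware MPI, so those -pk options can be omitted):
mpiexec -np 4 --bind-to core lmp -in in.melt -k on g 4 -sf kk # 4 MPI ranks, one per GPU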
Dear Axel,
Thank you very much for your suggestions. I re-ran the simulations following your latest comments and got a two- to three-fold performance gain compared to what I was getting before. Also, I am finally observing better performance with 2 GPUs than with 1.
In both cases you are only running on a single MPI rank and therefore only using 1 GPU, even though you requested 4 GPUs. You can see that from these lines:
1 by 1 by 1 MPI processor grid
Loop time of 1067.6 on 1 procs for 10000 steps with 62500000 atoms
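As an independent check (an assumption about your environment, not something visible in these logs), you can watch per-GPU utilization on the compute node while the job runs; if only one of the four A100s shows activity, only one rank is driving the GPUs:
nvidia-smi -l 1 # print GPU utilization, refreshed every second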