KOKKOS: multiple simulations (GPU and/or CPU?)

I will present my problem and I hope you can help me. I am new to LAMMPS and I am trying to understand how the KOKKOS package works. I am modeling a system that contains, among other oxides, (15-x)N2O - (x)K2O, with x varying between 0 and 15.

The hardware I have at the moment

  • 1 NVIDIA A100 GPU
  • 1 Intel® Xeon® Gold 6240R Processor (35.75 MB cache, 2.40 GHz)
  • 4 NVIDIA RTX 6000 ADA GPUs (will be active soon)

I want to explore the possibility of simultaneously launching simulations for x = 1,3,5,7,…,15. Each simulation will have between 10000 and 30000 atoms. My question is general: what strategy is recommended in this type of situation to fully exploit the hardware I have and minimize the execution time? Is it advisable to run, for example, 4 or 8 simulations simultaneously on the GPU? Or, on the contrary, is it advisable to run them sequentially on the CPU, one by one, for this number of atoms per simulation?

Additionally, if there are more atoms per simulation, and always considering the limitations of the GPU memory, is there an optimal number of simultaneous simulations in a single run? The answer is probably that it depends on the hardware, but my question is more about best practices than about hardware specifics. I appreciate the help and goodwill of the community.

Your scenario is quite unusual, so I am not aware of any predetermined answers for it.
However, what you are asking can easily be determined empirically. The optimal solution will likely depend on the specifics of your input and the styles you are using (and whether they all support KOKKOS). How well you can use the different GPUs depends on how well they support double-precision floating-point math; the KOKKOS package uses it exclusively.

You have two ways of running multiple simulations: a) run separate commands in combination with the CUDA_VISIBLE_DEVICES environment variable, or b) use the multi-partition feature in LAMMPS. I would also compare single-GPU performance against a share of CPU cores (via MPI, OpenMP, or both).
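As a minimal sketch of both approaches, assuming a KOKKOS-enabled `lmp` binary on the PATH and hypothetical input file names (`in.x1`, `in.x3`, `in.glass` are placeholders; adjust to your setup):

```shell
# Option a) independent LAMMPS processes, each pinned to one GPU
# via CUDA_VISIBLE_DEVICES (useful once the 4 RTX 6000 ADA cards arrive).
CUDA_VISIBLE_DEVICES=0 lmp -k on g 1 -sf kk -in in.x1 -log log.x1 &
CUDA_VISIBLE_DEVICES=1 lmp -k on g 1 -sf kk -in in.x3 -log log.x3 &
wait

# Option b) one LAMMPS command split into 8 partitions of 1 MPI rank each,
# all sharing a single GPU ("g 1"). Inside the input script, a world-style
# variable can select the composition per partition, e.g.:
#   variable x world 1 3 5 7 9 11 13 15
mpirun -np 8 lmp -partition 8x1 -k on g 1 -sf kk -in in.glass
```

With option b), each partition writes its own log and screen file (log.lammps.0, log.lammps.1, …), which makes it easy to compare the runs afterwards.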

Again, that depends on your specific simulation settings, pair style, and cutoff. The main memory consumer in MD simulations is the neighbor list, which depends on the cutoff.

Oh, and if you plan to use the same GPU with multiple processes, consider using the NVIDIA Multi-Process Service (MPS). Some more hints for optimizing LAMMPS performance are in the manual: 7. Accelerate performance — LAMMPS documentation
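For reference, starting and stopping the MPS control daemon typically looks like the sketch below; exact permissions and pipe directories depend on your system configuration:

```shell
# Start the MPS control daemon; subsequent CUDA processes started by this
# user on this node will then share the GPU through the MPS server.
nvidia-cuda-mps-control -d

# ... launch your concurrent LAMMPS runs here ...

# Shut the daemon down when you are done.
echo quit | nvidia-cuda-mps-control
```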


Thank you very much (once again) @akohlmey. Indeed, with MPS not only do the runs show a significant speedup, but the GPU utilization also dropped from ~95% to ~50%. Additionally, thanks to your answer I understood the need to run benchmarks for this kind of uncommon scheme. I do not know whether I should share such benchmarks with the community, but if so, please let me know how to do it.

You could also look into CUDA MIG: NVIDIA Multi-Instance GPU User Guide r560.
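As a rough sketch, enabling MIG on an A100 and carving it into instances might look like the following; the available profile IDs vary by driver version, so check them with `nvidia-smi mig -lgip` before creating anything:

```shell
# Enable MIG mode on GPU 0 (requires admin rights; may need a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the GPU-instance profiles this driver/GPU combination supports.
nvidia-smi mig -lgip

# Example: create two GPU instances, each with a compute instance ("-C").
# Profile ID 9 is only a placeholder here; pick one from the -lgip output.
sudo nvidia-smi mig -cgi 9,9 -C

# Each instance then shows up as a separate device (see "nvidia-smi -L")
# and can be targeted via CUDA_VISIBLE_DEVICES with its MIG UUID.
```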
