Increasing LAMMPS performance for a small number of atoms

I am trying to optimize a relatively small gas/membrane simulation in LAMMPS. System:

  • 500 rigid N2 molecules

  • long 1D-like box: 2500 x 50 x 50 Å

  • electrostatics with PPPM

  • GPU available

  • using fix rigid/nvt

Current launch:

srun --ntasks=2 --cpus-per-task=1 lmp -sf gpu -pk gpu 1 pair/only on -in system.in

My current performance is only about 1 ns/day, which seems very low for such a small system.

Does anyone have recommendations for the optimal MPI/OpenMP layout or PPPM settings for such a system? The docs suggest that PPPM might be better computed on the CPU, but even running the entire simulation on the CPU gave only marginal performance gains. I also tried different configurations (more tasks or more cpus-per-task), but again saw only small gains.

Thanks!

With only 500 molecules on the GPU, there is very little gain from GPU acceleration. In addition, you have a very large box. Your calculation is likely dominated by the time spent in KSpace, since the cost of PPPM scales with the volume times the number of particles.
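To make the volume argument concrete, here is a rough back-of-the-envelope sketch in Python. The 1 Å mesh spacing and the two sites per molecule are assumptions for illustration only; LAMMPS chooses the actual PPPM grid from the requested accuracy and the cutoff and reports it in the log.

```python
# Rough estimate of PPPM mesh size versus particle count for the box in the question.
box = (2500.0, 50.0, 50.0)   # box lengths in Å, from the question
n_molecules = 500            # from the question
sites_per_molecule = 2       # assumption: use 3 for a three-site N2 model
grid_spacing = 1.0           # assumption: hypothetical mesh spacing in Å

n_sites = n_molecules * sites_per_molecule
grid = tuple(round(length / grid_spacing) for length in box)
n_grid_points = grid[0] * grid[1] * grid[2]

print(f"grid: {grid[0]} x {grid[1]} x {grid[2]} = {n_grid_points:,} points")
print(f"mesh points per site: {n_grid_points / n_sites:,.0f}")
```

Under these assumptions the mesh has thousands of points per interaction site, so the FFT work is set by the empty volume rather than by the molecules.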

I would first try CPU-only and increase the real-space cutoff to reduce the work required in KSpace. There should be an optimum cutoff. A larger cutoff also increases the number of work units (i.e. pairs) and thus makes GPU acceleration more efficient. I would try with a single GPU to avoid extra cost due to MPI communication.
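A quick estimate shows how little real-space work there is to begin with. The sketch below assumes a uniform gas with two sites per molecule and ignores the membrane, so the numbers are only indicative.

```python
import math

volume = 2500.0 * 50.0 * 50.0          # box volume in Å^3, from the question
n_sites = 500 * 2                      # assumption: 2 sites per N2 molecule
density = n_sites / volume             # sites per Å^3

def neighbors_within(cutoff):
    """Average number of sites inside a sphere of radius `cutoff` (Å) for a uniform gas."""
    return density * 4.0 / 3.0 * math.pi * cutoff ** 3

# Cutoff values are arbitrary examples, not recommendations.
for cutoff in (10.0, 15.0, 20.0, 25.0):
    print(f"cutoff {cutoff:4.1f} Å: ~{neighbors_within(cutoff):5.2f} neighbors per site")
```

With well under one neighbor per site at a 10 Å cutoff, the pair cost stays small even for much larger cutoffs. The actual optimum still has to be found by timing runs and comparing the Pair and Kspace rows of the LAMMPS timing breakdown.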

Another possible optimization could be to use a rigid/small fix instead of a plain rigid fix. But that will have the most impact with multiple MPI ranks. When using multiple MPI ranks, you should use the processors command to control how to divide the cell into subdomains to minimize load imbalances.
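As a minimal sketch of why the decomposition matters for this elongated box, the following compares a few processor grids by subdomain shape and by the area of the internal cut planes. It only looks at geometry and ignores periodic images; which layout balances the load best depends on where the membrane and the gas actually sit. The grids shown are hypothetical examples, each of which would correspond to a line such as processors 4 1 1 in the input script.

```python
box = (2500.0, 50.0, 50.0)  # box lengths in Å, from the question

def subdomain(procs):
    """Edge lengths of one subdomain for a uniform (px, py, pz) processor grid."""
    return tuple(length / p for length, p in zip(box, procs))

def cut_area(procs):
    """Total area (Å^2) of the internal cut planes created by the processor grid."""
    lx, ly, lz = box
    px, py, pz = procs
    return (px - 1) * ly * lz + (py - 1) * lx * lz + (pz - 1) * lx * ly

# Hypothetical 4-rank layouts, chosen only for comparison.
for procs in ((4, 1, 1), (2, 2, 1), (1, 2, 2)):
    print(procs, subdomain(procs), f"cut area = {cut_area(procs):,.0f} Å^2")
```

Slicing only along the long axis keeps the cut planes small (50 x 50 Å each), while any cut along y or z creates a 2500 Å long interface.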


@wagner_muller the amount of host-device data transfer that fix rigid/nvt needs at every timestep is most likely the bottleneck keeping the GPU idle.

please try fix rigid/small/nvt/kk from pull request #4971 on GitHub (lammps/lammps, "KOKKOS: TIP4P" by alphataubio), with cuFFT for PPPM in KOKKOS and KOKKOS_PREC set to single, mixed, or double.

since most of my KOKKOS kernels thread over nlocal_body and you only have 500 rigid N2 molecules, you’re still a long way from saturating the 6912 CUDA cores of an A100 or the 16896 of an H100.
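The arithmetic behind that statement, as a minimal sketch. It assumes the simplistic picture of one rigid body per thread and one thread per CUDA core, which is not how a GPU schedules work in detail but gives the right order of magnitude.

```python
n_bodies = 500  # rigid N2 molecules, from the question
cuda_cores = {"A100": 6912, "H100": 16896}  # core counts quoted in the post

# Fraction of cores that would have a body to work on (assumption: one body per core).
occupancy = {gpu: n_bodies / cores for gpu, cores in cuda_cores.items()}

for gpu, fraction in occupancy.items():
    print(f"{gpu}: one body per thread would occupy {fraction:.1%} of {cuda_cores[gpu]} cores")
```

That is roughly 7% of an A100 and 3% of an H100, so kernels that loop over bodies leave most of the device without work.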

splitting across multiple GPUs will be much worse because of the MPI pack/unpack communication overhead

if fix rigid/small/nvt/kk can be run asynchronously in parallel with pppm you might get a little more parallelism.
