LAMMPS does not run with KOKKOS enabled

It crashed with the following error:


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3403356 RUNNING AT mike181
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Tue Oct 25 16:45:54 CDT 2022

I can send my input files in a private message or I can email them to you. Would that be okay with you? Thanks!

Yes, that means your GPU-aware MPI is not working correctly. In that case, LAMMPS has to do the MPI buffer pack/unpack on the CPU and transfer a lot of data, which leads to overhead.

Either is fine with me, you can find my email here: https://lammps.org/authors.html.

Sent the files via email. Thank you very much for your kind help.

@m.adibi are you using mixed precision with the GPU package?

For the GPU package with double precision on a single A100, and 32 cores, I’m seeing:

Loop time of 58.9571 on 32 procs for 2000 steps with 432000 atoms
Performance: 29.309 ns/day, 0.819 hours/ns, 33.923 timesteps/s, 14.655 Matom-step/s

For Kokkos:

Loop time of 73.3546 on 1 procs for 2000 steps with 432000 atoms
Performance: 23.557 ns/day, 1.019 hours/ns, 27.265 timesteps/s, 11.778 Matom-step/s

Is this similar to what you are seeing?

I don't recall because I compiled my LAMMPS with the GPU package a long time ago. However, I checked the build commands, and it seems I compiled the GPU package with double precision. Here are the commands I used:

cmake -D GPU_ARCH=sm_70 -D GPU_API=cuda -D GPU_PREC=double -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_RIGID=on -D PKG_KSPACE=on -D PKG_GPU=on -D PKG_MISC=on ../cmake
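For comparison, a KOKKOS/CUDA build of LAMMPS would use a configure line along these lines (a sketch, not your exact build: Kokkos_ARCH_VOLTA70 matches the sm_70 target above, and available options can vary by LAMMPS version):

```shell
# Hypothetical KOKKOS/CUDA configure line mirroring the GPU-package build above.
# Kokkos_ARCH_VOLTA70 corresponds to GPU_ARCH=sm_70 (Volta-class GPUs).
cmake -D PKG_KOKKOS=on \
      -D Kokkos_ENABLE_CUDA=on \
      -D Kokkos_ARCH_VOLTA70=on \
      -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_RIGID=on \
      -D PKG_KSPACE=on -D PKG_MISC=on \
      ../cmake
```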

This is what I get:

for GPU package:

Loop time of 76.2756 on 32 procs for 2000 steps with 432000 atoms

Performance: 22.655 ns/day, 1.059 hours/ns, 26.221 timesteps/s
99.8% CPU use with 32 MPI tasks x 1 OpenMP threads

for KOKKOS:

Loop time of 72.3745 on 1 procs for 2000 steps with 432000 atoms

Performance: 23.876 ns/day, 1.005 hours/ns, 27.634 timesteps/s

99.7% CPU use with 1 MPI tasks x 1 OpenMP threads

So, as I understand it, the KOKKOS package does not work very well for me if I set the number of GPUs to more than one, right? And is this because I have not compiled with GPU-aware support?

GPU-aware support really depends on your MPI library; nothing in LAMMPS needs to be compiled differently. If you are using MVAPICH2, then export MV2_USE_CUDA=1 should be sufficient. You should get a warning at the top of the LAMMPS log file when LAMMPS turns GPU-aware MPI off (to avoid a segfault). Can you post that warning here?
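As a sketch of what this looks like in a batch script (the executable name lmp, the launcher mpirun, and the input file name are assumptions; adjust them to your setup):

```shell
# Hypothetical job-script fragment: enable CUDA support in MVAPICH2,
# then run LAMMPS with the KOKKOS package on 2 GPUs with 2 MPI ranks.
export MV2_USE_CUDA=1
mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos gpu/aware on -in in.lammps
```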

Just to make sure I understand what you mean:
I need to run with 2 processors and 2 GPUs, export MV2_USE_CUDA=1, and also use gpu/aware on in the batch file to generate the warnings?

I only added export MV2_USE_CUDA=1 and received the following warnings:

WARNING: When using a single thread, the Kokkos Serial backend (i.e. Makefile.kokkos_mpi_only) gives better performance than the OpenMP backend (src/KOKKOS/kokkos.cpp:217)
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:317)
  using 1 OpenMP thread(s) per MPI task
Reading restart file ...
  restart file = 24 Mar 2022, LAMMPS = 23 Jun 2022
WARNING: Old restart file format revision. Switching to compatibility mode. (src/read_restart.cpp:609)
WARNING: Restart file used different # of processors: 48 vs. 2 (src/read_restart.cpp:654)

This should be the warning you are talking about:

WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:317)
  using 1 OpenMP thread(s) per MPI task

So there is nothing I can do?

You could try export MPICH_GPU_SUPPORT_ENABLED=1 depending on your version of MPICH. Otherwise you’ll need to use a different MPI library such as OpenMPI or MVAPICH2, see https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/.


I believe I was able to resolve the issue by loading the mvapich2 module instead of mpich. The performance on 4 GPUs/4 procs is now around 55 ns/day, compared to 23.876 ns/day for 1 proc/1 GPU.

Loop time of 31.1606 on 4 procs for 2000 steps with 432000 atoms

Performance: 55.455 ns/day, 0.433 hours/ns, 64.184 timesteps/s

98.7% CPU use with 4 MPI tasks x 1 OpenMP threads

I truly appreciate your kind help, dear Dr. Moore! As a last question, should the number of CPUs be the same as the number of GPUs for the best performance? I see that performance only roughly doubled in my case even though I used four times more GPUs.

Great. Yes, the number of MPI ranks should equal the total number of GPUs (i.e. number of nodes × GPUs per node). You need more atoms in the system to be able to strong scale efficiently; 400k atoms isn't enough to keep all the threads on multiple GPUs fully saturated. That said, we are looking at exposing more parallelism in the manybody potentials by threading over neighbors in addition to atoms, which could help.
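Using the ns/day figures reported earlier in this thread, the strong-scaling efficiency works out to roughly 58%, which a quick calculation confirms:

```python
# Strong-scaling check using the performance numbers reported above.
perf_1gpu = 23.876   # ns/day on 1 MPI rank / 1 GPU
perf_4gpu = 55.455   # ns/day on 4 MPI ranks / 4 GPUs

speedup = perf_4gpu / perf_1gpu   # ~2.32x from 4x the GPUs
efficiency = speedup / 4          # ~0.58, i.e. ~58% parallel efficiency

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```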


Did some profiling. The performance limiter of your problem on A100 GPUs appears to be imbalance in the number of neighbors per atom. I’m assuming this is because you are running a liquid system and there is no regular crystal lattice like in the Stillinger-Weber benchmark we ran first. This imbalance causes warp divergence since some threads end earlier than others, which reduces the compute efficiency. We are looking into ways to improve this.


Actually, it is a three-phase system: crystals sit in the middle of the box, with liquid and dissolved gas to the left and right.

Wow, that is complex. The idea we have is to thread over neighbors in addition to atoms. The hope is that this will give both better strong scaling, by exposing more parallelism, and better load balance, since each thread processes fewer interactions, fixing both issues at once. I'll keep you posted.


Thanks very much, Dr. Moore. I am looking forward to seeing those changes reflected in LAMMPS. My research could benefit significantly from those modifications, as it takes months for my current system to reach its final state. Thanks for keeping me posted; I truly appreciate it!