===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3403356 RUNNING AT mike181
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Tue Oct 25 16:45:54 CDT 2022
Yes, that means your GPU-aware MPI is not working correctly. In that case LAMMPS has to do the MPI buffer pack/unpack on the CPU and transfer a bunch of data back and forth between host and device, which leads to overhead.
Either is fine with me; you can find my email here: Authors of LAMMPS.
I don't recall because I compiled my LAMMPS with the GPU package a long time ago. However, I checked the compile commands and it seems that I compiled the GPU package with double precision. Here are the commands I used:
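In essence it was a build with the GPU package set to double precision, something like the sketch below (illustrative only, not a verbatim copy of my commands; the GPU_ARCH value is an assumption):

# illustrative CMake configuration for the LAMMPS GPU package in double precision
# (GPU_ARCH=sm_80 assumes an A100-class card; adjust for the actual hardware)
cmake ../cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_PREC=double -D GPU_ARCH=sm_80
cmake --build . -j 8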
So, as I understand it, the KOKKOS package does not work very well for me if I set the number of GPUs to more than one, right? And this is because I have not compiled with GPU-aware MPI?
GPU-aware MPI really depends on your MPI library; nothing in LAMMPS needs to be compiled differently. If you are using MVAPICH2, then export MV2_USE_CUDA=1 should be sufficient. You should get a warning at the top of the LAMMPS log file when LAMMPS turns GPU-aware MPI off (to avoid a segfault). Can you post that warning here?
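For example, the relevant lines in the job script could look roughly like this (a minimal sketch; the executable name lmp, the input file name, and the 2-rank/2-GPU counts are assumptions):

export MV2_USE_CUDA=1    # let MVAPICH2 pass CUDA (device) buffers directly to MPI calls
mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos gpu/aware on -in in.lammps
# -np 2                    two MPI ranks, one per GPU
# -k on g 2                enable KOKKOS with 2 GPUs per node
# -sf kk                   use the KOKKOS (kk) suffix styles
# -pk kokkos gpu/aware on  request GPU-aware MPI communication
# in.lammps                placeholder input script name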
Just to make sure I understand what you mean: I need to run with 2 processors and 2 GPUs, export MV2_USE_CUDA=1, and also set gpu/aware on in the batch file to generate the warnings?
I only added export MV2_USE_CUDA=1 and received the following warnings:
WARNING: When using a single thread, the Kokkos Serial backend (i.e. Makefile.kokkos_mpi_only) gives better performance than the OpenMP backend (src/KOKKOS/kokkos.cpp:217)
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:317)
using 1 OpenMP thread(s) per MPI task
Reading restart file ...
restart file = 24 Mar 2022, LAMMPS = 23 Jun 2022
WARNING: Old restart file format revision. Switching to compatibility mode. (src/read_restart.cpp:609)
WARNING: Restart file used different # of processors: 48 vs. 2 (src/read_restart.cpp:654)
This should be the warning you are talking about:
WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:317)
using 1 OpenMP thread(s) per MPI task
I believe I was able to resolve the issue by loading the mvapich2 module instead of mpich. The performance on 4 GPUs/4 procs is now around 55 ns/day, compared to 23.876 ns/day for 1 proc/1 GPU.
Loop time of 31.1606 on 4 procs for 2000 steps with 432000 atoms
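For reference, the working setup was roughly the following (a sketch of what I did; the module, executable, and input names are placeholders and will differ on other clusters):

module unload mpich        # the MPICH build made LAMMPS disable GPU-aware MPI (see warning above)
module load mvapich2       # MVAPICH2 can pass CUDA buffers when MV2_USE_CUDA=1
export MV2_USE_CUDA=1
mpirun -np 4 lmp -k on g 4 -sf kk -pk kokkos gpu/aware on -in in.lammps   # 4 ranks, 4 GPUs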
I truly appreciate your kind help, dear Dr. Moore! Just one last question: should the number of CPUs be the same as the number of GPUs for the best performance? I see that performance only roughly doubled in my case even though I used four times more GPUs.
Great. Yes, the number of MPI ranks should equal the total number of GPUs (i.e. number of nodes * GPUs per node). You need more atoms in the system to be able to strong-scale efficiently; 400k atoms isn't enough to keep all the threads on multiple GPUs fully saturated. That said, we are looking at exposing more parallelism in the many-body potentials by threading over neighbors in addition to atoms, which could help.
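As an illustration of the rank/GPU rule, a two-node job with 4 GPUs per node would use 8 ranks total (a sketch; the launcher and executable names are assumptions):

# 2 nodes x 4 GPUs/node = 8 GPUs total -> 8 MPI ranks
mpirun -np 8 lmp -k on g 4 -sf kk -in in.lammps   # "g 4" is GPUs per node, not total GPUs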
Did some profiling. The performance limiter of your problem on A100 GPUs appears to be imbalance in the number of neighbors per atom. I’m assuming this is because you are running a liquid system and there is no regular crystal lattice like in the Stillinger-Weber benchmark we ran first. This imbalance causes warp divergence since some threads end earlier than others, which reduces the compute efficiency. We are looking into ways to improve this.
Wow, that is complex. The idea we have is to thread over neighbors in addition to atoms. The hope is that this will give both better strong scaling, by exposing more parallelism, and better load balance, since each thread processes fewer interactions, fixing both issues at once. I’ll keep you posted.
Thanks very much, Dr. Moore. I am looking forward to seeing those changes reflected in LAMMPS. My research can benefit significantly from those modifications, as it takes months for my current system to reach its final state. Thanks for keeping me posted; I truly appreciate it!