I used the following commands to compile LAMMPS with the KOKKOS package enabled:
cmake -C …/cmake/presets/basic.cmake -C …/cmake/presets/kokkos-cuda.cmake …/cmake
cmake --build .; make -j 64
The compiling process was successful and then I used the following job file to run my simulation:
#!/bin/bash
#SBATCH -N 1                    # request one node
#SBATCH -n 64                   # 64 MPI processes
#SBATCH -c 1                    # 1 thread per process
#SBATCH -t 72:00:00
#SBATCH -p gpu
#SBATCH -A myAllocation
#SBATCH -o mk3-MwWater_CH4.out  # optional, name of the stdout file
#SBATCH -e mk3-MwWater_CH4.err  # optional, name of the stderr file
Sat Oct 22 23:28:55 CDT 2022
LAMMPS (23 Jun 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
will use up to 0 GPU(s) per node
ERROR: Kokkos has been compiled with GPU-enabled backend but no GPUs are requested (src/KOKKOS/kokkos.cpp:207)
Last command: (unknown)
Sat Oct 22 23:28:57 CDT 2022
Sat Oct 22 23:53:56 CDT 2022
LAMMPS (23 Jun 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
will use up to 1 GPU(s) per node
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1035134 RUNNING AT mike183
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Sat Oct 22 23:54:02 CDT 2022
Also, when I attempt to compile KOKKOS-enabled LAMMPS with the following script:
cmake -D PKG_KOKKOS=ON -D Kokkos_ARCH_HOSTARCH=yes -D Kokkos_ARCH_GPUARCH=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_OPENMP=yes -D CMAKE_CXX_COMPILER=${HOME}/mylammps/lib/kokkos/bin/nvcc_wrapper -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_RIGID=on -D PKG_KSPACE=on -D PKG_MISC=ON ../cmake
I get the following error:
-- The CXX compiler identification is unknown
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper
-- Check for working CXX compiler: /home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCXXCompiler.cmake:59 (message):
The C++ compiler
"/home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: /home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp
Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_f5937/fast && /usr/bin/gmake -f CMakeFiles/cmTC_f5937.dir/build.make CMakeFiles/cmTC_f5937.dir/build
gmake[1]: Entering directory '/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_f5937.dir/testCXXCompiler.cxx.o
/home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper -o CMakeFiles/cmTC_f5937.dir/testCXXCompiler.cxx.o -c /home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
In file included from /usr/local/packages/cuda/10.2.89/qfy6kks/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
from <command-line>:
/usr/local/packages/cuda/10.2.89/qfy6kks/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
| ^~~~~
gmake[1]: *** [CMakeFiles/cmTC_f5937.dir/build.make:78: CMakeFiles/cmTC_f5937.dir/testCXXCompiler.cxx.o] Error 1
gmake[1]: Leaving directory '/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp'
gmake: *** [Makefile:127: cmTC_f5937/fast] Error 2
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
CMakeLists.txt:17 (project)
-- Configuring incomplete, errors occurred!
See also "/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeOutput.log".
See also "/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeError.log".
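The "#error -- unsupported GNU version" line means nvcc_wrapper is picking up a host gcc newer than CUDA 10.2 allows (gcc > 8). A minimal workaround sketch, assuming a g++ 8 is available on the cluster (the /usr/bin/g++-8 path below is an assumption; use whatever gcc/g++ 8 your module system provides):

```shell
# CUDA 10.2 only supports host gcc <= 8, so point Kokkos' nvcc_wrapper
# at an older host compiler via its environment variable.
# NOTE: the g++-8 path is an assumption for this sketch.
export NVCC_WRAPPER_DEFAULT_COMPILER=/usr/bin/g++-8

# Then re-run cmake in a *clean* build directory so the compiler
# check is redone with the new host compiler.
```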
We need more info, like the abort error message, or a stack trace of where it is failing. You could try re-running with ulimit -c unlimited to get a core dump.
Thanks Stamoor. The problem only occurred when I compiled LAMMPS using the below commands:
cmake -C …/cmake/presets/basic.cmake -C …/cmake/presets/kokkos-cuda.cmake …/cmake
cmake --build .; make -j 64
@stamoor However, when I run LAMMPS with KOKKOS for a system of around 500K atoms (only short-range interaction forces), I get very low performance, almost half of what I get with the GPU package. I also get the following warning at the beginning of my simulations:
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
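The warning above can be silenced by setting the binding variables before launching LAMMPS, e.g. in the job script. A sketch for OpenMP 4.0 or later (OMP_NUM_THREADS=1 matches the -c 1 setting in the job script above; adjust if you run more threads per rank):

```shell
# Bind OpenMP threads as the Kokkos warning suggests (OpenMP 4.0+).
export OMP_PROC_BIND=spread   # spread threads across cores
export OMP_PLACES=threads     # one place per hardware thread
export OMP_NUM_THREADS=1      # with "#SBATCH -c 1", one thread per MPI rank
```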
CPU: Intel Ice Lake (Intel® Xeon® Platinum 8358 Processor) CPUs; GPU: NVIDIA Ampere A100 GPUs with NVLink interconnect
This is my run command:
mpirun -n $SLURM_NPROCS /home/madibi/mylammps/build_mpi_kokkos/lmp -k on g 1 -sf kk -pk kokkos newton on neigh half comm device -in mw_hydrate_Teql.in
I run with different numbers of processes, from 1 to 64 (the maximum number of CPUs available on the node).
Pair style: sw. Fixes: I use fix nph and fix langevin simultaneously.
As far as I know, all of these pair styles and fixes are KOKKOS-enabled.
This looks like an ideal case for Kokkos. I would expect to see the Kokkos package as fast or faster than the GPU package.
How many GPUs?
You will probably only want to run a single MPI rank per GPU and leave most of the CPU cores idle. This is because everything is running on the GPU, and there isn’t a good way for the CPU to help out (without slowing the GPU down).
Currently you are only running a single GPU (-k on g 1), be sure to use -k on g 4 for 4 GPUs per node, -k on g 6 for 6 GPUs per node, etc.
Also be sure you are using GPU-aware MPI when using multiple GPUs. LAMMPS will disable that automatically and give a warning if there is a problem with your MPI library, which can really slow down performance.
I have as many as 4 GPUs available per node, but I can run simulations on up to 8 nodes, so I can run on 32 GPUs in total. There are around 0.5 million atoms in the simulation box. I have another system with around 2 million atoms, but the runtime for that system is also higher than with the GPU package. I also used -k on g 4, but the performance is still low compared to the GPU package with the same number of GPUs.
How can I be sure I am using a GPU-aware MPI? I followed the instructions on the LAMMPS webpage to compile LAMMPS with KOKKOS.
Here is what I get for 2 and 4 GPUs for the same 512k atom SW problem:
Loop time of 7.8263 on 2 procs for 6200 steps with 512000 atoms
Performance: 68.446 ns/day, 0.351 hours/ns, 792.201 timesteps/s, 405.607 Matom-step/s
Loop time of 5.38935 on 4 procs for 6200 steps with 512000 atoms
Performance: 99.396 ns/day, 0.241 hours/ns, 1150.418 timesteps/s, 589.014 Matom-step/s
For GPU-aware, you can force it with -pk kokkos newton on neigh half gpu/aware on, in this case it will segfault if there is a problem instead of automatically disabling GPU-aware MPI.
So, it seems that my KOKKOS-enabled LAMMPS produces just the same results as yours for the benchmark. The pair style of the benchmark is the same as in my own simulation file. Do you think my simulation performance is lower on KOKKOS because of fix nph and fix langevin?
For the GPU package, 64 CPU cores is probably too many for a single GPU unless you have a really huge number of atoms, since you are basically strong-scaling the kernels you offload to the GPU, which leads to overhead. You also want to make sure you enable the CUDA MPS (Multi-Process Service) daemon when using more than 1 MPI rank per GPU, see Multi-Process Service :: GPU Deployment and Management Documentation; this can really help performance.
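For reference, enabling MPS is typically just a couple of commands at the top of the job script. A sketch (the pipe/log directories under /tmp are my own choices for this example, not fixed paths; the control binary comes with the NVIDIA driver):

```shell
# Start the CUDA Multi-Process Service so several MPI ranks can share one GPU.
# The directories below are example choices, not required paths.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/$USER/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/$USER/mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Start the daemon only if the control binary exists on this node.
command -v nvidia-cuda-mps-control >/dev/null 2>&1 && nvidia-cuda-mps-control -d

# ... run LAMMPS here ...

# Shut the daemon down at the end of the job:
command -v nvidia-cuda-mps-control >/dev/null 2>&1 && echo quit | nvidia-cuda-mps-control
```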
For Kokkos, both fix nph and fix langevin are ported to Kokkos and should be running on the GPU so I wouldn’t expect that to significantly slow down the simulation. So I’m not sure why it is slower, would need to do some profiling. Is the input something you can share? Or I could give you instructions on how to get a quick profile.
That is still running on a single GPU: 1 procs. You need to change both the number of MPI ranks and the number of GPUs in the Kokkos args (e.g. -k on g 4 -sf kk)
For example: mpiexec -np 4 ./lmp -in in.intel.sw -k on g 4 -sf kk -pk kokkos newton on neigh half
Be sure to set export MV2_USE_CUDA=1 (for MVAPICH2). Can you also try -pk kokkos newton on neigh half comm device gpu/aware on? If that crashes, then GPU-aware MPI is not enabled. The last issue could be how you are binding the MPI tasks: if all 4 tasks are on the same socket of a dual-socket CPU, that is bad.
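As a sketch of what socket-aware binding can look like, here is an Open MPI launch line that spreads 4 ranks evenly across two sockets (the --map-by/--bind-to flags are Open MPI syntax and an assumption about your MPI; MVAPICH2 and Intel MPI spell this differently). The command is only printed here, since running it needs a GPU node:

```shell
# Spread 4 ranks across both sockets (2 per socket) instead of
# packing them onto one; Open MPI syntax, adjust for your MPI.
CMD="mpirun -np 4 --map-by ppr:2:socket --bind-to core \
  ./lmp -in in.sw -k on g 4 -sf kk -pk kokkos newton on neigh half comm device gpu/aware on"
echo "$CMD"   # print rather than run; this needs a GPU node
```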