I used the following commands to compile LAMMPS with the KOKKOS package enabled:
cmake -C …/cmake/presets/basic.cmake -C …/cmake/presets/kokkos-cuda.cmake …/cmake
cmake --build .; make -j 64
The compiling process was successful and then I used the following job file to run my simulation:
#!/bin/bash
#SBATCH -N 1                    # request one node
#SBATCH -n 64                   # 64 MPI processes
#SBATCH -c 1                    # 1 thread per process
#SBATCH -t 72:00:00
#SBATCH -p gpu
#SBATCH -A myAllocation
#SBATCH -o mk3-MwWater_CH4.out  # optional, name of the stdout file
#SBATCH -e mk3-MwWater_CH4.err  # optional, name of the stderr file
Sat Oct 22 23:28:55 CDT 2022
LAMMPS (23 Jun 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
will use up to 0 GPU(s) per node
ERROR: Kokkos has been compiled with GPU-enabled backend but no GPUs are requested (src/KOKKOS/kokkos.cpp:207)
Last command: (unknown)
Sat Oct 22 23:28:57 CDT 2022
Sat Oct 22 23:53:56 CDT 2022
LAMMPS (23 Jun 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
will use up to 1 GPU(s) per node
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1035134 RUNNING AT mike183
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Sat Oct 22 23:54:02 CDT 2022
Also, when I attempt to compile KOKKOS-enabled LAMMPS with the following script:
cmake -D PKG_KOKKOS=ON -D Kokkos_ARCH_HOSTARCH=yes -D Kokkos_ARCH_GPUARCH=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_OPENMP=yes -D CMAKE_CXX_COMPILER=${HOME}/mylammps/lib/kokkos/bin/nvcc_wrapper -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_RIGID=on -D PKG_KSPACE=on -D PKG_MISC=ON ../cmake
I get the following error:
-- The CXX compiler identification is unknown
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper
-- Check for working CXX compiler: /home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCXXCompiler.cmake:59 (message):
The C++ compiler
"/home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: /home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp
Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_f5937/fast && /usr/bin/gmake -f CMakeFiles/cmTC_f5937.dir/build.make CMakeFiles/cmTC_f5937.dir/build
gmake[1]: Entering directory '/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_f5937.dir/testCXXCompiler.cxx.o
/home/madibi/mylammps/lib/kokkos/bin/nvcc_wrapper -o CMakeFiles/cmTC_f5937.dir/testCXXCompiler.cxx.o -c /home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
In file included from /usr/local/packages/cuda/10.2.89/qfy6kks/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
from <command-line>:
/usr/local/packages/cuda/10.2.89/qfy6kks/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
| ^~~~~
gmake[1]: *** [CMakeFiles/cmTC_f5937.dir/build.make:78: CMakeFiles/cmTC_f5937.dir/testCXXCompiler.cxx.o] Error 1
gmake[1]: Leaving directory '/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeTmp'
gmake: *** [Makefile:127: cmTC_f5937/fast] Error 2
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
CMakeLists.txt:17 (project)
-- Configuring incomplete, errors occurred!
See also "/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeOutput.log".
See also "/home/madibi/mylammps/build_mpi_kokkos2/CMakeFiles/CMakeError.log".
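The "#error -- unsupported GNU version" line means nvcc_wrapper is picking up a host gcc newer than CUDA 10.2 allows (gcc > 8). A minimal workaround sketch, assuming a g++ 8 is available on the cluster (the /usr/bin/g++-8 path below is an assumption; use whatever gcc/g++ 8 your module system provides):

```shell
# CUDA 10.2 only supports host gcc <= 8, so point Kokkos' nvcc_wrapper
# at an older host compiler via its environment variable.
# NOTE: the g++-8 path is an assumption for this sketch.
export NVCC_WRAPPER_DEFAULT_COMPILER=/usr/bin/g++-8

# Then re-run cmake in a *clean* build directory so the compiler
# check is redone with the new host compiler.
```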
We need more info, like the abort error message, or a stack trace of where it is failing. You could try re-running with ulimit -c unlimited to get a core dump.
Thanks Stamoor. The problem only occurred when I compiled LAMMPS using the below commands:
cmake -C …/cmake/presets/basic.cmake -C …/cmake/presets/kokkos-cuda.cmake …/cmake
cmake --build .; make -j 64
@stamoor However, when I run LAMMPS with KOKKOS for a system of around 500K atoms (only short-range interaction forces), I get very low performance, almost half of what I get with the GPU package. I also get the following warning at the beginning of my simulations:
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
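The warning above can be silenced by setting the binding variables before launching LAMMPS, e.g. in the job script. A sketch for OpenMP 4.0 or later (OMP_NUM_THREADS=1 matches the -c 1 setting in the job script above; adjust if you run more threads per rank):

```shell
# Bind OpenMP threads as the Kokkos warning suggests (OpenMP 4.0+).
export OMP_PROC_BIND=spread   # spread threads across cores
export OMP_PLACES=threads     # one place per hardware thread
export OMP_NUM_THREADS=1      # with "#SBATCH -c 1", one thread per MPI rank
```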
CPU: Intel Ice Lake (Intel® Xeon® Platinum 8358 Processor) CPUs; GPU: NVIDIA Ampere A100 GPUs with NVLink interconnect
This is my run command:
mpirun -n $SLURM_NPROCS /home/madibi/mylammps/build_mpi_kokkos/lmp -k on g 1 -sf kk -pk kokkos newton on neigh half comm device -in mw_hydrate_Teql.in
I run with different numbers of processes, from 1 to 64 (the maximum number of CPUs available on the node).
Pair style: sw. Fixes: I use fix nph and fix langevin simultaneously.
As far as I know, all of these pair styles and fixes are KOKKOS-enabled.
This looks like an ideal case for Kokkos. I would expect to see the Kokkos package as fast or faster than the GPU package.
How many GPUs?
You will probably only want to run a single MPI rank per GPU and leave most of the CPU cores idle. This is because everything is running on the GPU, and there isn’t a good way for the CPU to help out (without slowing the GPU down).
Currently you are only running a single GPU (-k on g 1), be sure to use -k on g 4 for 4 GPUs per node, -k on g 6 for 6 GPUs per node, etc.
Also be sure you are using GPU-aware MPI when using multiple GPUs. LAMMPS will disable that automatically and give a warning if there is a problem with your MPI library, which can really slow down performance.
I have as many as 4 GPUs available per node, but I can run simulations on up to 8 nodes, so I can run on 32 GPUs in total. There are around 0.5 million atoms in the simulation box. I have another system with around 2 million atoms, but the runtime for that system is also higher than with the GPU package. I also used -k on g 4, but the performance is still low compared to the GPU package with the same number of GPUs.
How can I be sure I am using a GPU-aware MPI? I followed the instructions on the LAMMPS webpage to compile LAMMPS with KOKKOS.
Here is what I get for 2 and 4 GPUs for the same 512k atom SW problem:
Loop time of 7.8263 on 2 procs for 6200 steps with 512000 atoms
Performance: 68.446 ns/day, 0.351 hours/ns, 792.201 timesteps/s, 405.607 Matom-step/s
Loop time of 5.38935 on 4 procs for 6200 steps with 512000 atoms
Performance: 99.396 ns/day, 0.241 hours/ns, 1150.418 timesteps/s, 589.014 Matom-step/s
For GPU-aware, you can force it with -pk kokkos newton on neigh half gpu/aware on, in this case it will segfault if there is a problem instead of automatically disabling GPU-aware MPI.
So, it seems that my KOKKOS-enabled LAMMPS produces just the same results as yours for the benchmark. The pair style of the benchmark is the same as in my own simulation file. Do you think my simulation performance is lower on KOKKOS because of fix nph and fix langevin?
For the GPU package, 64 CPU cores is probably too many for a single GPU unless you have a really huge number of atoms, since you are basically strong-scaling the kernels you offload to the GPU, which leads to overhead. You also want to make sure you enable the CUDA MPS (Multi-Process Service) daemon when using more than 1 MPI rank per GPU, see Multi-Process Service :: GPU Deployment and Management Documentation; this can really help performance.
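For reference, enabling MPS is typically just a couple of commands at the top of the job script. A sketch (the pipe/log directories under /tmp are my own choices for this example, not fixed paths; the control binary comes with the NVIDIA driver):

```shell
# Start the CUDA Multi-Process Service so several MPI ranks can share one GPU.
# The directories below are example choices, not required paths.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/$USER/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/$USER/mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Start the daemon only if the control binary exists on this node.
command -v nvidia-cuda-mps-control >/dev/null 2>&1 && nvidia-cuda-mps-control -d

# ... run LAMMPS here ...

# Shut the daemon down at the end of the job:
command -v nvidia-cuda-mps-control >/dev/null 2>&1 && echo quit | nvidia-cuda-mps-control
```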
For Kokkos, both fix nph and fix langevin are ported to Kokkos and should be running on the GPU so I wouldn’t expect that to significantly slow down the simulation. So I’m not sure why it is slower, would need to do some profiling. Is the input something you can share? Or I could give you instructions on how to get a quick profile.
That is still running on a single GPU: 1 procs. You need to change both the number of MPI ranks and the number of GPUs in the Kokkos args (e.g. -k on g 4 -sf kk)
For example: mpiexec -np 4 ./lmp -in in.intel.sw -k on g 4 -sf kk -pk kokkos newton on neigh half
Be sure to set export MV2_USE_CUDA=1 (for MVAPICH2). Can you also try -pk kokkos newton on neigh half comm device gpu/aware on? If that crashes, then GPU-aware MPI is not enabled. The last issue could be how you are binding the MPI tasks: if all 4 tasks are on the same socket of a dual-socket CPU, that is bad.
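As a sketch of what socket-aware binding can look like, here is an Open MPI launch line that spreads 4 ranks evenly across two sockets (the --map-by/--bind-to flags are Open MPI syntax and an assumption about your MPI; MVAPICH2 and Intel MPI spell this differently). The command is only printed here, since running it needs a GPU node:

```shell
# Spread 4 ranks across both sockets (2 per socket) instead of
# packing them onto one; Open MPI syntax, adjust for your MPI.
CMD="mpirun -np 4 --map-by ppr:2:socket --bind-to core \
  ./lmp -in in.sw -k on g 4 -sf kk -pk kokkos newton on neigh half comm device gpu/aware on"
echo "$CMD"   # print rather than run; this needs a GPU node
```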