I am trying to build LAMMPS with MACE via the ML-IAP package so that it can run on multiple GPUs. I've included as much information as I can in this post. For reference, Open MPI was compiled with CUDA support. From the HPC admin:
“I have built this MPI now. The UCC and UCX are included within this install and I built them with cuda also. So, it has NCCL, UCC, UCX, ROCM, CUDA, etc. all included.”
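One way to double-check the admin's statement from the compute node is to query Open MPI directly. This is a minimal sketch, assuming `ompi_info` from the same Open MPI install is on the PATH; `has_cuda_support` is a hypothetical helper name used only for illustration, and the parameter name is the one Open MPI documents for this check.

```shell
# Hypothetical helper: succeeds if the ompi_info output on stdin
# reports that Open MPI was built with CUDA support.
has_cuda_support() {
  grep -q 'mpi_built_with_cuda_support:value:true'
}

# Only query ompi_info if it is actually available in this environment.
if command -v ompi_info >/dev/null 2>&1; then
  if ompi_info --parsable --all | has_cuda_support; then
    echo "Open MPI reports CUDA support"
  else
    echo "Open MPI does not report CUDA support"
  fi
else
  echo "ompi_info not found in PATH"
fi
```

This only shows how the library was built; it does not by itself show that the `mpirun` used in the job is the same install.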
The CMake command:
cmake -D CMAKE_BUILD_TYPE=debug \
-D CMAKE_INSTALL_PREFIX=$(pwd) \
-D CMAKE_CXX_STANDARD=17 \
-D CMAKE_CXX_STANDARD_REQUIRED=ON \
-D BUILD_MPI=ON \
-D BUILD_SHARED_LIBS=ON \
-D PKG_KOKKOS=ON \
-D Kokkos_ENABLE_DEBUG=on \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_ZEN2=ON \
-D Kokkos_ARCH_AMPERE80=ON \
-D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
-D PKG_ML-MACE=ON \
-D PKG_ML-IAP=ON \
-D MLIAP_ENABLE_PYTHON=ON \
-D PKG_ML-SNAP=ON \
-D PKG_PYTHON=ON \
-D PKG_COLVARS=ON \
-D PKG_MOLECULE=ON \
-D PKG_EXTRA-DUMP=ON \
-D PKG_EXTRA-COMPUTE=ON \
-D PKG_EXTRA-FIX=ON \
-D PKG_EXTRA-PAIR=ON \
-D PKG_REPLICA=ON \
-D PKG_RIGID=ON \
-D PKG_KSPACE=ON \
../cmake >> cmake_log.txt 2>&1
I'll skip the details of the job itself: it is just 1 water molecule run in NVT for 1000 steps, using a MACE foundation model converted to the ML-IAP format for LAMMPS. It crashes right after the “Setting up Verlet run” section of the output.
Running on 1 GPU is fine (with some ignorable warnings/errors in the Slurm stdout file):
mpirun -np 1 /vast/projects/athena/staging/laco457/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/lmp -k on g 1 -sf kk -pk kokkos gpu/aware off newton on neigh half < mini.in > OUTPMF.1GPU
Running on 2 GPUs crashes:
mpirun -np 2 /vast/projects/athena/staging/laco457/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/lmp -k on g 2 -sf kk -pk kokkos gpu/aware off newton on neigh half < mini.in > OUTPMF.2GPU
I've tried different combinations of options, such as gpu/aware on/off and comm device/no, and I've also tried specifying the package options inside the LAMMPS input file itself, all with no success. The errors all look the same:
[a100-07:217959] *** Process received signal ***
[a100-07:217959] Signal: Segmentation fault (11)
[a100-07:217959] Signal code: Invalid permissions (2)
[a100-07:217959] Failing at address: 0x1554edbaf200
[a100-07:217959] [ 0] /usr/lib64/libc.so.6(+0x3e730)[0x15553ea3e730]
[a100-07:217959] [ 1] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS15PairMLIAPKokkosIN6Kokkos4CudaEE17pack_forward_commIdEEiiPiPdiS5_PT_+0x7f)[0x155546436505]
[a100-07:217959] [ 2] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS15PairMLIAPKokkosIN6Kokkos4CudaEE17pack_forward_commEiPiPdiS4_+0x1d1)[0x15554642d061]
[a100-07:217959] [ 3] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS9CommBrick12forward_commEPNS_4PairEi+0xcb)[0x155543511ddb]
[a100-07:217959] [ 4] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS10CommKokkos12forward_commEPNS_4PairEi+0x67)[0x15554435466f]
[a100-07:217959] [ 5] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS15PairMLIAPKokkosIN6Kokkos4CudaEE12forward_commIdEEiPT_S6_i+0x11a)[0x15554642dcfc]
[a100-07:217959] [ 6] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS21MLIAPDataKokkosDevice16forward_exchangeIdEEvPT_S3_i+0x35)[0x1555446d6575]
[a100-07:217959] [ 7] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(+0x564d911)[0x1555446a0911]
[a100-07:217959] [ 8] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(+0x564bd19)[0x15554469ed19]
[a100-07:217959] [ 9] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x140616)[0x15553d8fd616]
[a100-07:217959] [10] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x716)[0x15553d8ec806]
[a100-07:217959] [11] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyFunction_Vectorcall+0x75)[0x15553d8fc3a5]
[a100-07:217959] [12] /a/mace-polar-OMPICUDA/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x81f40c)[0x1553edc3540c]
[a100-07:217959] [13] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x13eee8)[0x15553d8fbee8]
[a100-07:217959] [14] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(PyObject_Call+0x20f)[0x15553d909e0f]
[a100-07:217959] [15] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5670)[0x15553d8f1760]
[a100-07:217959] [16] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [17] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x48a4)[0x15553d8f0994]
[a100-07:217959] [18] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [19] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x12f9)[0x15553d8ed3e9]
[a100-07:217959] [20] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [21] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(PyObject_Call+0xc1)[0x15553d909cc1]
[a100-07:217959] [22] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x2a3f)[0x15553d8eeb2f]
[a100-07:217959] [23] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [24] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(PyObject_Call+0xc1)[0x15553d909cc1]
[a100-07:217959] [25] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x2a3f)[0x15553d8eeb2f]
[a100-07:217959] [26] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyFunction_Vectorcall+0x75)[0x15553d8fc3a5]
[a100-07:217959] [27] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyObject_FastCallDictTstate+0x19b)[0x15553d8f44ab]
[a100-07:217959] [28] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyObject_Call_Prepend+0x67)[0x15553d907887]
[a100-07:217959] [29] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x212ace)[0x15553d9cface]
[a100-07:217959] *** End of error message ***
prterun noticed that process rank 0 with PID 217959 on node a100-07 exited on
signal 11 (Segmentation fault).
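In case it helps with reading the trace, the mangled C++ names in the frames can be pulled out and demangled. The following is only a sketch: `extract_symbols` is a hypothetical helper (not part of LAMMPS or Open MPI), it assumes the frame layout shown above, and the demangling step assumes `c++filt` from binutils is installed.

```shell
# Hypothetical helper: read backtrace lines on stdin and print the mangled
# symbol from frames of the form  lib.so(SYMBOL+0xOFFSET)[0xADDR].
# Frames without a symbol name (e.g. "(+0x564d911)") are skipped.
extract_symbols() {
  sed -n 's/.*(\(_Z[A-Za-z0-9_]*\)+0x[0-9a-f]*).*/\1/p'
}

# One frame copied from the trace above.
frame='[a100-07:217959] [ 3] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS9CommBrick12forward_commEPNS_4PairEi+0xcb)[0x155543511ddb]'

symbol=$(printf '%s\n' "$frame" | extract_symbols)
echo "$symbol"

# Demangle only if c++filt is available on this machine.
if command -v c++filt >/dev/null 2>&1; then
  printf '%s\n' "$symbol" | c++filt
fi
```

Read this way, the crash is in the `pack_forward_comm` call of `PairMLIAPKokkos`, reached through `CommBrick::forward_comm`, while the Python model is requesting a forward exchange.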
I’ve run into a wall in terms of trying to resolve this myself and would appreciate any suggestions.
Thank you for your help with this.