Segmentation fault (11) / Invalid permissions (2) when using multiple GPUs with LAMMPS + ML-IAP/MACE

I am trying to get a build of LAMMPS with MACE via the ML-IAP package that supports multiple GPUs. I've attempted to include as much information as possible in this post. For reference, the OpenMPI was compiled with CUDA support (from the HPC admin):
“I have built this MPI now. The UCC and UCX are included within this install and I built them with cuda also. So, it has NCCL, UCC, UCX, ROCM, CUDA, etc. all included.”

CMake command:

cmake -D CMAKE_BUILD_TYPE=debug \
-D CMAKE_INSTALL_PREFIX=$(pwd) \
-D CMAKE_CXX_STANDARD=17 \
-D CMAKE_CXX_STANDARD_REQUIRED=ON \
-D BUILD_MPI=ON \
-D BUILD_SHARED_LIBS=ON \
-D PKG_KOKKOS=ON \
-D Kokkos_ENABLE_DEBUG=on \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_ZEN2=ON \
-D Kokkos_ARCH_AMPERE80=ON \
-D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
-D PKG_ML-MACE=ON \
-D PKG_ML-IAP=ON \
-D MLIAP_ENABLE_PYTHON=ON \
-D PKG_ML-SNAP=ON \
-D PKG_PYTHON=ON \
-D PKG_COLVARS=ON \
-D PKG_MOLECULE=ON \
-D PKG_EXTRA-DUMP=ON \
-D PKG_EXTRA-COMPUTE=ON \
-D PKG_EXTRA-FIX=ON \
-D PKG_EXTRA-PAIR=ON \
-D PKG_REPLICA=ON \
-D PKG_RIGID=ON \
-D PKG_KSPACE=ON \
../cmake >> cmake_log.txt 2>&1

I'll skip the details of the job itself; it is just one water molecule running NVT for 1000 steps with a MACE foundation model converted to the ML-IAP format for LAMMPS, and it crashes right after the "Setting up Verlet run" section.
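For context only, a rough sketch of the relevant part of mini.in is below; the data file, model filename, element list, and settings are placeholders rather than the exact ones I used:

# sketch only -- not the exact mini.in; filenames and values are placeholders
units        metal
atom_style   atomic
read_data    water.data
pair_style   mliap unified mace-foundation.model-mliap_lammps.pt 0
pair_coeff   * * O H
fix          1 all nvt temp 300.0 300.0 0.1
timestep     0.0005
run          1000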

Running on 1 GPU is fine (with some ignorable warnings/errors in the Slurm stdout file):

“mpirun -np 1 /vast/projects/athena/staging/laco457/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/lmp -k on g 1 -sf kk -pk kokkos gpu/aware off newton on neigh half < mini.in > OUTPMF.1GPU”

Running on 2 GPUs crashes:
“mpirun -np 2 /vast/projects/athena/staging/laco457/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/lmp -k on g 2 -sf kk -pk kokkos gpu/aware off newton on neigh half < mini.in > OUTPMF.2GPU”
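Since I also tried moving these settings into the input file, a rough sketch of what I placed near the top of mini.in for those runs is below (the exact values were varied between attempts, so treat this as illustrative only):

# sketch: KOKKOS package settings tried inside the input file,
# equivalent to the -pk kokkos command-line options above;
# the package command must appear before the simulation box is defined
package      kokkos gpu/aware off neigh half newton on comm device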

I've tried different combinations of options, such as gpu/aware on/off, comm device/no, and specifying the package options within the LAMMPS input file itself (as sketched above), with no success. The errors all look the same:

[a100-07:217959] *** Process received signal ***
[a100-07:217959] Signal: Segmentation fault (11)
[a100-07:217959] Signal code: Invalid permissions (2)
[a100-07:217959] Failing at address: 0x1554edbaf200
[a100-07:217959] [ 0] /usr/lib64/libc.so.6(+0x3e730)[0x15553ea3e730]
[a100-07:217959] [ 1] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS15PairMLIAPKokkosIN6Kokkos4CudaEE17pack_forward_commIdEEiiPiPdiS5_PT_+0x7f)[0x155546436505]
[a100-07:217959] [ 2] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS15PairMLIAPKokkosIN6Kokkos4CudaEE17pack_forward_commEiPiPdiS4_+0x1d1)[0x15554642d061]
[a100-07:217959] [ 3] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS9CommBrick12forward_commEPNS_4PairEi+0xcb)[0x155543511ddb]
[a100-07:217959] [ 4] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS10CommKokkos12forward_commEPNS_4PairEi+0x67)[0x15554435466f]
[a100-07:217959] [ 5] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS15PairMLIAPKokkosIN6Kokkos4CudaEE12forward_commIdEEiPT_S6_i+0x11a)[0x15554642dcfc]
[a100-07:217959] [ 6] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(_ZN9LAMMPS_NS21MLIAPDataKokkosDevice16forward_exchangeIdEEvPT_S3_i+0x35)[0x1555446d6575]
[a100-07:217959] [ 7] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(+0x564d911)[0x1555446a0911]
[a100-07:217959] [ 8] /a/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/liblammps.so.0(+0x564bd19)[0x15554469ed19]
[a100-07:217959] [ 9] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x140616)[0x15553d8fd616]
[a100-07:217959] [10] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x716)[0x15553d8ec806]
[a100-07:217959] [11] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyFunction_Vectorcall+0x75)[0x15553d8fc3a5]
[a100-07:217959] [12] /a/mace-polar-OMPICUDA/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x81f40c)[0x1553edc3540c]
[a100-07:217959] [13] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x13eee8)[0x15553d8fbee8]
[a100-07:217959] [14] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(PyObject_Call+0x20f)[0x15553d909e0f]
[a100-07:217959] [15] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5670)[0x15553d8f1760]
[a100-07:217959] [16] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [17] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x48a4)[0x15553d8f0994]
[a100-07:217959] [18] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [19] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x12f9)[0x15553d8ed3e9]
[a100-07:217959] [20] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [21] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(PyObject_Call+0xc1)[0x15553d909cc1]
[a100-07:217959] [22] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x2a3f)[0x15553d8eeb2f]
[a100-07:217959] [23] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x14c2e9)[0x15553d9092e9]
[a100-07:217959] [24] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(PyObject_Call+0xc1)[0x15553d909cc1]
[a100-07:217959] [25] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x2a3f)[0x15553d8eeb2f]
[a100-07:217959] [26] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyFunction_Vectorcall+0x75)[0x15553d8fc3a5]
[a100-07:217959] [27] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyObject_FastCallDictTstate+0x19b)[0x15553d8f44ab]
[a100-07:217959] [28] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(_PyObject_Call_Prepend+0x67)[0x15553d907887]
[a100-07:217959] [29] /a/mace-polar-OMPICUDA/lib/libpython3.10.so.1.0(+0x212ace)[0x15553d9cface]
[a100-07:217959] *** End of error message ***
prterun noticed that process rank 0 with PID 217959 on node a100-07 exited on
signal 11 (Segmentation fault).

I’ve run into a wall in terms of trying to resolve this myself and would appreciate any suggestions.
Thank you for your help with this.

Unfortunately, there is not much that we can do to help you here, since MACE is not part of LAMMPS but maintained separately by the MACE developers. Thus, we have no knowledge about the details of their implementation, and you will have to reach out to them for assistance with your issues.

To have a more meaningful report, I suggest you build with -D CMAKE_BUILD_TYPE=Debug (and not “debug”) and also report which LAMMPS version exactly you are using.

Thank you for your reply. I don't have the expertise to interpret that backtrace or to tell which program is actually causing the problem. I'll try reaching out to MACE to see if their team can provide any insight.
I corrected the 'Debug' typo in the CMake command, rebuilt LAMMPS, and reran my test jobs, with no noticeable difference. The LAMMPS version, taken from version.h, is LAMMPS_VERSION "11 Feb 2026". I'll include that next time I post on this forum.

I've created a post on the MACE GitHub page and will link it here so that, if they provide a solution, future readers of this thread can find it. The GitHub post also includes additional information in the form of files that I wasn't able to attach here.

It looks like your OpenMPI is not CUDA-aware, and something in the ML-IAP KOKKOS code isn't set up to handle that. The PairMLIAPKokkos::pack_forward_comm frames in your backtrace are not a good sign.

If you can use a CUDA-aware MPI then this issue should go away.
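One quick way to check whether a given OpenMPI build was compiled with CUDA support (assuming ompi_info from that build is on your path) is:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

which should print a line ending in "true" for a CUDA-aware build.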

Just to be clear, I think this is likely a LAMMPS KOKKOS MLIAP issue, not inside MACE. But an external collaborator (Matt Bettencourt) created the KOKKOS MLIAP port, so I’m not very familiar with it.

… Frustratingly, I switched the gpu/aware flag from off to on in both the Slurm script and the LAMMPS input, and the job ran successfully. This is something I had tried before without success; the inputs are the same, so I don't know why it is only working now.
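For future readers, the run line that ended up working is the same 2-GPU command as above with only that flag flipped (reconstructed here rather than pasted from my script, so double-check against your own setup):

mpirun -np 2 /vast/projects/athena/staging/laco457/LAMMPS/LAMMPS_MLIAP/build-MLIAP-OPMICUDA/lmp -k on g 2 -sf kk -pk kokkos gpu/aware on newton on neigh half < mini.in > OUTPMF.2GPU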

I wish I knew why this is suddenly working, but regardless the issue has been resolved. Thank you for your responses.

This may happen if the sysadmin changed the default mpirun, or the mpirun you were using, from a CUDA-unaware build of (for example) OpenMPI to a CUDA-aware build between when you last tested it and now.