Python module segmentation fault

Hello everyone,

The LAMMPS version I’m using is 29 Aug 2024 (and older releases going back up to a year), on Linux (Ubuntu, RHEL, Rocky Linux).

I am using the LAMMPS Python module to automatically launch and manage multiple LAMMPS instances. While initialising these simulations I noticed that some of them crash with segmentation faults. To keep things moving, I encapsulated each LAMMPS instance in a child process so that such a crash would not take down the whole workflow.
I didn’t investigate the crashes further and attributed them to quirks of the force field.
However, trying to recreate the crashes by running the LAMMPS binary directly never produced similar failures, which made the problem harder to track down.
Revisiting the stack trace, it appears that creating the box and mapping coordinates leads to a memory access issue, but I do not understand the underlying process well enough to say more, so here is the stack trace.

[device] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fffc804ded8)
==== backtrace (tid: 290652) ====
 0  /lib/libucs.so.0(ucs_handle_error+0x254) [0x7fffb56b6b94]
 1  /lib/libucs.so.0(+0x27d4c) [0x7fffb56b6d4c]
 2  /lib/libucs.so.0(+0x27ff8) [0x7fffb56b6ff8]
 3  /lib64/libpsm2.so.2(+0x2750c) [0x7fffc07bd50c]
 4  /lib64/libpsm2.so.2(+0x258aa) [0x7fffc07bb8aa]
 5  /lib64/libpsm2.so.2(+0x1a5b4) [0x7fffc07b05b4]
 6  /lib64/libpsm2.so.2(+0x24b47) [0x7fffc07bab47]
 7  /lib64/libpsm2.so.2(psm2_mq_ipeek2+0x89) [0x7fffc07b3d89]
 8  /openmpi4-gnu12/4.1.4/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_progress+0x61) [0x7fffb5276791]
 9  /openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7fffcc892f2c]
10  /openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x10d) [0x7fffcc89962d]
11 /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_nextcid+0x169) [0x7fffcd24a3e9]
12  /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_enable+0x49) [0x7fffcd245a79]
13  /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(mca_topo_base_cart_create+0x1cc) [0x7fffcd2f3dbc]
14  /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(MPI_Cart_create+0x222) [0x7fffcd27d532]
15 /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS7ProcMap8cart_mapEiPiS1_PA2_iPPS1_+0x57) [0x7fffef8fe1d7]
16 /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS4Comm13set_proc_gridEi+0x973) [0x7fffef56bc03]
17  /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS9CreateBox7commandEiPPc+0xd68) [0x7fffefa3b3c8]
18 /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input15execute_commandEv+0x712) [0x7fffefacdd82]
19  /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input4fileEv+0x177) [0x7fffeface557]
20  /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input7includeEv+0xee) [0x7fffefacebbe]
21  /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input15execute_commandEv+0x7d0) [0x7fffefacde40]
22  /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input4fileEv+0x177) [0x7fffeface557]
23  /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input4fileEPKc+0xc2) [0x7fffeface942]
24  /python3.11/site-packages/lammps/liblammps.so(lammps_file+0x23) [0x7fffef3da763]
25  /lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7fffd098214e]
26  /lib64/libffi.so.6(ffi_call+0x36f) [0x7fffd0981aff]
27  /python/3.11.2/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0xca4d) [0x7fffd0b91a4d]
28  /python/3.11.2/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x8340) [0x7fffd0b8d340]
29  python3(_PyObject_MakeTpCall+0x6b) [0x50469b]
30  python3(_PyEval_EvalFrameDefault+0x6c8) [0x562808]
31  python3() [0x56136b]
32  python3(_PyEval_EvalFrameDefault+0x3c1d) [0x565d5d]
33  python3() [0x56136b]
34  python3(_PyEval_EvalFrameDefault+0x3c1d) [0x565d5d]
35  python3() [0x56136b]
36  python3(_PyObject_Call_Prepend+0xd3) [0x505543]
37  python3() [0x53d254]
38  python3() [0x53ae43]
39  python3(_PyObject_MakeTpCall+0x6b) [0x50469b]
40  python3(_PyEval_EvalFrameDefault+0x6c8) [0x562808]
41  python3() [0x56136b]
42  python3(PyEval_EvalCode+0x93) [0x5dc913]
43  python3() [0x5f0867]
44  python3() [0x5f07ff]
45  python3() [0x5f0fa2]
46  python3(_PyRun_SimpleFileObject+0x190) [0x5f0ce0]
47  python3(_PyRun_AnyFileObject+0x44) [0x5f0984]
48  python3(Py_RunMain+0x2a4) [0x5f8424]
49  python3(Py_BytesMain+0x27) [0x5f8077]
50  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7ffff709acf3]
51  python3(_start+0x2e) [0x59639e]
=================================

[device] *** Process received signal ***
[device] Signal: Segmentation fault (11)
[device] Signal code:  (-6)
[device] Failing at address: 0x45900046f5c
[device] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x7ffff7bc1ce0]
[device] [ 1] /lib64/libpsm2.so.2(+0x2750c)[0x7fffc07bd50c]
[device] [ 2] /lib64/libpsm2.so.2(+0x258aa)[0x7fffc07bb8aa]
[device] [ 3] /lib64/libpsm2.so.2(+0x1a5b4)[0x7fffc07b05b4]
[device] [ 4] /lib64/libpsm2.so.2(+0x24b47)[0x7fffc07bab47]
[device] [ 5] /lib64/libpsm2.so.2(psm2_mq_ipeek2+0x89)[0x7fffc07b3d89]
[device] [ 6] /openmpi4-gnu12/4.1.4/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_progress+0x61)[0x7fffb5276791]
[device] [ 7] /openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fffcc892f2c]
[device] [ 8] /openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x10d)[0x7fffcc89962d]
[device] [ 9] /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_nextcid+0x169)[0x7fffcd24a3e9]
[device] [10] /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_enable+0x49)[0x7fffcd245a79]
[device] [11] /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(mca_topo_base_cart_create+0x1cc)[0x7fffcd2f3dbc]
[device] [12] /openmpi4-gnu12/4.1.4/lib/libmpi.so.40(MPI_Cart_create+0x222)[0x7fffcd27d532]
[device] [13] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS7ProcMap8cart_mapEiPiS1_PA2_iPPS1_+0x57)[0x7fffef8fe1d7]
[device] [14] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS4Comm13set_proc_gridEi+0x973)[0x7fffef56bc03]
[device] [15] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS9CreateBox7commandEiPPc+0xd68)[0x7fffefa3b3c8]
[device] [16] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input15execute_commandEv+0x712)[0x7fffefacdd82]
[device] [17] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input4fileEv+0x177)[0x7fffeface557]
[device] [18] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input7includeEv+0xee)[0x7fffefacebbe]
[device] [19] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input15execute_commandEv+0x7d0)[0x7fffefacde40]
[device] [20] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input4fileEv+0x177)[0x7fffeface557]
[device] [21] /python3.11/site-packages/lammps/liblammps.so(_ZN9LAMMPS_NS5Input4fileEPKc+0xc2)[0x7fffeface942]
[device] [22] /python3.11/site-packages/lammps/liblammps.so(lammps_file+0x23)[0x7fffef3da763]
[device] [23] /lib64/libffi.so.6(ffi_call_unix64+0x4c)[0x7fffd098214e]
[device] [24] /lib64/libffi.so.6(ffi_call+0x36f)[0x7fffd0981aff]
[device] [25] /python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0xca4d)[0x7fffd0b91a4d]
[device] [26] /python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x8340)[0x7fffd0b8d340]
[device] [27] python3(_PyObject_MakeTpCall+0x6b)[0x50469b]
[device] [28] python3(_PyEval_EvalFrameDefault+0x6c8)[0x562808]
[device] [29] python3[0x56136b]
[device] *** End of error message ***

While the problem seemed easy to disregard at first, it came to a head when every LAMMPS instance I launched on a different HPC platform crashed with a similar stack trace. I would appreciate any insight into this observation.

Cheers

It seems more like an MPI issue.

It is impossible to debug this without the means to reproduce it.

Hello @akohlmey,
I was able to reproduce the crash with this example system. Please let me know if this is sufficient to test for the behaviour.

data.particles (246 Bytes)
in.pour.drum (3.4 KB)
sim_launch.py (3.3 KB)

I cannot reproduce any kind of crash with this. The program will start and eventually stop.
But it seems to me you have a programming problem in this part:

def lammps_init(file_path, comm):

    args = ["-screen", "none", "-nb"]
    #lmp = lammps(comm=comm)
    lmp = lammps(cmdargs=args, comm=comm)
    print("Running sim")
    lmp.file(file_path)
    return 1

This creates new LAMMPS instances but never releases them. So eventually you will run out of file descriptors or hit other resource limits. You need to call lmp.close() before returning to free the resources allocated by a LAMMPS instance.
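One way to guarantee that close() always runs is a small context manager around the instance. This is only a sketch: `lammps_session` is an illustrative name, not part of the LAMMPS API, and recent versions of the LAMMPS Python module may already support `with lammps(...) as lmp:` directly.

```python
from contextlib import contextmanager

@contextmanager
def lammps_session(factory, *args, **kwargs):
    """Create an instance via 'factory' and guarantee that close()
    runs, even if the body of the 'with' block raises."""
    inst = factory(*args, **kwargs)
    try:
        yield inst
    finally:
        inst.close()

# Hypothetical usage with the real module:
#   from lammps import lammps
#   with lammps_session(lammps, cmdargs=["-screen", "none"], comm=comm) as lmp:
#       lmp.file(file_path)
```

With this pattern, an input script that errors out mid-run still releases its instance before the exception propagates.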

That said, you are trying to solve a problem “the hard way™” for which LAMMPS has a built-in solution: multi-partition runs. Consider the following script, based on your Python example:

variable input universe in.pour.drum in.pour.drum &
   in.pour.drum in.pour.drum in.pour.drum in.pour.drum &
   in.pour.drum in.pour.drum in.pour.drum in.pour.drum
variable run uloop 10 pad
variable world world 1 2 3 4

label loop

clear
log log.run.${run}
print "running job ${run} with input ${input} on partition ${world}"

include ${input}

next input run
jump SELF loop

print "done"

The variable “input” contains a list of 10 values that are the various inputs to run.
The variable “run” is similar, but contains the padded values 01, 02, up to 10.
The variable “world” has only 4 values, since the job is supposed to run on 4 partitions, numbered 1 to 4.
When you launch LAMMPS with mpirun -np 4 and add the -p 4x1 flag, it will split the 4 processes into 4 independent partitions, and the first 4 values of the input and run variables will be distributed across the 4 partitions. The “next” command will distribute the next 4 values.
The lines between “label loop” and “jump SELF loop” will be repeated until the “next” command runs out of values.

If you want to run in parallel in each partition, this is also possible, e.g. with mpirun -np 8 and -p 4x2. For some more info on partitions see the LAMMPS manual.
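For concreteness, the launch commands described above might look like the following (assuming the LAMMPS binary is called `lmp` and the driver script above is saved as `in.driver`; both names are illustrative):

```shell
mpirun -np 4 lmp -p 4x1 -in in.driver   # 4 partitions, 1 MPI rank each
mpirun -np 8 lmp -p 4x2 -in in.driver   # 4 partitions, 2 MPI ranks each
```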

When I run it on my local machine, the script works perfectly fine. However, when I submit it to a cluster and try to run it across 40+ processes, some or all of them crash with a segmentation fault, and I’m trying to mitigate that. This happens despite using lmp.close() to free the resources (it usually occurs when the resources are allocated in the primary run).

This was just a reduced example. In my actual use case I need to actively control variables before launching a LAMMPS run, in a non-deterministic manner.

If you cannot provide me with a simple reproducer that can reproduce this on a local machine, then I cannot help you.

The use of the multiprocessing module is highly suspect. I would not expect that you can inherit MPI communicators this way.
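If each run really must be driven from Python and does not need to share the parent's communicator, one safer pattern than forking is to launch every LAMMPS run in a freshly spawned interpreter, so the child inherits no MPI state. This is only a sketch; `runner.py` is a hypothetical helper script, not something from this thread.

```python
import subprocess
import sys

def run_isolated(argv):
    """Run a command in a freshly spawned process and return its exit code.

    Unlike multiprocessing's default fork start method, a brand-new
    process inherits no MPI state from the parent; forking after
    MPI_Init is generally unsafe with transports such as PSM2.
    """
    return subprocess.run(argv).returncode

# Hypothetical usage: 'runner.py' would create its own lammps() instance,
# call lmp.file(sys.argv[1]), and lmp.close() before exiting.
# rc = run_isolated([sys.executable, "runner.py", "in.pour.drum"])
```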

My example is also extremely simplified. You can have per-partition commands using the “partition” command, and you do not have to use a universe-style variable.

Unfortunately I wasn’t able to reproduce this issue in a simple manner on a local machine.

However,

It seems as if the multiprocessing module was the source of the issue. I initially wrapped the LAMMPS object in its own child process because some simulations were crashing unpredictably with segmentation faults. However, the underlying simulations no longer do so, and the segmentation faults reported in this thread seem to stem from MPI weirdness.

I understand the use case for partitions, but since I’m actively altering inputs and observing outputs from the simulation in a Python script, I don’t think partitions will serve the same purpose in this paradigm. Thank you, however, for taking the time to make an example input script for me.

You can execute Python or other scripts/commands from a LAMMPS input with the “shell” command. You can also install the PYTHON package, which provides a “python” command that lets you register a Python script and have it executed. As mentioned before, you can run commands selectively on single partitions with the “partition” command prefix.
