BAD TERMINATION: Killed by Signal 9 (two or more processors in the z-direction)

Hello everyone,

I’m encountering an error when running LAMMPS in hybrid MPI + OpenMP mode, specifically when using two or more processors in the z-direction. My command line is:

mpirun -np 8 \
  ./lmp -var nX 2 -var nY 2 -var nZ 2 -var nNp 4 -var ompTh 5 -in in.pour.toyoura.CDSS

In my LAMMPS input script, I specify:

processors      ${nX} ${nY} ${nZ} numa_nodes ${nNp}
package         omp ${ompTh} neigh yes

This configuration produces the following error:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 302042 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 302043 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 302044 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 302045 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 4 PID 302046 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 5 PID 302047 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 6 PID 302048 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 302049 RUNNING AT HP-Z8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

However, if I reduce the number of processors in the z-direction to 1 (i.e., nZ = 1), the simulation runs without any issues. For example:

mpirun -np 4 \
  ./lmp -var nX 2 -var nY 2 -var nZ 1 -var nNp 4 -var ompTh 12 -in in.pour.toyoura.CDSS

runs successfully.

Below is some system information and additional details (LAMMPS is built with Intel oneAPI):

= System Information

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   1
    Core(s) per socket:   24
    Socket(s):            2
    Stepping:             7

= NUMA Configuration

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 7 8 12 13 14 18 19 20
node 0 size: 31893 MB
node 0 free: 24640 MB
node 1 cpus: 4 5 6 9 10 11 15 16 17 21 22 23
node 1 size: 32251 MB
node 1 free: 29506 MB
node 2 cpus: 24 25 26 27 31 32 33 37 38 39 43 44
node 2 size: 32208 MB
node 2 free: 29176 MB
node 3 cpus: 28 29 30 34 35 36 40 41 42 45 46 47
node 3 size: 32248 MB
node 3 free: 28476 MB
node distances:
node   0   1   2   3 
  0:  10  11  21  21 
  1:  11  10  21  21 
  2:  21  21  10  11 
  3:  21  21  11  10 

= LAMMPS Information

$ ./lmp -help
                                                         
Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Aug 2024 - Update 1
Git info (stable / stable_29Aug2024_update1)                              

Installed packages:
EXTRA-FIX GRANULAR INTEL MOLECULE OPENMP PYTHON RIGID VTK 

I have tried various grid configurations with the processors command (e.g., grid onelevel, grid numa, grid twolevel) and different values for nZ, nNp, and ompTh.
Unfortunately, whenever I set nZ to 2 (or higher), I encounter a “BAD TERMINATION” error (signal 9 or signal 11).

These fail to run:

mpirun -np 8 \
  ./lmp -var nX 2 -var nY 2 -var nZ 2 -var nNp 4 -var ompTh 1 -in in.pour.toyoura.CDSS

mpirun -np 2 \
  ./lmp -var nX 1 -var nY 1 -var nZ 2 -var nNp 2 -var ompTh 5 -in in.pour.toyoura.CDSS

processors      2 2 2 grid onelevel
processors      * * * grid numa
processors      * * * grid twolevel 8 2 2 2
processors      * * * grid twolevel 4 2 2 1

But these work well:

mpirun -np 4 \
  ./lmp -var nX 2 -var nY 2 -var nZ 1 -var nNp 4 -var ompTh 1 -in in.pour.toyoura.CDSS

mpirun -np 8 \
  ./lmp -var nX 4 -var nY 2 -var nZ 1 -var nNp 4 -var ompTh 1 -in in.pour.toyoura.CDSS

processors      2 2 1 grid onelevel # this works

Has anyone seen a similar issue or have suggestions on what might cause this error?
Could it be related to memory, domain decomposition, or a NUMA configuration issue? Any guidance or troubleshooting tips would be greatly appreciated.

Thank you very much!

This looks like a severe case of “premature optimization”, and your attempts to debug it are only making it harder to figure out what is going on.

The first step is to simplify things, i.e. remove as many settings and variables as possible. This is best done by starting from a different input deck that is known to work, e.g. one of the inputs in the “bench” folder. First try to run with MPI only, then with OpenMP threading added.
You should not use the package omp command but simply -sf omp, which implies the equivalent settings, provided you set the environment variable OMP_NUM_THREADS accordingly.
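
For example, a minimal sketch of such a test (the bench input and the rank/thread counts here are placeholders, adjust them to your machine):

export OMP_NUM_THREADS=2
mpirun -np 4 ./lmp -sf omp -in bench/in.lj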

Your report is also missing the information about how far into the input LAMMPS progressed and which command specifically caused the failure. The error you show comes from the operating system and from mpirun terminating the run, most likely because there was an error in your input.
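
For example, adding -echo screen to the command line you already posted makes LAMMPS print every input command as it is read, so the last command echoed before the crash is the one that failed:

mpirun -np 8 \
  ./lmp -echo screen -var nX 2 -var nY 2 -var nZ 2 -var nNp 4 -var ompTh 1 -in in.pour.toyoura.CDSS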

Before looking at the “processors” command, and more importantly at the “numa_nodes” keyword, just try some simple configurations, e.g. vary only one dimension and set the other two to “*” so that LAMMPS can adapt. Please keep in mind that when using the processors command you are imposing restrictions on the domain decomposition, so the product of px, py, and pz must be exactly the total number of MPI processes. If you use “processors * * 2”, the requirement is that you have an even number of MPI processes.
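
A minimal sketch of such a test (the rank counts are only examples):

processors * * 2    # fix only the z split, let LAMMPS pick x and y

With this setting, running with mpirun -np 8 (or any other even number of ranks) satisfies the constraint, while -np 9 would not.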

But let’s take one more step back. Why do you want to use a domain decomposition that is different from the default?

Signal 9 is often due to the Linux Out-of-Memory (OOM) killer; you could be running out of RAM.
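
A quick way to check (a sketch; reading the kernel log may need root on some systems) is to look for OOM messages after a failed run and to watch memory usage while it runs:

dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'
free -h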

Thank you for the advice.

I was able to identify the source of the error by creating a Minimal Working Example (MWE) and systematically varying the parameters. It turns out the issue was related to the cutoff distance.

Specifically, when I set two or more processors in the z-direction, using a communication cutoff of less than about 18% of the domain length in the z-direction causes a “BAD TERMINATION” error. For example:

comm_modify      mode single vel yes                # fails (BAD TERMINATION)
comm_modify      mode single vel yes cutoff 0.008   # fails (BAD TERMINATION)
comm_modify      mode single vel yes cutoff 0.009   # works

Unfortunately, I’m not sure why the cutoff distance is so crucial or why it seems to have this threshold.
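
Purely as an illustration of that threshold with the numbers above (taking the ~18% figure and the 0.009 value at face value):

0.009 / 0.18 ≈ 0.05

i.e. the z extent of this test domain is on the order of 0.05 in my units.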

Here is the setting I used in the script to divide the simulation domain:

processors       * * 2 numa_nodes 4

I run LAMMPS (built with Intel oneAPI) using:

mpirun -np 8 \
  -genv I_MPI_PIN 1 \
  -genv I_MPI_PIN_DOMAIN numa \
  -genv I_MPI_PIN_ORDER spread \
  -genv I_MPI_PERHOST 2 \
  -genv I_MPI_DEBUG 5 \
  ./lmp -in MWE_Q.in

I use this specific domain decomposition setup to improve multicore CPU usage and to take advantage of NUMA (NPS4). I have attached the test script for reference.

Regards,
MWE_Q.in (1.9 KB)
m1.data (1.8 KB)
m2.data (879 Bytes)

Try running after applying the following one-line change and recompiling.
It should also work without this change when running without the /omp suffix, since the problem is only in the OPENMP version of the fix: the loop iterates over nbody instead of nlocal_body, i.e. beyond the rigid bodies stored on each MPI rank.

  diff --git a/src/OPENMP/fix_rigid_small_omp.cpp b/src/OPENMP/fix_rigid_small_omp.cpp
  index 59fd274f95..3eac85c40a 100644
  --- a/src/OPENMP/fix_rigid_small_omp.cpp
  +++ b/src/OPENMP/fix_rigid_small_omp.cpp
  @@ -229,7 +229,7 @@ void FixRigidSmallOMP::compute_forces_and_torques()
   #if defined(_OPENMP)
   #pragma omp parallel for LMP_DEFAULT_NONE schedule(static)
   #endif
  -    for (int ibody = 0; ibody < nbody; ibody++) {
  +    for (int ibody = 0; ibody < nlocal_body; ibody++) {
         double * _noalias const fcm = body[ibody].fcm;
         const double mass = body[ibody].mass;
         fcm[0] += gvec[0]*mass;

Do you have any evidence that this is actually causing performance issues?
Or, asked the other way around: how much faster does your calculation become compared to not using threads and not using any NUMA settings?

In general, MPI parallelization is faster than multi-threading in LAMMPS. Once you have reached the limit of scaling with MPI, adding threads also adds overhead; specifically for models with short cutoffs, like granular models, the potential benefit from threading is extremely small and the risk of adding more overhead than speed gain is high.

Specifically for systems like yours, there may be more benefit in improving the domain decomposition and avoiding load imbalances (see the sketch below). Also, I assume that your production calculation is significantly larger than the posted examples; for 300 atoms, there is next to no gain from parallelization in the first place.
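
For a pouring/settling geometry, something along these lines might be worth testing (only a sketch: the thresholds, the rebalancing interval, and the choice of the z dimension are placeholders, not tuned values):

balance         1.1 shift z 10 1.05                        # one-time rebalance of subdomain boundaries along z
fix             bal all balance 1000 1.1 shift z 10 1.05   # periodic rebalancing every 1000 steps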

Thank you for your fix.

I am currently simulating the pluviation (free fall) of sand particles. Because these particles have high natural frequencies, they require extremely small integration timesteps. The sand particles are modeled using a multi-sphere approach, with more than 50,000 multi-sphere particles, and the total number of constituent spheres is even higher.

I previously ran this simulation for about a year using LIGGGHTS, which is based on LAMMPS but specialized for granular simulations. However, I reached performance limits because LIGGGHTS does not provide acceleration packages (e.g., GPU, OpenMP) and only supports MPI; in addition, active support for LIGGGHTS ceased around six or seven years ago. Consequently, I switched to LAMMPS with hybrid MPI/OpenMP and observed a significant performance improvement.

I am currently testing the MPI-only performance by varying the number of MPI ranks to analyze potential communication bottlenecks.

Thank you again for your advice.

Regards,