Question on Error “MPI_ERR_TRUNCATE: message truncated” in MPI+OpenMP Simulation

Hi all,

I encountered the following error during a simulation of granular clumps undergoing gravity-driven free fall. The simulation was run in MPI+OpenMP hybrid mode, executed with:

export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close

mpirun -np 48 \
  --map-by ppr:12:numa:pe=2 \
  --bind-to core \
  ./lmp -sf omp -pk omp 2 -in in.AK1

Error message:

[dell7875-Precision-7875-Tower:00000] *** An error occurred in MPI_Waitany
[dell7875-Precision-7875-Tower:00000] *** reported by process [2109276161,21]
[dell7875-Precision-7875-Tower:00000] *** on communicator MPI_COMM_WORLD
[dell7875-Precision-7875-Tower:00000] *** MPI_ERR_TRUNCATE: message truncated
[dell7875-Precision-7875-Tower:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell7875-Precision-7875-Tower:00000] ***    and MPI will try to terminate your MPI job as well)

The simulation runs without issue in pure MPI mode (i.e., with no OpenMP threads).
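
For comparison, the pure-MPI launch looks roughly like the following (the rank count and mapping shown here are illustrative, not the exact command I used):

export OMP_NUM_THREADS=1

mpirun -np 96 \
  --map-by core \
  --bind-to core \
  ./lmp -in in.AK1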

I’ve attached the input script and a short animation of the simulation to help illustrate the clump setup.



I found a related post here:

but I’m not sure whether it’s directly relevant or if the issue has since been resolved.

Any ideas about the root cause of this error in the hybrid setup? Could it be related to rigid body communication, memory alignment across threads, or known issues with the rigid/small fix?

Thanks in advance for your help!

System Information

CPU:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   96
  On-line CPU(s) list:    0-95
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen Threadripper PRO 7995WX 96-Cores
    CPU family:           25
    Model:                24
    Thread(s) per core:   1
    Core(s) per socket:   96
    Socket(s):            1
    Stepping:             1
    Frequency boost:      enabled
    CPU(s) scaling MHz:   12%
    CPU max MHz:          5187.0000
    CPU min MHz:          545.0000
    BogoMIPS:             4992.50
Caches (sum of all):
  L1d:                    3 MiB (96 instances)
  L1i:                    3 MiB (96 instances)
  L2:                     96 MiB (96 instances)
  L3:                     384 MiB (12 instances)
NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):      0-7,32-39,64-71
  NUMA node1 CPU(s):      16-23,48-55,80-87
  NUMA node2 CPU(s):      24-31,56-63,88-95
  NUMA node3 CPU(s):      8-15,40-47,72-79
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Mitigation; Safe RET
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eI
                          BRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Memory Configuration (numactl --hardware):

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39 64 65 66 67 68 69 70 71
node 0 size: 128131 MB
node 0 free: 22513 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 80 81 82 83 84 85 86 87
node 1 size: 128971 MB
node 1 free: 127394 MB
node 2 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63 88 89 90 91 92 93 94 95
node 2 size: 129015 MB
node 2 free: 126950 MB
node 3 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79
node 3 size: 128968 MB
node 3 free: 125703 MB
node distances:
node   0   1   2   3
  0:  10  12  12  12
  1:  12  10  12  12
  2:  12  12  10  12
  3:  12  12  12  10

LAMMPS Build Info:

Large-scale Atomic/Molecular Massively Parallel Simulator - 4 Feb 2025 - Development
Git info (develop / patch_4Feb2025-105-gaaa81b2576)

OS: Linux "Ubuntu 24.04.2 LTS" 6.11.0-21-generic x86_64

Compiler: Clang C++ AMD Clang 17.0.6 (CLANG: AOCC_5.0.0-Build#1377 2024_09_24) with OpenMP 5.1
C++ standard: C++17
Embedded fmt library version: 10.2.0

MPI v3.1: Open MPI v5.0.6, package: Open MPI dell7875@dell7875-Precision-7875-Tower Distribution, ident: 5.0.6, repo rev: v5.0.6, Nov 15, 2024

Accelerator configuration:

OPENMP package API: OpenMP
OPENMP package precision: double
OpenMP standard: OpenMP 5.1

FFT information:

FFT precision  = double
FFT engine  = mpiFFT
FFT library = KISS

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Installed packages:

EXTRA-FIX GRANULAR MOLECULE OPENMP PYTHON RIGID VTK 

List of individual style options included in this LAMMPS executable

Script in.AK1

# 1. setup variables and calculate critical time steps
units           si

variable        fileOrigin universe in.AK1                  ## input value

jump            subIn.subCal l_variables                                                                # Call sub-script to calculate the void ratio
label           l_variableMain

timestep	    ${dt}
#timestep	      1E-6

# 2. setup simulation environments
#newton          on
newton          off
boundary        f f f
dimension       3


variable        skinD       equal 5E-4
variable        forceCutoff equal 5E-4     # Radius X 2
variable        neighCutoff equal ${forceCutoff} #+${skinD}
variable        commCutoff  equal ${skinD}/2

atom_style      hybrid sphere molecular
atom_modify     map array sort 1000 ${skinD}                                                 # Must be declared before simulation box definition

neighbor        ${neighCutoff} bin
neigh_modify    delay 0 every 20 check yes once no cluster no exclude molecule/intra all # page 10000000 one 200000

#comm_style	    brick
comm_style	    tiled
comm_modify     mode single group all vel yes cutoff ${commCutoff} 

#processors      4 4 1 numa_nodes 4
processors      * * * numa_nodes 4

# 2.1 get current date and time
python          timeString return v_strvar format s here """
def timeString():
    import datetime
    return datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
"""
variable        strvar python timeString
python          timeString invoke
variable        date0S string ${strvar} # initial time in string

python          timeFloat return v_fltvar format f here """
def timeFloat():
    import time
    return time.time()
"""

variable fltvar python timeFloat
python timeFloat invoke
variable date0F equal ${fltvar} # initial time in float

#print "Float timestamp = ${fltvar}"

variable elapsedTime equal (v_fltvar-v_date0F)

# 3. create box
variable        domainXlo   equal -1.5*${dimB}/2
variable        domainXhi   equal  1.5*${dimB}/2
variable        domainYlo   equal -1.5*${dimL}/2
variable        domainYhi   equal  1.5*${dimL}/2
variable        domainZlo   equal -1.5*${dimH}/2
variable        domainZhi   equal  4.0*${dimH}/2

region          domain_3D block ${domainXlo} ${domainXhi} ${domainYlo} ${domainYhi} ${domainZlo} ${domainZhi}
create_box      2 domain_3D      # Use 2 types of atoms

# 4. setup walls
#variable        sway 	equal 0.0
#variable        rotate 	equal 0.0

region          bottom_plate    plane 0.0 0.0 ${zminB} 0.0 0.0 1.0 side in # move v_sway NULL NULL
region          moving_plateL   plane ${xminB} 0.0 0.0  1.0 0.0 0.0 side in # move v_sway NULL NULL rotate v_rotate ${xminB} 0.0 ${zminB} 0 1 0
region          moving_plateR   plane ${xMaxB} 0.0 0.0 -1.0 0.0 0.0 side in # move v_sway NULL NULL rotate v_rotate ${xMaxB} 0.0 ${zminB} 0 1 0

fix             bottom          all wall/gran/region hertz/history ${kN} NULL ${gammaN} ${gammaT} 0.5 1 region bottom_plate #contacts
fix             mL              all wall/gran/region hertz/history ${kN} NULL ${gammaN} ${gammaT} 0.5 1 region moving_plateL #contacts
fix             mR              all wall/gran/region hertz/history ${kN} NULL ${gammaN} ${gammaT} 0.5 1 region moving_plateR #contacts

fix             yside_plate     all wall/gran hertz/history ${kN} NULL ${gammaN} ${gammaT} 0.5 1 yplane ${yminB} ${yMaxB} #contacts

# 5. Read clumps and setup pairs, EQ of motions
jump            subIn.msData l_readClumps                                                               # Call sub-script to read the clump definitions
label           l_RCMain

variable        genBoxXlo  equal ${xminB}+${MaxClumpDia}
variable        genBoxXhi  equal ${xMaxB}-${MaxClumpDia}
variable        genBoxYlo  equal ${yminB}+${MaxClumpDia}
variable        genBoxYhi  equal ${yMaxB}-${MaxClumpDia}
variable        genBoxZlo  equal 0.5*${domainZhi}
variable        genBoxZhi  equal 0.7*${domainZhi}
region          gen_area block ${genBoxXlo} ${genBoxXhi} ${genBoxYlo} ${genBoxYhi} ${genBoxZlo} ${genBoxZhi} # Set pouring Space

fix             make_clumps_1 all rigid/small molecule mol clumps_01 gravity grav_acc reinit no

compute         adjust_DOF all temp/sphere
thermo_modify   temp adjust_DOF

pair_style      gran/hertz/history ${kN} ${kT} ${gammaN} ${gammaT} 0.5 1
pair_coeff      * *

group           temp_rigid      empty
fix             grav_acc        temp_rigid gravity 9.81 vector 0.0 0.0 -1.0
fix             viscous_damping all viscous 0.0001

# 6. Setup dump
shell           if [ -d "post_1" ]; then rm -rf post_1; fi         # remove any existing post directory
shell           mkdir post_1                                       # make directory for post-processing output
dump            dump_atoms all vtk ${screenNstep} post_1/atoms*.vtk fx fy fz xu yu id type radius diameter mol x y z vx vy vz

# 7. Pouring
compute         compute_atom_vzmax all reduce min vz           # Compute the z-velocity component for all atoms and select the minimum (most negative)
compute         compute_atom_zmax all  reduce max z            # Compute the z-position for all atoms and select the maximum

variable        runStep equal 0
variable        accNinserts equal 0                               # Accumulated number of the inserted clumps
variable        runTime   equal 0

variable        nPour equal f_pour_clumps1
variable        atomVzmax equal abs(c_compute_atom_vzmax)         # Get the largest absolute vertical velocity of atoms
variable        atomZmax  equal c_compute_atom_zmax

variable        runTime   equal ${runTime}+cpu
variable        nStep     equal step
variable        stepPerf  equal v_nStep/v_elapsedTime

#fix             loadBalance all balance ${screenNstep} 0.9 shift zxy 20 1.1 out info.balancing
fix             loadBalance all balance ${screenNstep} 1.01 rcb out info.balancing
timer           full

variable        printNstep equal v_screenNstep*50

label           loopPluviation
variable        indexP loop 30
  print "                               "
  print "==============================="
  print "Pouring stage ${indexP} / 30"
  print "==============================="
  print "                               "
  fix               pour_clumps1 all pour 500 0 4767548 region gen_area mol clumps_01 molfrac 0.05 0.05 0.05 0.05 0.05 &
                                                                                              0.05 0.05 0.05 0.05 0.05 &
                                                                                              0.05 0.05 0.05 0.05 0.05 &
                                                                                              0.05 0.05 0.05 0.05 0.05 rigid make_clumps_1

  variable          runStep equal ${runStep}+${screenNstep}
  run               ${runStep} upto
  variable          accNinserts equal ${accNinserts}+${nPour}

  unfix             pour_clumps1

  print "                               "  
  print "==============================="
  print "Settling stage ${indexP} / 30"
  print "==============================="
  print "                               "

  label freeFall
  print "                               "
  print "=============================================================="  
  print "Atoms are still falling, continue freeFall, Atom_Vzmax is - $(v_atomVzmax:%8.5f) m/s"
  print "=============================================================="
  print "                               "
  fix                settling_print1 all print ${printNstep} "Start at:${date0S}, Step:${nStep}, ElapsedTime:$(v_elapsedTime:%8.0f), Steps/sec:$(v_stepPerf:%8.2f)"
  fix                settling_print2 all print ${printNstep} "      Inserted so far:${accNinserts}, Atom_vzmax:$(v_atomVzmax:%8.5f), Atom_zmax:$(v_atomZmax:%8.5f), Atom_Zthresh:$(v_atomZthreshold:%8.5f)"

  variable           runStep equal ${runStep}+${screenNstep}*100
  run                ${runStep} upto
#  unfix              settling_print 

  if "(${atomZmax} > ${atomZthreshold}) && (${atomVzmax} < 0.001)" then "jump SELF stopPluviation" &
  elif "(${atomZmax} > ${atomZthreshold}) && (${atomVzmax} > 0.001)" "jump SELF freeFall"
  
  next  indexP                                                                # Next step for Pouring
  jump  SELF loopPluviation

  label stopPluviation

variable             domainZhiR equal v_atomZmax + 5*${MaxClumpDia}
change_box           all z final ${domainZlo} ${domainZhiR}

run             100000

write_restart   restart.pour.CDSS
write_data      data.pour.CDSS

250403.zip (256.8 KB)

Can you produce an input deck that reproduces the issue but can be run with fewer MPI ranks and completes faster?

Also you should take out any commands that are not essential for triggering the issue.

I have run your input with 8 MPI tasks and 2 OpenMP threads, and it ran for 300,000 steps (while I was on a conference call) without any error. I removed the processors command and the dump vtk command.

If I cannot reproduce the issue, I cannot debug and fix it.
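
(For reference, the two commands being referred to are these lines from in.AK1; they are shown commented out here only to illustrate the kind of change, not as the exact modification made:)

#processors      * * * numa_nodes 4
#dump            dump_atoms all vtk ${screenNstep} post_1/atoms*.vtk fx fy fz xu yu id type radius diameter mol x y z vx vy vz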

Dear @akohlmey,

Apologies for the late reply, and thank you for your quick response earlier.

It took some time to investigate the issue and prepare minimal working examples.

I have removed all VTK-related dumps and significantly reduced the number of atoms composing the clumps, which improved the simulation speed substantially.

Error Behavior Based on MPI Rank

The error still occurs even with a small number of MPI ranks and OpenMP threads. Interestingly, the fewer the MPI ranks, the later the error appears in the simulation.

For example, using the in.AK5 input script:

  • With 48 MPI ranks, the error occurs at step 10,000:

export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close

mpirun -np 48 \
  --map-by ppr:12:numa:pe=2 \
  --bind-to core \
  ./lmp -sf omp -pk omp 2 -in in.AK5

[dell7875-Precision-7875-Tower:00000] *** An error occurred in MPI_Waitany
[dell7875-Precision-7875-Tower:00000] *** reported by process [1975255041,3]
[dell7875-Precision-7875-Tower:00000] *** on communicator MPI_COMM_WORLD
[dell7875-Precision-7875-Tower:00000] *** MPI_ERR_TRUNCATE: message truncated
[dell7875-Precision-7875-Tower:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell7875-Precision-7875-Tower:00000] ***    and MPI will try to terminate your MPI job as well)

  • With 8 MPI ranks, the error occurs at step 1,700,000:

export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close

mpirun -np 8 \
  --map-by ppr:2:numa:pe=2 \
  --bind-to core \
  ./lmp -sf omp -pk omp 2 -in in.AK5

[dell7875-Precision-7875-Tower:00000] *** An error occurred in MPI_Waitany
[dell7875-Precision-7875-Tower:00000] *** reported by process [1237123073,7]
[dell7875-Precision-7875-Tower:00000] *** on communicator MPI_COMM_WORLD
[dell7875-Precision-7875-Tower:00000] *** MPI_ERR_TRUNCATE: message truncated
[dell7875-Precision-7875-Tower:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell7875-Precision-7875-Tower:00000] ***    and MPI will try to terminate your MPI job as well)

AK.zip (49.0 KB)

Moreover, input scripts with more clumps (e.g., in.AK5) tend to trigger the error sooner than lighter scripts (e.g., in.AK3).

Tested Hardware

1. AMD Ryzen Threadripper PRO 7995WX (96 cores)

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   96
  On-line CPU(s) list:    0-95
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen Threadripper PRO 7995WX 96-Cores
    CPU family:           25
    Model:                24
    Thread(s) per core:   1
    Core(s) per socket:   96
    Socket(s):            1
    Stepping:             1
    Frequency boost:      enabled
    CPU(s) scaling MHz:   12%
    CPU max MHz:          5187.0000
    CPU min MHz:          545.0000
    BogoMIPS:             4992.50

2. AMD Ryzen Threadripper PRO 5975WX (32 cores)

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen Threadripper PRO 5975WX 32-Cores
    CPU family:           25
    Model:                8
    Thread(s) per core:   1
    Core(s) per socket:   32
    Socket(s):            1
    Stepping:             2
    Frequency boost:      enabled
    CPU(s) scaling MHz:   30%
    CPU max MHz:          7006.6401
    CPU min MHz:          1800.0000
    BogoMIPS:             7186.74

Both AMD CPUs experience the same error.

However, the Intel CPU has been running the same simulation (in.AK5) without any error so far:

3. Intel Xeon E5-2699 v3 (2 sockets, 36 cores total)

export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close

mpirun -np 18 \
  -genv I_MPI_PIN=1 \
  -genv I_MPI_PIN_DOMAIN=core \
  -genv I_MPI_PIN_ORDER=compact \
  -genv I_MPI_DEBUG=5 \
  ./lmp -sf omp -pk omp 2 -in in.AK5

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   36
  On-line CPU(s) list:    0-35
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
    CPU family:           6
    Model:                63
    Thread(s) per core:   1
    Core(s) per socket:   18
    Socket(s):            2
    Stepping:             2
    CPU(s) scaling MHz:   56%
    CPU max MHz:          3600.0000
    CPU min MHz:          1200.0000
    BogoMIPS:             4589.34
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
                          rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx e
                          st tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cp
                          uid_fault epb pti intel_ppin ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc d
                          therm ida arat pln pts md_clear flush_l1d
Caches (sum of all):
  L1d:                    1.1 MiB (36 instances)
  L1i:                    1.1 MiB (36 instances)
  L2:                     9 MiB (36 instances)
  L3:                     90 MiB (2 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-17
  NUMA node1 CPU(s):      18-35

Another Question on NUMA Behavior and Efficiency

I’m unsure why the NUMA behavior (the numa_miss and numa_foreign counts) differs so much between the AMD and Intel CPUs. Could it be due to the number of sockets or NUMA nodes? Or am I missing command-line options for proper NUMA-aware execution? (See the sketch after the numastat results below.)

Here are the numastat results:

For the AMD 5975WX:

NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):      0-7
  NUMA node1 CPU(s):      8-15
  NUMA node2 CPU(s):      16-23
  NUMA node3 CPU(s):      24-31

                           node0           node1           node2           node3
numa_hit               390978356       128471774       136555662       124535747
numa_miss                 954893       294766920       215564195       147681654
numa_foreign           648094686         4325154         5988072          559751
interleave_hit               105              97              98              95
local_node             390857388       128226334       136300420       124325049
other_node               1075861       295012360       215819437       147892352

For the Intel E5-2699 (2 CPUs, 2 NUMA nodes):

NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-17
  NUMA node1 CPU(s):      18-35

                           node0           node1
numa_hit                38109935        16503364
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit               459             460
local_node              38099222        16477670
other_node                 10710           25694
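
For reference, this is the kind of NUMA-aware launch I have been considering for the AMD 5975WX (a sketch only; the numactl wrapper, the rank layout, and the numastat check are my own assumptions, not a verified fix):

export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# Bind each rank to 2 cores within a NUMA node and force local memory allocation
# (4 ranks per NUMA node x 4 nodes = 16 ranks, 2 cores per rank = 32 cores)
mpirun -np 16 \
  --map-by ppr:4:numa:pe=2 \
  --bind-to core \
  --report-bindings \
  numactl --localalloc ./lmp -sf omp -pk omp 2 -in in.AK5

# Check per-process NUMA allocation while the job is running
numastat -p $(pgrep -n lmp)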

I have attached the source files. Regards,

AK.zip (49.0 KB)

Eventually, the Intel CPU gives a similar error, too:

Per MPI rank memory allocation (min/avg/max) = 27.72 | 28.62 | 29.39 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press          Volume
   1006566   8.4267576e+12  0              0              3.4885767e-07  0.0046984198   4.95e-05
   1010000   8.368072e+12   0              0              3.4642816e-07 -1.7936805      4.95e-05
   1020000   8.3377908e+12  0              0              3.4517455e-07 -1.7592752      4.95e-05
   1030000   8.3117219e+12  0              0              3.4409533e-07 -1.7320218      4.95e-05
Abort(873121806) on node 11 (rank 11 in comm 0): Fatal error in internal_Waitany: Unknown error class, error stack:
internal_Waitany(44251): MPI_Waitany(count=5, array_of_requests=0x5697950, indx=0x7ffd9921a074, status=0x1) failed
MPIR_Waitany(1185).....:
(unknown)(): Unknown error class