System Reboots When Running LAMMPS with >24 MPI Ranks via Python (Windows 11, i9-13900K)

When launching LAMMPS via a Python script using subprocess.Popen with mpiexec, the system hard reboots if the number of MPI ranks exceeds 24 (equal to the machine’s physical cores).

np ≤ 24 → runs normally
np > 24 → immediate system reboot

Running the same input file on a different machine with a 12-core CPU does not trigger a reboot. Running the exact same command manually in a terminal (e.g., mpiexec -np 32 lmp -in inputfile) completes successfully, suggesting that the issue may be exposed specifically when using Python + MPI.
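To narrow down the Python-vs-terminal difference, one check is to compare the environment the child process inherits in each case; a minimal sketch (the file name here is arbitrary, and the terminal side can be captured with set > terminal_env.txt):

import os

# Dump the environment a Python-spawned subprocess would inherit, for
# diffing against the interactive terminal's environment.
with open("python_env.txt", "w") as f:
    for key, value in sorted(os.environ.items()):
        f.write(f"{key}={value}\n")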

Other input files, such as standard LAMMPS benchmark cases, do not reproduce the problem. Therefore, the crash may be related to a specific section of the input script.

WhoCrashed Analysis:
Bugcheck name: KMODE_EXCEPTION_NOT_HANDLED
Description: Indicates a kernel-mode program generated an exception that the error handler did not catch.
Analysis: This can be caused by either a software or a hardware issue; hardware failures are often related to overheating. (Source: Resplendence Software - WhoCrashed, an automatic crash dump analyzer.)

Summary:
The crash appears to be triggered at the kernel level and may involve interactions between the operating system, MPI, and Python. On this machine it is reproducible only when the job is launched from Python and the number of MPI ranks exceeds the number of physical cores.

The following is the system information:

Operating System: Microsoft Windows 11 Pro, Version 10.0.26100
Manufacturer: Microsoft Corporation
System Name: DESKTOP-CAPJTIK
System Manufacturer: Gigabyte Technology Co., Ltd.
System Model: Z790 UD AX
System Type: x64-based PC
Processor: 13th Gen Intel Core i9-13900K, 3000 MHz, 24 cores, 32 logical processors
Installed RAM: 128 GB
BIOS Version/Date: American Megatrends International, LLC. F4, 2023/3/7
BIOS Mode: UEFI
Secure Boot: Off
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Time Zone: China Standard Time
Virtualization Support: Enabled (Hyper-V extensions present)
Available Physical Memory: 120 GB
Available Virtual Memory: 127 GB
Page File: C:\pagefile.sys, 8 GB

Python code:

import os
import subprocess

def run_lammps(filepath, ncpu):
    """
    Minimal reproducible script: runs LAMMPS with mpiexec.
    On my machine, ncpu > 24 causes an immediate system reboot.
    """
    # Work from the input file's directory so relative paths resolve.
    workdir = os.path.dirname(filepath)
    os.chdir(workdir)

    print(f"Running {filepath} with {ncpu} cores...")
    command = f"mpiexec -np {ncpu} lmp -in {filepath}"

    process = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    out, err = process.communicate()

    print("LAMMPS stdout:", out)
    print("LAMMPS stderr:", err)
    return process.returncode

if __name__ == "__main__":
    # Replace with the actual path to your input file
    run_lammps(r"C:\path\to\in.test", 32)

Input file:

#! bin/bash

# file header

atom_style sphere
atom_modify map array
dimension 3

boundary p p f
newton off
comm_modify vel yes
units si
region reg block 0 0.06 0 0.06 -0.04 0.126 units box
create_box 2 reg
neighbor 0.001 bin
neigh_modify every 1 delay 0

pair_style granular
pair_coeff * * hertz 36630036630.03663 7491221.8125004005 tangential mindlin 10465724751.439035 0.5 0 damping viscoelastic

region bc_t block 0 0.06 0 0.06 0.1175 0.1225 units box
lattice sc 0.005
create_atoms 1 region bc_t
group tlattice id 1:144
set group tlattice diameter 0.005 density 2500.0
velocity tlattice zero linear
velocity tlattice zero angular
fix trlc tlattice aveforce NULL NULL NULL

region bc_b block 0 0.06 0 0.06 -0.0075 -0.0025 units box
lattice sc 0.005
create_atoms 1 region bc_b
group blattice id 145:288
set group blattice diameter 0.005 density 2500.0
velocity blattice zero linear
velocity blattice zero angular
fix brlc blattice aveforce NULL NULL NULL

group plate union tlattice blattice
fix rigid_ltb plate rigid group 2 tlattice blattice force 1 off off off torque 1 off off off force 2 off off off torque 2 off off off

timestep 4.686980429165339e-08

fix gravi all gravity 9.81 vector 0.0 0.0 -1.0

region region_gouge_0 block 0 0.06 0 0.06 -0.0025 0.0000 units box
region region_gouge_1 block 0 0.06 0 0.06 0.0000 0.012 units box
region region_gouge_2 block 0 0.06 0 0.06 0.012 0.024 units box
region region_gouge_3 block 0 0.06 0 0.06 0.024 0.036000000000000004 units box
region region_gouge_4 block 0 0.06 0 0.06 0.036000000000000004 0.048 units box
region region_gouge_5 block 0 0.06 0 0.06 0.048 0.06 units box
region region_gouge_6 block 0 0.06 0 0.06 0.06 0.07200000000000001 units box
region region_gouge_7 block 0 0.06 0 0.06 0.07200000000000001 0.084 units box
region region_gouge_8 block 0 0.06 0 0.06 0.084 0.096 units box
region region_gouge_9 block 0 0.06 0 0.06 0.096 0.1105 units box

group nve_group region region_gouge_0
group nve_group region region_gouge_1
group nve_group region region_gouge_2
group nve_group region region_gouge_3
group nve_group region region_gouge_4
group nve_group region region_gouge_5
group nve_group region region_gouge_6
group nve_group region region_gouge_7
group nve_group region region_gouge_8
group nve_group region region_gouge_9

fix ins2_0 nve_group pour 479 2 14499 region region_gouge_0 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_1 nve_group pour 479 2 14499 region region_gouge_1 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_2 nve_group pour 479 2 14499 region region_gouge_2 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_3 nve_group pour 479 2 14499 region region_gouge_3 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_4 nve_group pour 479 2 14499 region region_gouge_4 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_5 nve_group pour 479 2 14499 region region_gouge_5 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_6 nve_group pour 479 2 14499 region region_gouge_6 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_7 nve_group pour 479 2 14499 region region_gouge_7 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_8 nve_group pour 479 2 14499 region region_gouge_8 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758
fix ins2_9 nve_group pour 481 2 14499 region region_gouge_9 vol 0.7 200 dens 2500 2500 diam poly 10 0.001202 0.0346411 0.001602 0.0580134 0.002002 0.1060100 0.002401 0.1406511 0.002801 0.1594324 0.003201 0.1544240 0.003601 0.1410684 0.004000 0.1060100 0.004400 0.0642738 0.004800 0.0354758

set group nve_group density 2500.0
fix integr nve_group nve/sphere

compute 1 blattice com
compute 2 tlattice com

variable thk equal c_2[3]-c_1[3]

variable vol_atom atom 4/3*3.14*radius^3
compute vol_total nve_group reduce sum v_vol_atom
variable vol_frac equal c_vol_total/(0.06*0.06*v_thk)

compute contact_atom nve_group contact/atom
compute contact_total nve_group reduce sum c_contact_atom
variable coordination equal c_contact_total/(atoms-288)

compute rot_ke nve_group erotate/sphere

thermo_style custom step atoms f_trlc[1] f_trlc[3] c_1[1] c_1[3]
thermo 1000
thermo_modify lost ignore norm no

shell mkdir post
shell mkdir restart

variable m_time equal time
variable m_atoms equal atoms

fix ave_data all ave/time 10 100 1000 v_m_time v_m_atoms f_trlc[1] f_trlc[3] f_brlc[1] f_brlc[3] c_1[1] v_thk v_vol_frac c_rot_ke file plate.txt title1 "" title2 "step time atoms sf nf sf2 nf2 lx thk vol rk"

dump dmp all cfg 1000000 post/dump*.cfg mass type xs ys zs id type x y z vx vy vz fx fy fz radius

run 1

restart 5000000 restart/restart*.bishear

run 5000000 upto

I would say it is either a bug in Windows or a hardware problem. Generally speaking, a user-mode program (which is all you should be running unless you launch something with admin privileges) should not be able to crash the whole system; otherwise it would be a denial-of-service vector (imagine working on an HPC cluster where anyone could bring the machine down by running some special program).

It is a very bad idea to run with more MPI processes than you have physical cores. In fact, even using as many MPI processes as you have logical cores (i.e. 32 in your case) leads to very inefficient performance. Most HPC machines therefore turn off SMT (aka hyper-threading).

Note that you are running a fairly complex simulation, so the problem may be as simple as LAMMPS needing too much memory with more processes, combined with whatever happens when mpiexec swaps processes in and out of cores (as it must if there are more processes than cores!).
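If you want to test the memory angle, here is a minimal sketch that samples system-wide memory use while the job runs (it assumes the third-party psutil package; the one-second interval is arbitrary):

import time
import psutil  # third-party: pip install psutil

# Sample overall memory use once per second while the LAMMPS job runs,
# to see whether the machine approaches exhaustion before it reboots.
# Stop with Ctrl-C.
while True:
    mem = psutil.virtual_memory()
    print(f"used: {mem.used / 2**30:.1f} GiB ({mem.percent}%)")
    time.sleep(1.0)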

Regardless, you should not be running more processes than cores. Not only is this bad practice in general, it is even worse on Windows 11, where I imagine the OS itself constantly consumes a significant amount of system resources. Indeed, if I recall correctly, OpenMPI will refuse to run such a request at all unless the --oversubscribe keyword is explicitly added.
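A defensive launcher can enforce that cap before calling mpiexec; a sketch assuming the third-party psutil package (os.cpu_count() reports logical processors, so it is not sufficient here):

import subprocess
import psutil  # third-party: pip install psutil

def run_lammps_capped(filepath, ncpu):
    # cpu_count(logical=False) counts physical cores (24 on this machine).
    max_ranks = psutil.cpu_count(logical=False)
    if ncpu > max_ranks:
        print(f"Capping MPI ranks from {ncpu} to {max_ranks}")
        ncpu = max_ranks
    return subprocess.run(
        ["mpiexec", "-np", str(ncpu), "lmp", "-in", filepath]
    ).returncode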
