Runs perfectly with mpirun -np 40 but errors out when run on 3 nodes with 40 cores each

Hi,

I am facing an issue when running my simulation on multiple nodes.

When I run the job on my local system using
mpirun -np 40 lmp -in equi.in,
the simulation runs perfectly.

However, when I run the same input on the HPC system using
mpirun -np $((SLURM_NNODES * SLURM_NTASKS_PER_NODE)) $LMP_EXEC -in $INPUT_FILE
(80 cores across 2 nodes), I encounter problems.

I am using the following equilibration script (equi.in):

# ===========================================================
# EQUILIBRATION SCRIPT (equi.in)
# ===========================================================

units real
atom_style full
boundary p p p
newton on

# --- Force Field ---

pair_style lj/charmmfsw/coul/long 10.0 12.0
pair_modify mix arithmetic
special_bonds charmm
bond_style harmonic
angle_style hybrid charmm cosine/periodic fourier
dihedral_style hybrid charmmfsw harmonic
improper_style hybrid fourier harmonic

# --- Communication Settings (CRITICAL FOR 4 NODES) ---

# Use tiled style to allow RCB load balancing
comm_style tiled

# Reduced cutoff from 30.0 to 16.0 to prevent errors on high core counts
# (12.0 pair cutoff + 4.0 buffer is sufficient)
comm_modify vel yes cutoff 16.0

# --- Read Structure ---

read_data final_production.data
include combined_params_cleaned.in

# --- KSpace ---

kspace_style pppm 1.0e-4

# Now valid because we set 'comm_style tiled' above
fix balance all balance 1000 1.1 rcb

# --- Groups ---

group enzyme id 1:5604
group sol id 5605:52320
group mof id 52321:59394

# --- Variables for Restraints ---

variable K_enzyme equal 2.0
variable K_mof equal 0.5

# Apply restraints (apply these BEFORE minimization)
fix restrain_enzyme enzyme spring/self ${K_enzyme}
fix restrain_mof mof spring/self ${K_mof}

# --- System Settings ---

neigh_modify delay 0 every 1 check yes
neighbor 2.0 bin

# ===========================================================
# STAGE 0: MINIMIZATION
# ===========================================================

thermo 100
print ">>> STARTING MINIMIZATION <<<"
minimize 1.0e-4 1.0e-6 1000 10000
print ">>> MINIMIZATION COMPLETE <<<"

reset_timestep 0

# --- Velocity Init ---

velocity all create 10.0 54321 mom yes rot yes dist gaussian

# ===========================================================
# STAGE 0.5: Pre-Equilibration (Soft Start)
# ===========================================================

# Limit the max distance atoms can move per step (0.1 Angstrom) to prevent crashes

fix soft_start all nve/limit 0.1
thermo 1000
run 10000
unfix soft_start

# ===========================================================
# Stage A: Restrained Heating (NVT)
# ===========================================================

timestep 1.0
fix mom_remove all momentum 1000 linear 1 1 1

fix stage_A all nvt temp 10.0 300.0 100.0
thermo 1000
thermo_style custom step temp pe etotal press density
dump dump_A all custom 5000 dump_eq_A_heat.lammpstrj id type xu yu zu

run 200000 # 200 ps

write_restart restart.stage_A
unfix stage_A
undump dump_A

# ===========================================================
# Stage B: Restrained Relaxation (NVT)
# ===========================================================

fix stage_B all nvt temp 300.0 300.0 100.0
thermo 1000
dump dump_B all custom 5000 dump_eq_B_relax.lammpstrj id type xu yu zu

run 1000000 # 1 ns

write_restart restart.stage_B
unfix stage_B
undump dump_B

# ===========================================================
# Stage C: Release Restraints & Final Equilibration
# ===========================================================

print ">>> Releasing restraints for Stage C <<<"
unfix restrain_enzyme
unfix restrain_mof

fix stage_C2 all nvt temp 300.0 300.0 100.0

thermo 10000
thermo_style custom step temp pe etotal press density vol ebond eangle edihed evdwl ecoul lx ly lz

dump dump_C2 all custom 50000 dump_eq_C2_final.lammpstrj id type xu yu zu

run 1000000 # 1 ns

unfix stage_C2
undump dump_C2

# ===========================================================
# Save Final Outputs
# ===========================================================

write_data equilibrated_final_ghpc.data
write_restart equilibrated_final_ghpc.restart

print ">>> FULL Staged Equilibration Finished (A+B+C) <<<"

On the HPC system, the output shows:

Force max component initial, final = 1877.7739 242.71567

Whereas on the local system, the output is:

Force max component initial, final = 1877.7739 85.027049

Eventually, the HPC run fails with a “bond missing” error, while the local run completes successfully.

I have attached a Google Drive link containing two folders:

  1. HPC – the run that fails with the bond missing error
  2. Local – the run that completes successfully

I would really appreciate any help in understanding why the behavior differs between the local and HPC runs and how to resolve this issue.

Thank you in advance for your help.

I don't think the cause is a local versus remote issue. LAMMPS doesn't know or care about that; it just sees MPI tasks. The problem is more likely the total number of MPI processes, which reduces the size of the subdomains and thus makes any issue from atoms moving too fast and passing through a subdomain within a single step more likely. It will probably happen even faster when using even more nodes, and the behavior should match the local run if you request just 2 nodes with 20 processors per node.
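To illustrate how quickly subdomains shrink (the 100 Angstrom cubic box here is a made-up placeholder, not taken from the actual system): for a cube of edge L decomposed over P ranks, each subdomain edge is roughly L / P^(1/3):

```shell
# Per-rank subdomain edge for a hypothetical 100 Angstrom cubic box,
# at 40 ranks (the working local run) versus 120 ranks (3 nodes x 40 cores)
for P in 40 120; do
  awk -v P="$P" 'BEGIN { printf "P=%d edge=%.1f Angstrom\n", P, 100 / (P ^ (1.0/3.0)) }'
done
# P=40  -> edge ~29.2 Angstrom
# P=120 -> edge ~20.3 Angstrom
```

Once the subdomain edge approaches the communication cutoff (16.0 here), a fast-moving atom can cross a whole subdomain in one step and its bonds can no longer be found.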

The possible reasons and remedies for "lost atoms" or "bond missing" errors have been discussed widely in the forum, so I suggest having a look there. What is particularly worrisome is that you seem to require an enlarged communication cutoff. That is usually an indication of a force field or topology issue, and those are laborious to debug remotely. My recommendation is to simplify your input by removing anything that is not essential (e.g. load balancing and the tiled domain decomposition), then try to identify which bonds are causing the problems, monitor their lengths, and visualize the trajectory in detail.
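To monitor bond lengths, something along these lines could be added to the input before the run commands (a sketch; the compute/dump IDs, the filename, and the output intervals are placeholders):

```
# Per-bond distance for every bond in the system
compute blen all bond/local dist

# Report the longest bond with the thermo output to spot overstretching early
compute bmax all reduce max c_blen
thermo_style custom step temp pe c_bmax

# Periodically write all bond lengths for post-mortem analysis
dump blen_dump all local 1000 bond_lengths.dump index c_blen
```

A bond length that keeps growing between snapshots usually points straight at the atoms with a force field or topology problem.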

P.S.: This appears to be similar to, if not the same as, the system we discussed before. It seems you are ignoring the advice given then and are not using fix shake, even though it is required to correctly implement the CHARMM force field. You are not likely to get much more help in the future if you keep making this kind of mistake and ignoring the advice given about it.

Also, expecting a meaningful trajectory with a 1 fs timestep and without fix shake is pure folly. This confirms my guess that the cause of your problem is related to your force field. To use fix shake with minimization reliably (a better approach than using fix nve/limit anyway), you must upgrade your LAMMPS version: an important bugfix affecting the reliability of fix shake during minimization came after either of the LAMMPS versions you are using (you need at least update 1 of the 22 July 2025 version or the 10 September 2025 version; there have been further bugfix and feature releases since, with additional corrections and improvements).
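For reference, a common way to apply SHAKE to all bonds involving hydrogen is to select by mass (a sketch only; whether mass 1.0 matches your data file, and whether angle constraints for waters are also needed, depends on your topology):

```
# Constrain the lengths of all bonds involving hydrogen (selected by mass 1.0):
# tolerance 1.0e-4, up to 20 iterations, no constraint statistics output (0)
fix rigidH all shake 1.0e-4 20 0 m 1.0
```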

My experience with SLURM-based HPC clusters is that srun works much better than mpirun, because it correctly carries the SLURM environment over to MPI (caveat: likely conditional on having competent cluster admins), so use

srun $LMP_EXEC -in $INPUT_FILE

and make sure you are requesting resources properly with sbatch (--nodes, --ntasks-per-node, --mem, …).
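A minimal job script along these lines might serve as a starting point (the resource numbers mirror the 2 nodes x 40 cores mentioned above; the job name is a placeholder, and site-specific settings such as partition and memory are deliberately left out):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --job-name=lmp-equi

# srun picks up the node/task layout from the SBATCH directives above,
# so no -np argument is needed
srun lmp -in equi.in
```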