Hi,
I am facing an issue when running my simulation on multiple nodes.
When I run the job on my local system using
mpirun -np 40 lmp -in equi.in,
the simulation runs perfectly.
However, when I run the same input on the HPC system using
mpirun -np $((SLURM_NNODES * SLURM_NTASKS_PER_NODE)) $LMP_EXEC -in $INPUT_FILE
(80 cores across 2 nodes), I encounter problems.
I am using the following equilibration script (equi.in):
===========================================================
# EQUILIBRATION SCRIPT (equi.in)
# ===========================================================
units real
atom_style full
boundary p p p
newton on
— Force Field —
pair_style lj/charmmfsw/coul/long 10.0 12.0
pair_modify mix arithmetic
special_bonds charmm
bond_style harmonic
angle_style hybrid charmm cosine/periodic fourier
dihedral_style hybrid charmmfsw harmonic
improper_style hybrid fourier harmonic
— Communication Settings (CRITICAL FOR 4 NODES) —
Use tiled style to allow RCB load balancing
comm_style tiled
Reduced cutoff from 30.0 to 16.0 to prevent errors on high core counts
(12.0 pair cutoff + 4.0 buffer is sufficient)
comm_modify vel yes cutoff 16.0
— Read Structure —
read_data final_production.data
include combined_params_cleaned.in
— KSpace —
kspace_style pppm 1.0e-4
Now valid because we set ‘comm_style tiled’ above
fix balance all balance 1000 1.1 rcb
— Groups —
group enzyme id 1:5604
group sol id 5605:52320
group mof id 52321:59394
— Variables for Restraints —
variable K_enzyme equal 2.0
variable K_mof equal 0.5
Apply restraints (Apply these BEFORE minimization)
fix restrain_enzyme enzyme spring/self {K_enzyme} fix restrain_mof mof spring/self {K_mof}
— System Settings —
neigh_modify delay 0 every 1 check yes
neighbor 2.0 bin
===========================================================
STAGE 0: MINIMIZATION
===========================================================
thermo 100
print “>>> STARTING MINIMIZATION <<<”
minimize 1.0e-4 1.0e-6 1000 10000
print “>>> MINIMIZATION COMPLETE <<<”
reset_timestep 0
— Velocity Init —
velocity all create 10.0 54321 mom yes rot yes dist gaussian
===========================================================
STAGE 0.5: Pre-Equilibration (Soft Start)
===========================================================
Limits max distance atoms can move (0.1 Angstrom) to prevent crashes
fix soft_start all nve/limit 0.1
thermo 1000
run 10000
unfix soft_start
===========================================================
Stage A: Restrained Heating (NVT)
===========================================================
timestep 1.0
fix mom_remove all momentum 1000 linear 1 1 1
fix stage_A all nvt temp 10.0 300.0 100.0
thermo 1000
thermo_style custom step temp pe etotal press density
dump dump_A all custom 5000 dump_eq_A_heat.lammpstrj id type xu yu zu
run 200000 # 200 ps
write_restart restart.stage_A
unfix stage_A
undump dump_A
===========================================================
Stage B: Restrained Relaxation (NVT)
===========================================================
fix stage_B all nvt temp 300.0 300.0 100.0
thermo 1000
dump dump_B all custom 5000 dump_eq_B_relax.lammpstrj id type xu yu zu
run 1000000 # 1 ns
write_restart restart.stage_B
unfix stage_B
undump dump_B
===========================================================
Stage C: Release Restraints & Final Equilibration
===========================================================
print “>>> Releasing restraints for Stage C <<<”
unfix restrain_enzyme
unfix restrain_mof
fix stage_C2 all nvt temp 300.0 300.0 100.0
thermo 10000
thermo_style custom step temp pe etotal press density vol ebond eangle edihed evdwl ecoul lx ly lz
dump dump_C2 all custom 50000 dump_eq_C2_final.lammpstrj id type xu yu zu
run 1000000 # 1 ns
unfix stage_C2
undump dump_C2
===========================================================
Save Final Outputs
===========================================================
write_data equilibrated_final_ghpc.data
write_restart equilibrated_final_ghpc.restart
print “>>> FULL Staged Equilibration Finished (A+B+C) <<<”
On the HPC system, the output shows:
Force max component initial, final = 1877.7739 242.71567
Whereas on the local system, the output is:
Force max component initial, final = 1877.7739 85.027049
Eventually, the HPC run fails with a “bond missing” error, while the local run completes successfully.
I have attached a Google Drive link containing two folders:
- HPC – the run that fails with the bond missing error
- Local – the run that completes successfully
I would really appreciate any help in understanding why the behavior differs between the local and HPC runs and how to resolve this issue.
Thank you in advance for your help.