Have you read the manual: Yes.
LAMMPS experience: No, started recently.
LAMMPS version: 20230208
LAMMPS execution command: mpirun -np xxx lmp -in in.tip4p
Computer science experience: Novice level.
Can you provide your input scripts and results: Yes, shown below.
The hardware spec you run LAMMPS on: Compute nodes, each with 2x Intel 28-core Xeon Gold 6330 and 256 GB RAM.
Where and how are you running LAMMPS: On my institution's computational cluster; LAMMPS is pre-installed there, supposedly with parallel support; I submit the execution command (shown above) via a SLURM script.
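For completeness, here is a minimal sketch of the SLURM script I submit (job name, resource numbers, and module name are placeholders; only the -np value changes between the tests described below):

#!/bin/bash
#SBATCH --job-name=tip4p
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --time=24:00:00

# the actual module name is site-specific; "lammps" is a placeholder
module load lammps
mpirun -np xxx lmp -in in.tip4p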
Dear All:
First, I want to express my great appreciation to the people who helped me in the previous thread about the large energy fluctuation in my TIP4P water simulation, which initially prevented me from getting reasonable thermodynamic results. The scientific issues are now resolved. If you are interested in the topic, please see:
Huge Etotal difference between TIP4p implicit VS explicit methods from the manual script
With srtee's and akohlmey's suggestions, both the implicit and the explicit methods now run successfully, and the C_V obtained from both is reasonable.
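(For reference, and in case my post-processing matters here: I obtain C_V from the standard NVT total-energy fluctuation formula, which is part of my own analysis rather than anything in the scripts below.)

$$ C_V = \frac{\langle E^2 \rangle - \langle E \rangle^2}{k_B T^2} $$

Below are the revised input files I used: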
For implicit:
units real
atom_style full
region box block 0 18.6824 0 18.6824 0 18.6824
create_box 2 box bond/types 1 angle/types 1 &
extra/bond/per/atom 2 extra/angle/per/atom 1 extra/special/per/atom 2
mass 1 15.9994
mass 2 1.008
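# arguments: O type, H type, bond type, angle type, O-M distance (0.15 A), LJ/Coulomb cutoff (8 A)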
pair_style lj/cut/tip4p/long 1 2 1 1 0.15 8.0
pair_coeff 1 1 0.1550 3.1536
pair_coeff 2 2 0.0 1.0
kspace_style pppm/tip4p 1e-4
bond_style zero
bond_coeff 1 0.9574
angle_style zero
angle_coeff 1 104.52
molecule water tip3p.mol # this uses the TIP3P geometry
create_atoms 0 random 216 34564 NULL mol water 25367 overlap 1.33
# must change charges for TIP4P
set type 1 charge -1.040
set type 2 charge 0.520
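# SHAKE keeps each water rigid by constraining the O-H bond length and H-O-H angle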
fix rigid all shake 0.001 10 0 b 1 a 1
minimize 0.0 0.0 1000 10000
reset_timestep 0
timestep 1.0
velocity all create 300.0 5463576
fix integrate all nvt temp 300 300 500.0
thermo_style custom step time temp press etotal pe
thermo 1000
run 2000000
write_data tip4p-implicit.data nocoeff
For explicit:
units real
atom_style charge
atom_modify map array
region box block 0 18.6824 0 18.6824 0 18.6824
create_box 3 box
mass 1 15.9994
mass 2 1.008
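# type 3 is the M charge site; LAMMPS requires a positive mass, hence the tiny value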
mass 3 1.0e-100
pair_style lj/cut/coul/long 8.0
pair_coeff 1 1 0.1550 3.1536
pair_coeff 2 2 0.0 1.0
pair_coeff 3 3 0.0 1.0
kspace_style pppm 1.0e-4
fix mol all property/atom mol
molecule water tip4p.mol
create_atoms 0 random 216 34564 NULL mol water 25367 overlap 1.33
timestep 1
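# integrate each water molecule as a rigid body (explicit rigidity, no SHAKE needed)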
fix integrate all rigid/nvt/small molecule temp 300.0 300.0 500.0
velocity all create 300.0 5463576
thermo_style custom step time temp press etotal density pe ke
thermo 1000
run 2000000
write_data tip4p-explicit.data nocoeff
The molecule files read by the two scripts (tip3p.mol and tip4p.mol) can be found in the LAMMPS manual, so I omit them here.
Both scripts work from the scientific perspective, and the results are reasonable.
However, the execution with MPI now starts to show problems:
- For implicit, regardless of how many processes (-np xxx) I assign, the run completes without issue.
- For explicit, the run only succeeds with -np 8, -np 4, or fewer than 4 processes. Whenever -np is set to any number above 8, or to 6, the execution is guaranteed to fail. I have tried multiple -np values, and it is only a matter of time, i.e., of how many steps complete, before the failure. The error messages are as follows:
[node033:1319617] *** An error occurred in MPI_Wait
[node033:1319617] *** reported by process [xxxxx,x]
[node033:1319617] *** on communicator MPI_COMM_WORLD
[node033:1319617] *** MPI_ERR_TRUNCATE: message truncated
[node033:1319617] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node033:1319617] *** and potentially your MPI job)
[warn] Epoll MOD(1) on fd yyyyy failed. Old events were 6; read change was 0 (none); write change was 2 (del); close change was 0 (none): Bad file descriptor
The process number [xxxxx,x] in the error message varies, and so does the node index. I assume these are not useful, so I omit a comparison of them; but if you believe such information is useful, I am happy to provide it later.
- With -np 4, the run completes normally.
- With -np 6, it completes ~19000 steps before failure; the fd yyyyy in the error message is fd 22.
- With -np 8, the run completes normally.
- With -np 16, it completes <1000 steps before failure; the fd yyyyy in the error message is fd 30.
- With -np 28, it completes <1000 steps before failure; the fd yyyyy in the error message is fd 32.
- With -np 32, it completes <1000 steps before failure; the fd yyyyy in the error message is fd 78.
I do understand that one can simply submit more jobs to offset the slow execution caused by low -np counts. However, I would still greatly appreciate any comments or suggestions on the following questions so I can try to resolve the current issue:
- Is there any bad setup in my input files that could cause such MPI issues?
- Is this solely caused by a bad/improper installation of LAMMPS on our cluster? If yes, what possible installation issues can you think of?
- If the cause of the issue cannot be precisely identified, is there any good debugging approach or workaround I can try to further diagnose it? (Some commands I could run are listed below.)
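To help with the last question, these are commands I can run on the cluster and whose output I can post on request (the last one assumes our MPI is Open MPI, as the format of the error messages suggests):

lmp -h | head -n 30      # LAMMPS version, compiler settings, installed packages
mpirun --version         # MPI flavor and version
ompi_info | head -n 20   # Open MPI build details, if Open MPI is indeed used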
Thank you in advance!
Sincerely,
Hanbo