Mpi_run cannot be conducted with more than -np 4

Hanbo_Hong · December 14, 2023, 12:55am

Have you read the manual: Yes.
Lammps experience: No. Started recently
Lammps version: 20230208
Lammps execution command: mpirun -np xxx lmp -in in.tip4p
Computer Science experience: novice level.
Can you provide your input scripts and results: Yes. Shown below.
The hardware spec you run Lammps on: Compute nodes each with 2x Intel 28-Core Xeon Gold 6330, and 256 GB RAM
Where, and how are you running Lammps: The computational cluster from my institution; The Lammps is pre-installed there supposedly with parallel support; I submit the Lammps execution command (shown above) with SLURM script.

Dear All:

Firstly, I want to express my great appreciation to the people who helped me in the previous thread about the big energy fluctuation for TIP4p water simulation, which initially made me unable to get reasonable thermodynamic results. The scientific issues are successfully resolved now. If you are interested in the topic, please see:
Huge Etotal difference between TIP4p implicit VS explicit methods from the manual script

With Srtee and Akohlmey’s suggestions, both the implicit and the explicit methods can be conducted successfully, and the C_V from both of them are reasonable. Below are the revised input files I used:

For implicit:

units real
atom_style full

region box block 0 18.6824 0 18.6824 0 18.6824

create_box 2 box bond/types 1 angle/types 1 &
            extra/bond/per/atom 2 extra/angle/per/atom 1 extra/special/per/atom 2

mass 1 15.9994
mass 2 1.008

pair_style lj/cut/tip4p/long 1 2 1 1 0.15 8.0
pair_coeff 1 1 0.1550 3.1536
pair_coeff 2 2 0.0    1.0

kspace_style pppm/tip4p 1e-4

bond_style zero
bond_coeff 1 0.9574

angle_style zero
angle_coeff 1 104.52

molecule water tip3p.mol  # this uses the TIP3P geometry

create_atoms 0 random 216 34564 NULL mol water 25367 overlap 1.33

# must change charges for TIP4P
set type 1 charge -1.040
set type 2 charge  0.520

fix rigid all shake 0.001 10 0 b 1 a 1
minimize 0.0 0.0 1000 10000

reset_timestep 0
timestep 1.0
velocity all create 300.0 5463576

fix integrate all nvt temp 300 300 500.0

thermo_style custom step time temp press etotal pe

thermo 1000

run 2000000
write_data tip4p-implicit.data nocoeff

For explicit:

units real
atom_style charge
atom_modify map array

region box block 0 18.6824 0 18.6824 0 18.6824
create_box 3 box

mass 1 15.9994
mass 2 1.008
mass 3 1.0e-100

pair_style lj/cut/coul/long 8.0
pair_coeff 1 1 0.1550 3.1536
pair_coeff 2 2 0.0    1.0
pair_coeff 3 3 0.0    1.0

kspace_style pppm 1.0e-4

fix mol all property/atom mol
molecule water tip4p.mol


create_atoms 0 random 216 34564 NULL mol water 25367 overlap 1.33

timestep 1

fix integrate all rigid/nvt/small molecule temp 300.0 300.0 500.0
velocity all create 300.0 5463576

thermo_style custom step time temp press etotal density pe ke

thermo 1000

run 2000000
write_data tip4p-explicit.data nocoeff

The molecular file read by both scripts can be found in the Lammps manual and I will omit them here.

Both of the scripts works from the scientific perspective, and the results are resonable.

However, it seems the execution with MPI now starts to show problems:

For implicit, regardless of how many -np xxx I assigned, it can be completed without issue.
For explicit, however, only when -np 8, or, -np 4, or less than 4, can it run successfully. As long as -np is assigned with any number more than 8, or, if it is 6, the execution is guaranteed to fail. I have tried multiple possible -np numbers, and find it is just a matter of time how many steps it can do before the failure. The error messages are provided as below:

[node033:1319617] *** An error occurred in MPI_Wait
[node033:1319617] *** reported by process [xxxxx,x]     
[node033:1319617] *** on communicator MPI_COMM_WORLD
[node033:1319617] *** MPI_ERR_TRUNCATE: message truncated
[node033:1319617] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node033:1319617] ***    and potentially your MPI job)
[warn] Epoll MOD(1) on fd yyyyy failed. Old events were 6; read change was 0 (none); write change was 2 (del); close change was 0 (none): Bad file descriptor

The process number [xxxxx,x] in the error message varies. The node index also varies. I anticipate they are not something useful so I will omit the comparison for it. But if you believe such info is useful, I am very happy to replenish them later.

If -np 4, it can complete normally;
If -np 6, it can complete ~19000 steps before failure, the fd yyyyy in the error message is fd 22;
If -np 8, it can complete normally.
If -np 16, it can complete <1000 steps before failure, the fd yyyyy in the error message is fd 30;
If -np 28, it can complete <1000 steps before failure, the fd yyyyy in the error message is fd 32;
If -np 32, it can complete <1000 steps before failure, the fd yyyyy in the error message is fd 78;

I do understand that one can simply submit more jobs to counter-attack the slow execution caused by low np numbers. However, I would still greatly appreciate if anyone could provide some comments or suggestions about the following question so I can try to resolve the current issue:

Is there any bad setup in my input file to cause such MPI issues?
Is this solely caused by the bad/improper installation of Lammps on our cluster? If yes, what possible installation issues can you think of?
If the cause of the issue cannot be precisely identified, is there any good debugging or solution I can try to further diagnose the issue?

Thank you in advance!

Sincerely,
Hanbo

akohlmey · December 14, 2023, 3:57am

This is from the MPI library and thus not very useful information. What is needed is the error message before the MPI error output.

Fix rigid is far more sensitive to the choice of timestep than simulations with fix shake. Especially water molecules with its near linear geometry may need a small timestep to accurately do time integration of the rotation around the H-O-H “axis”. Typically these issues manifest in atoms getting too close, potential energy getting higher and then causing larger acceleration of atoms leading to lost atoms, lost body atom type of errors. These are more likely to happen the more subdomains you have and the smaller your system is. Sometimes this can also be avoided by increasing the communication cutoff.

No. The only installation issue that I could think of would be miscompilation due to a compiler bug and those are very rare. The compiler that most often has lead to miscompilation (at high optimization level) is the (classic) Intel compiler.

It is currently not possible to identify the exact cause, because you didn’t reprot the entire error message. So I have to guess. As for debugging, if you can reliably pinpoint the step number when this issue happens, you can break the run down to multiple segments and increase the output frequency just before it crashes, e.g. use

run 1000 post no
thermo 10
run 500 pre no

instead of

run 1500

Too large a timestep usually manifests itself is sudden “jumps” of the potential energy.

Other than that one would just use ridiculously conservative settings and gradually make them more aggressive until an optimal performance is achieved without crashes. Other parameters outside the timestep to look at are: communication cutoff, neighbor list skin, interaction cutoffs and system topology.

Bottom line, if you want to do TIP4P you are almost always better off using custom TIP4P styles.

Hanbo_Hong · December 14, 2023, 8:14am

Thanks a lot for the timely response again! I might respond to your prompts in multiple threads for clarity.

This is from the MPI library and thus not very useful information. What is needed is the error message before the MPI error output.

I cannot find any other error message before the MPI error messages as shown above. However, I might be wrong due to not being able to get the most detailed output. The process of finding all the error message are stated as below:

So far, the only two output/log files I can get from Lammps are:

The regular log.lammps, where the context is empty for my failed executions, or say, the context were still staying in the .qlog file before it can be “transferred” to it due to the failure.
A .qlog file I requested the node to output for me in the SLURM submission script.

In the .qlog file, all the info I can get is the following:

LAMMPS (8 Feb 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Created orthogonal box = (0 0 0) to (18.6824 18.6824 18.6824)
  1 by 2 by 3 MPI processor grid
WARNING: Fix property/atom mol or charge or rmass w/out ghost communication (src/src/fix_property_atom.cpp:173)
Read molecule template water:
  1 molecules
  0 fragments
  4 atoms with max type 3
  0 bonds with max type 0
  0 angles with max type 0
  0 dihedrals with max type 0
  0 impropers with max type 0
Created 864 atoms
  using lattice units in orthogonal box = (0 0 0) to (18.6824 18.6824 18.6824)
  create_atoms CPU = 0.009 seconds
  create bodies CPU = 0.000 seconds
  216 rigid bodies with 864 atoms
  0.87347849 = max distance from body owner to body atom
WARNING: Cannot count rigid body degrees-of-freedom before bodies are fully initialized (src/src/RIGID/fix_rigid_small.cpp:1137)
PPPM initialization ...
  using 12-bit tables for long-range coulomb (src/src/kspace.cpp:342)
  G vector (1/distance) = 0.35216833
  grid = 15 15 15
  stencil order = 5
  estimated absolute RMS force accuracy = 0.017506412
  estimated relative force accuracy = 5.2720041e-05
  using double precision FFTW3
  3d grid and FFT values/proc = 3696 675
Generated 3 of 3 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 10
  ghost atom cutoff = 10
  binsize = 5, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/coul/long, perpetual
      attributes: half, newton on
      pair build: half/bin/atomonly/newton
      stencil: half/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : real
  Current step  : 0
  Time step     : 1
Per MPI rank memory allocation (min/avg/max) = 9.991 | 10.02 | 10.06 Mbytes
   Step          Time           Temp          Press          TotEng        Density         PotEng         KinEng    
         0   0              289.87151      95158.368     -72716.827      0.99094311    -73089.233      372.40638    
      1000   1000           876.76747      14205.09      -76030.019      0.99094311    -77156.427      1126.4087    
     ...
     more or less steps might survive depending on the aforementioned cases
     ...
     18000   18000          327.34777      1753.5975     -77718.558      0.99094311    -78139.112      420.55322    
     19000   19000          300.62006     -644.36526     -77714.257      0.99094311    -78100.473      386.21535    
[node033:1319617] *** An error occurred in MPI_Wait
[node033:1319617] *** reported by process [4059299841,2]
[node033:1319617] *** on communicator MPI_COMM_WORLD
[node033:1319617] *** MPI_ERR_TRUNCATE: message truncated
[node033:1319617] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node033:1319617] ***    and potentially your MPI job)
[warn] Epoll MOD(1) on fd 22 failed. Old events were 6; read change was 0 (none); write change was 2 (del); close change was 0 (none): Bad file descriptor

I have already diff the output from the successful execution, -np 4, VS the failed executions, e.g. -np -6 as above.

Before it starts to output the thermo, all the info, including the warnings, are extremely similar, if not exactly identical. The only differences spotted by the diff command are:

x by x by x MPI processor grid;
create-atoms CPU = xxx seconds
3d grid and FFT values/proc = xxxx xxx
Per MPI rank memory allocation (min/avg/max) = xxx ...

After it starts to output the thermo, the successful execution will complete as expected, followed by the performance summary. The failed execution will immediately prompt me with the above error messages, and these are exactly what I reported in my post.

So, unless I missed anything, I am afraid I really do not have any other error messages other than those I already reported in the original post.

srtee · December 14, 2023, 10:10am

Please try using atom_style full, instead of using atom_style charge and “tacking on” the molecule numbers with fix property/atom mol, and let us know how it goes. I leave it as an exercise what else you need to change or not (hint: it will look very similar to your “implicit” script when you’re done).

~~Alternatively, please try adding ghost yes to your fix property/atom without changing anything else and see if that works.~~

~~@akohlmey, I think this might be an example of when a fix (in this case fix rigid) needs to check when fix property/atom has ghost communication on for a custom property…~~

UPDATE (see below): I can confirm that this doesn’t remove the crash on my machine.

Hanbo_Hong · December 14, 2023, 10:24am

Thank you for pointing it out. I did see the example script provided in the manual uses 0.5 fs timestep for fix rigid, and it also stated that fix rigid requires a “small(er)” timestep. It can be due to my wrong perception that SHAKE should had worked very similar to, if not exactly the same, as fix rigid, and thus I mistakenly thought the 1 fs would still be perfectly fine for fix rigid.

I will now try fix rigid with no larger than 0.5 fs of timestep and see if the issue still persists.

For the increase of communication cutoff, after reading the manual, I found the corresponding command to be:

comm_modify keyword value ...

Is there any tips to evaluate or make educated guess about how much larger should the communication cutoff be increased to?

hothello · December 14, 2023, 5:23pm

I think your problem is related to the vectors of your simulation box

I wouldn’t use an MPI grid of more than 2 in any direction, as you will create very narrow domains containing a few molecules each. Interestingly, your calculation runs just fine on 8 processors but fails with a 1 by 2 by 3 grid.

srtee · December 14, 2023, 8:36pm

Previous post body

FYI - I can’t replicate this on my machine, using Intel compilers, Intel MPI and Intel MKL, and the latest develop. That is, I can’t get it to crash either on 6 or 16 processors, even after some choices to destabilize the simulation such as increasing temperature to 450 K or using processors 1 1 8. The simulation looks stable energy-wise with your settings of timestep and thermostat. I am compiling the patch_8Feb2023 version and will see what happens.

It may also help us if you run lmp_mpi -help (your LAMMPS executable may have a different name) and report the compilation flags that were used (or if you compiled it yourself, you should be able to tell us).

UPDATE: Using patch_8Feb2023 on my machine, I can confirm that I see crashes at -np 8 and above (tested so far: 10, 12, 16). Using ghost yes in fix property/atom, or atom_style full, does not change this.

Using patch_21Nov2023, the crashes do not reappear. I see a new warning before LAMMPS starts:

MPI startup(): I_MPI_WAITMODE is unsupported for shm:ofi fabrics please specify I_MPI_FABRICS=ofi or I_MPI_FABRICS=shm
[repeats n times]
                                                                                                                                                                            LAMMPS (21 Nov 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
...

Whatever bug this was was probably fixed in between.

Compile info:

OS: Linux "Ubuntu 22.04.3 LTS" 5.15.133.1-microsoft-standard-WSL2 x86_64
Compiler: GNU C++ 11.4.0 with OpenMP 4.5                                                                                                                                    
C++ standard: C++11                                                                                                                                                         
MPI v3.1: Intel(R) MPI Library 2021.10 for Linux* OS

Active compile time flags:
-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

akohlmey · December 14, 2023, 9:10pm

There was a bug fix after 8Feb2023:

github.com/lammps/lammps

bug fix for fix rigid nh/npt small error in MPI message sizes

lammps:develop ← lammps:drude-mpi-wait-error

opened 05:34PM - 23 Mar 23 UTC

sjplimp

+28 -20

**Summary** This fixes a bug in fix_rigid_nh.cpp where the count of the comm_…forward() values used in initial time integration did not match the number of values packed/unpacked in fix_rigid_small.cpp. Because it was too small, this could sometimes produce buffer overflows, leading to MPI errors when running a simulation. **Related Issue(s)** Fixes #3313 for a Drude model with fix rigid npt/small **Author(s)** Steve **Licensing** By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1). **Backward Compatibility** N/A **Post Submission Checklist** - [ ] The feature or features in this pull request is complete - [ ] Licensing information is complete - [ ] Corresponding author information is complete - [ ] The source code follows the LAMMPS formatting guidelines - [ ] Suitable new documentation files and/or updates to the existing docs are included - [ ] The added/updated documentation is integrated and tested with the documentation build system - [ ] The feature has been verified to work with the conventional build system - [ ] The feature has been verified to work with the CMake based build system - [ ] Suitable tests have been added to the unittest tree. - [ ] A package specific README file has been included or updated - [ ] One or more example input decks are included **Further Information, Files, and Links**

Hanbo_Hong · December 15, 2023, 2:48am

Thank you so much for trying the version I have reported!

I tried your second solution in the original thread, which is crossed out now. (I tried the first one about atom_style full as well but due to the lack of experience and knowledge, I cannot confirm my modification is correct). So far, it seems the issue persists on my end too, just as you mentioned. I also tried @akohlmey instruction about decreasing the time step, which doesn’t help either.

I anticipate @hothello made sense in the response, which might be similar to akohlmey’s instruction about increasing the communication cutoff. I just simply increased my system to 500 molecules while maintaining the same density, and now it can run with at least -np 28 without a problem.

All in all, thank you for more precisely identifying the issue! Now, I think either way works for me: for my project, I do need the size to be 500 molecules, which will presumably work well on the “more buggy” 20230208 version. I will also try installing the latest 20231121 on my account so it can simply avoid the issue at the first place.

akohlmey · December 15, 2023, 2:56am

Please note that a simulation that does not crash is not automatically correct.
The bugfix indicates that LAMMPS was not communicating all data that needs to be communicated, so some internal data structures will not be updated correctly when atoms migrate between subdomains.
This will lead to a crash only in extreme cases. Otherwise you will just get slightly wrong data.

Bottom line, there is no alternative to updating to either the latest stable version (we just posted the second update a few minutes ago) or the latest feature release.

srtee · December 15, 2023, 3:21am

Please perform a strong scaling analysis before you do too much - that is, check the simulation speeds when run on fewer cores, and multiply by the number of cores to get the simulation time available per core-hour. I would guess that running a simulation of 2000 particles over 28 cores will be very inefficient, and you will save lots of time if you plan your simulations so you can run four at once over 6 cores each or three at once over 8 cores each.

Hanbo_Hong · December 16, 2023, 5:38am

Thank you for the comment.

Hanbo_Hong · December 16, 2023, 5:51am

Thank you for the comment and I get your point. This is what I will do after I have passed the initial trial stage. Soon I will need to run tens of different (temperature, density) combinations for my water and try many different water models. I will definitely conduct this benchmark before doing large batches of job submissions.

srtee · December 16, 2023, 10:00am

I hope you’re up to date with literature - this seems to have been done a lot already over the past ten years, such as https://pubs.acs.org/doi/full/10.1021/acs.jctc.3c00562. All the best!