I am running a LAMMPS simulation on an HPC cluster. The simulation appears to get stuck with no error message. I found this in the LAMMPS documentation under 11.1 Common problems:
“In parallel, one way LAMMPS can hang is due to how different MPI implementations handle buffering of messages. If the code hangs without an error message, it may be that you need to specify an MPI setting or two (usually via an environment variable) to enable buffering or boost the sizes of messages that can be buffered.”
I saw a few similar posts here from several years ago, and no one gave any solutions (if they answered at all). I have scoured the MPI documentation and forums and could not find which environment variables to set. Anyone have any ideas?
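For context, this is the general shape of what I was hoping to find. OpenMPI exposes its MCA parameters as OMPI_MCA_* environment variables, so presumably the fix would look something like the lines below. The specific parameters shown (the eager limits and the vader single-copy setting) are just guesses on my part, not anything I found recommended for LAMMPS:

    # guesses at OpenMPI 4.x knobs related to message buffering; not verified fixes
    export OMPI_MCA_btl_vader_single_copy_mechanism=none   # disable single-copy in the shared-memory (vader) BTL
    export OMPI_MCA_btl_vader_eager_limit=65536            # raise the eager-send threshold for on-node messages
    export OMPI_MCA_btl_tcp_eager_limit=65536              # same for the TCP BTL, if it is in use

    # the full list of tunables can be inspected with
    ompi_info --all | grep -i eager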
System info:
HPE server with two 64-core AMD EPYC Rome processors
512 GB of RAM
OS is RHEL 8
LAMMPS version 29Sep2021
OpenMPI version 4.0.5
It is not trivial to determine the cause of the “hang”. There are multiple possible reasons, and most likely it is not the one that you might think.
The fact that you are using a rather old version of LAMMPS doesn’t make it easier: more recent versions of LAMMPS have more built-in methods for testing and also have known bugs fixed.
But the biggest problem is that you do not provide a simple way to reproduce the hang.
That makes it impossible to determine the cause and provide a suggestion.
Thank you so much for your quick responses, and I apologize for my slow reply. Based on your suggestions, I did some troubleshooting.
First, I installed version 29Aug2024 and ran the simulation again with identical results (i.e., the program still hangs).
Second, I ran three separate simulations: one using MPI with 42 ranks, one using MPI with 1 rank, and one run serially without invoking mpirun at all.
The 42-rank simulation stalled at step 0. The one-rank MPI run and the serial run (no MPI) ran as expected.
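For clarity, the three runs were launched roughly like this; lmp and in.simulation stand in here for the actual executable and input file names, and the exact command lines are in the sbatch scripts mentioned below:

    # 42 MPI ranks - this is the run that hangs at step 0
    mpirun -np 42 lmp -in in.simulation > parallel.out

    # 1 MPI rank - runs to completion
    mpirun -np 1 lmp -in in.simulation > one_rank.out

    # serial, no mpirun at all - runs to completion
    lmp -in in.simulation > serial.out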
Attached is a zip folder containing the following:
All files needed to run the simulation
Log files parallel.out, one_rank.out, and serial.out, which show the output of all three simulations.
Sbatch scripts that I use to launch the simulations with the Slurm workload manager on our cluster. These are labeled parallel.sbatch, one_rank.sbatch, and serial.sbatch; a stripped-down sketch of the parallel script follows this list for reference.
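For anyone who does not want to download the zip, the parallel script is essentially a standard Slurm batch file along the lines below; the job name, resource requests, and module names here are placeholders rather than the actual values from our cluster:

    #!/bin/bash
    #SBATCH --job-name=lammps_parallel
    #SBATCH --nodes=1
    #SBATCH --ntasks=42
    #SBATCH --time=02:00:00

    # site-specific environment setup (module names are placeholders)
    module load openmpi lammps

    mpirun -np 42 lmp -in in.simulation > parallel.out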
Thank you so much for your time and effort.
P.S. I got an error saying new users cannot upload files, so here is a link to a OneDrive download of the zip file.
@shesprich I have no problem running your input with either the stable version or the current development tree. I tried 42, 64, and 128 CPUs, and there was no problem with any of them and no indication of any issues.
Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Aug 2024 - Development
Git info (collected-small-changes / patch_29Aug2024-591-ge6118412b1)
[...]
OS: Linux "Fedora Linux 40 (Forty)" 6.10.11-200.fc40.x86_64 x86_64
Compiler: GNU C++ 14.2.1 20240912 (Red Hat 14.2.1-3) with OpenMP 4.5
C++ standard: C++17
MPI v3.1: Open MPI v5.0.2, package: Open MPI mockbuild@02dc1f9e2ab145fdb212b01bdd462369 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024
Hmm. Maybe something went wrong with my build of either OpenMPI or LAMMPS. Let me try rebuilding both and running the test again. Are there any special flags or packages I need to enable to get it to work properly?
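In case it matters for checking my work, my rebuild plan is roughly the following; the install prefix, core counts, and package selections are placeholders, and I am not sure whether any of the extra configure options are actually required:

    # rebuild OpenMPI with Slurm/PMI support so mpirun and srun cooperate
    ./configure --prefix=$HOME/opt/openmpi --with-slurm --with-pmi
    make -j 16 && make install

    # rebuild LAMMPS against that MPI using CMake (run from the top of the LAMMPS source tree);
    # the PKG_* selections depend on which styles the input deck uses
    mkdir build && cd build
    cmake ../cmake -D BUILD_MPI=yes -D PKG_MOLECULE=yes -D PKG_KSPACE=yes
    cmake --build . -j 16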