Issue with code hanging when running in parallel

I am running a LAMMPS simulation on an HPC cluster. The simulation appears to get stuck with no error message. I found this in the LAMMPS documentation (11.1 Common problems):

“In parallel, one way LAMMPS can hang is due to how different MPI implementations handle buffering of messages. If the code hangs without an error message, it may be that you need to specify an MPI setting or two (usually via an environment variable) to enable buffering or boost the sizes of messages that can be buffered.”

https://docs.lammps.org/Errors_common.html
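Presumably this refers to something like the Open MPI "eager limit" MCA parameters, which can be set through environment variables before calling mpirun. A minimal sketch of what I have in mind (the parameter choice and the values are guesses on my part, and in.my_input stands in for my actual input file):

    # raise the eager/buffering limits for the shared-memory and TCP transports (values are guesses)
    export OMPI_MCA_btl_vader_eager_limit=1048576
    export OMPI_MCA_btl_tcp_eager_limit=1048576
    mpirun -np 42 lmp -in in.my_input

but I could not confirm whether these are the right parameters, or whether the values make any sense.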

I saw a few other similar posts here from several years ago, and no one gave a solution (if they answered at all). I have scoured the MPI documentation and forums and could not find which environment variables to set. Anyone have any ideas?

System info:

HPE server with two 64-core AMD EPYC Rome processors
512 GB of RAM
OS is RHEL 8
LAMMPS version 29Sep2021
OpenMPI version 4.0.5

Thanks in advance to anyone who can help!

Shane,

It is not trivial to determine the cause of the “hang”. There are multiple possible reasons and most likely it is not the one that you might think.

The fact that you are using a rather old version of LAMMPS doesn’t make it easier; more recent versions of LAMMPS have more built-in methods for testing and also have known bugs fixed.

But the biggest problem is that you do not provide a simple way to reproduce the hang.
That makes it impossible to determine the cause and provide a suggestion.

As Axel said, we need more information:

  • Does it give more output (e.g. an error message) when using unbuffered output (see: How to debug lammps input script that generates no log or output files, and the sketch after this list)?
  • Does LAMMPS run for a while (e.g. thousands of timesteps), give output as expected, but then hang on a certain timestep?
  • Or does it hang on startup, with little or no output from LAMMPS at the beginning?
  • Does it also hang when running on only a single MPI rank instead of multiple ranks?
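As a minimal sketch of the unbuffered test (lmp and in.your_input are placeholders for your binary and input file; the -nonbuf flag is available in recent LAMMPS versions, otherwise stdbuf from coreutils does the same job):

    # turn off output buffering so the last lines printed before the hang become visible
    mpirun -np 42 lmp -nonbuf -echo screen -in in.your_input
    # equivalent with stdbuf, if your LAMMPS binary does not have -nonbuf
    mpirun -np 42 stdbuf -oL -eL lmp -echo screen -in in.your_input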

Thank you so much for your quick responses, and I apologize for my slow reply. Based on your suggestions, I tried some troubleshooting.

First, I installed version 29Aug2024 and ran the simulation again with identical results (i.e. the program hangs).

Second, I ran three separate simulations: one using MPI with 42 ranks, one using MPI with a single rank, and one run serially without invoking mpirun at all.

The 42-rank simulation stalled out at step 0. The single-rank MPI run and the serial run (no MPI) ran as expected.
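Concretely, the three runs were launched roughly as follows (in.my_input stands in for the actual input file, which is in the attached zip):

    mpirun -np 42 lmp -in in.my_input    # hangs at step 0
    mpirun -np 1 lmp -in in.my_input     # runs to completion
    lmp -in in.my_input                  # serial, no mpirun; runs to completion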

Attached is a zip folder containing the following:

  • All files needed to run the simulation
  • Log files parallel.out, one_rank.out, and serial.out, which show the output of the three simulations
  • The sbatch scripts I am using to launch the simulations with the Slurm scheduler on our cluster, labeled parallel.sbatch, one_rank.sbatch, and serial.sbatch

Thank you so much for your time and effort.

P.S. I got an error saying new users cannot upload files, so here is a link to a OneDrive download of the zip file.

Download Here

@shesprich I have no problem running your input with either the stable version or the current development tree. I tried 42, 64, and 128 CPUs; none of the runs had any problem or showed any indication of issues.

Strange. I wonder if there is an issue with how it was built or a compatibility issue with the MPI we have. What version/flavor of MPI are you using?

From lmp -h:

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Aug 2024 - Development
Git info (collected-small-changes / patch_29Aug2024-590-g6a46fb034d-modified)

[...]

OS: Linux "Fedora Linux 40 (Forty)" 6.10.11-200.fc40.x86_64 x86_64

Compiler: GNU C++ 14.2.1 20240912 (Red Hat 14.2.1-3) with OpenMP 4.5
C++ standard: C++17
MPI v4.0: MPICH Version:      4.1.2
MPICH Release date: Wed Jun  7 15:22:45 CDT 2023
MPICH ABI:          15:1:3

and

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Aug 2024 - Maintenance
Git info (maintenance / stable_29Aug2024_update1-5-g884961f267)

[...]

OS: Linux "Fedora Linux 40 (Forty)" 6.10.11-200.fc40.x86_64 x86_64

Compiler: GNU C++ 14.2.1 20240912 (Red Hat 14.2.1-3) with OpenMP 4.5
C++ standard: C++11
MPI v4.0: MPICH Version:      4.1.2
MPICH Release date: Wed Jun  7 15:22:45 CDT 2023
MPICH ABI:          15:1:3

No problem with OpenMPI either:

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Aug 2024 - Development
Git info (collected-small-changes / patch_29Aug2024-591-ge6118412b1)

[...]

OS: Linux "Fedora Linux 40 (Forty)" 6.10.11-200.fc40.x86_64 x86_64

Compiler: GNU C++ 14.2.1 20240912 (Red Hat 14.2.1-3) with OpenMP 4.5
C++ standard: C++17
MPI v3.1: Open MPI v5.0.2, package: Open MPI mockbuild@02dc1f9e2ab145fdb212b01bdd462369 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024

Hmm. Maybe something went wrong with my build of either OpenMPI or LAMMPS. Let me try rebuilding both and running again. Are there any special flags or packages I need to enable to get it to work properly?
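For reference, my rebuild will follow roughly the standard CMake route against the system Open MPI; the package selection below is only my guess at what the input needs:

    # from the top of the LAMMPS source tree; enable -D PKG_<NAME>=on for whatever packages the input requires
    mkdir build && cd build
    cmake ../cmake -D BUILD_MPI=on -D PKG_MOLECULE=on -D PKG_KSPACE=on
    cmake --build . -j 16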

I am just using packages that are provided by the corresponding Linux distribution.

At this point, I would actually first look into your HPC environment for a possible hardware glitch, a small misconfiguration, or something similar.
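Two quick checks that would help separate a problem in the MPI/cluster setup from a problem in LAMMPS itself (paths below are placeholders):

    # does a trivial MPI job even start on 42 ranks?
    mpirun -np 42 hostname
    # does a stock LAMMPS example from the distribution also hang on 42 ranks?
    cd /path/to/lammps/examples/melt
    mpirun -np 42 lmp -in in.melt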
