I am running a LAMMPS simulation on an HPC cluster. The simulation appears to get stuck with no error message. I found this in the LAMMPS documentation under 11.1 Common problems:
“In parallel, one way LAMMPS can hang is due to how different MPI implementations handle buffering of messages. If the code hangs without an error message, it may be that you need to specify an MPI setting or two (usually via an environment variable) to enable buffering or boost the sizes of messages that can be buffered.”
I saw a few similar posts here from several years ago, and no one gave any solutions (if they answered at all). I have scoured the MPI documentation and forums and could not find which environment variables to set. Anyone have any ideas?
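For context, this is the general shape of what I was hoping to find. OpenMPI exposes its MCA parameters as OMPI_MCA_* environment variables, so presumably the fix would look something like the lines below. The specific parameters shown (the eager limits and the vader single-copy setting) are just guesses on my part, not anything I found recommended for LAMMPS:

    # guesses at OpenMPI 4.x knobs related to message buffering; not verified fixes
    export OMPI_MCA_btl_vader_single_copy_mechanism=none   # disable single-copy in the shared-memory (vader) BTL
    export OMPI_MCA_btl_vader_eager_limit=65536            # raise the eager-send threshold for on-node messages
    export OMPI_MCA_btl_tcp_eager_limit=65536              # same for the TCP BTL, if it is in use

    # the full list of tunables can be inspected with
    ompi_info --all | grep -i eager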
System info:
HPE server with two 64-core AMD EPYC Rome processors
512 GB of RAM
OS is RHEL 8
LAMMPS version 29Sep2021
OpenMPI version 4.0.5
It is not trivial to determine the cause of the “hang”. There are multiple possible reasons, and most likely it is not the one that you might think.
The fact that you are using a rather old version of LAMMPS doesn’t make it easier: more recent versions of LAMMPS have more built-in methods for testing and also have known bugs fixed.
But the biggest problem is that you do not provide a simple way to reproduce the hang.
That makes it impossible to determine the cause and provide a suggestion.
Thank you so much for your quick responses, and I apologize for my slow reply. Based on your suggestions, I did some troubleshooting.
First, I installed version 29Aug2024 and ran the simulation again with identical results (i.e., the program still hangs).
Second, I ran three separate simulations: one using MPI with 42 ranks, one using MPI with 1 rank, and one run serially without invoking mpirun at all.
The 42-rank simulation stalled at step 0. The one-rank MPI run and the serial run (no MPI) ran as expected.
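For clarity, the three runs were launched roughly like this; lmp and in.simulation stand in here for the actual executable and input file names, and the exact command lines are in the sbatch scripts mentioned below:

    # 42 MPI ranks - this is the run that hangs at step 0
    mpirun -np 42 lmp -in in.simulation > parallel.out

    # 1 MPI rank - runs to completion
    mpirun -np 1 lmp -in in.simulation > one_rank.out

    # serial, no mpirun at all - runs to completion
    lmp -in in.simulation > serial.out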
Attached is a zip folder containing the following:
All files needed to run the simulation
Log files parallel.out, one_rank.out, and serial.out, which show the output of all three simulations.
Sbatch scripts that I use to launch the simulations with the Slurm workload manager on our cluster. These are labeled parallel.sbatch, one_rank.sbatch, and serial.sbatch; a stripped-down sketch of the parallel script follows this list for reference.
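For anyone who does not want to download the zip, the parallel script is essentially a standard Slurm batch file along the lines below; the job name, resource requests, and module names here are placeholders rather than the actual values from our cluster:

    #!/bin/bash
    #SBATCH --job-name=lammps_parallel
    #SBATCH --nodes=1
    #SBATCH --ntasks=42
    #SBATCH --time=02:00:00

    # site-specific environment setup (module names are placeholders)
    module load openmpi lammps

    mpirun -np 42 lmp -in in.simulation > parallel.out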
Thank you so much for your time and effort.
P.S. I got an error saying new users cannot upload files, so here is a link to a OneDrive download of the zip file.
@shesprich I have no problem running your input with either the stable version or the current development tree. I tried 42, 64, and 128 CPUs, and there was no problem with any of them and no indication of any issues.
Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Aug 2024 - Development
Git info (collected-small-changes / patch_29Aug2024-591-ge6118412b1)
[...]
OS: Linux "Fedora Linux 40 (Forty)" 6.10.11-200.fc40.x86_64 x86_64
Compiler: GNU C++ 14.2.1 20240912 (Red Hat 14.2.1-3) with OpenMP 4.5
C++ standard: C++17
MPI v3.1: Open MPI v5.0.2, package: Open MPI mockbuild@02dc1f9e2ab145fdb212b01bdd462369 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024
Hmm. Maybe something went wrong with my build of either OpenMPI or LAMMPS. Let me try rebuilding both and running the test again. Are there any special flags or packages I need to enable to get it to work properly?
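In case it matters for checking my work, my rebuild plan is roughly the following; the install prefix, core counts, and package selections are placeholders, and I am not sure whether any of the extra configure options are actually required:

    # rebuild OpenMPI with Slurm/PMI support so mpirun and srun cooperate
    ./configure --prefix=$HOME/opt/openmpi --with-slurm --with-pmi
    make -j 16 && make install

    # rebuild LAMMPS against that MPI using CMake (run from the top of the LAMMPS source tree);
    # the PKG_* selections depend on which styles the input deck uses
    mkdir build && cd build
    cmake ../cmake -D BUILD_MPI=yes -D PKG_MOLECULE=yes -D PKG_KSPACE=yes
    cmake --build . -j 16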