Exit code 7 error on long simulations

Hi,

I am using LAMMPS to run large coarse-grained molecular dynamics simulations of cells on an HPC cluster (typically 1 node of 28 cores; unfortunately I don’t see much speedup running on more nodes due to overheads). Due to the scale of the systems and the number of timesteps that need to be performed, these simulations can require days if not weeks to complete. However, I have a consistent issue where all my simulations eventually end with an “EXIT CODE: 7” error after some length of time. The length of time the simulations manage to run varies between systems, but is fairly consistent when re-running the same system. I had assumed the issue was on the network side of my HPC cluster, but after speaking with the admins they are confident it’s an issue with the code being run. Here is an example of how the error eventually presents itself:

Setting up Verlet run …
Unit style : lj
Current step : 7625000
Time step : 0.02
Per MPI rank memory allocation (min/avg/max) = 22.27 | 28.91 | 35.08 Mbytes
Step Temp Press v_mass
7625000 0.22519325 0.03103656 0.91186293
7625200 0.22498806 0.031055463 0.91186293
7625400 0.22565916 0.030970625 0.91186293
7625600 0.22530917 0.03103847 0.91186293
7625800 0.22549436 0.030974444 0.91186293
7626000 0.22497009 0.031009229 0.91186293
7626200 0.2253697 0.030979694 0.91186293
7626400 0.22521208 0.031062362 0.91186293
7626600 0.22561673 0.031007716 0.91186293
7626800 0.22515772 0.031020282 0.91186293
7627000 0.22559942 0.031046368 0.91186293
7627200 0.2257485 0.03098984 0.91186293

paul,

there is not enough information here to identify the cause and thus
suggest how to resolve the situation.
there are basically two issues: 1) identifying why you are seeing bad
parallel scaling, and 2) finding out what is triggering the memory
issue (i.e. the unexpected abort).

some questions:
- what LAMMPS version are you using? what platform are you running on?
what compiler/toolchain do you use? what hardware are you running on?
- is this a version of LAMMPS that you compiled yourself or that
somebody else compiled? can you provide the output of the LAMMPS
executable run with the -help flag?
- does that bus error happen instantly or after a long walltime? does
it happen always at the same timestep number? do all jobs terminate or
only some?
- does the bus error also happen when you break down your simulation
into smaller chunks and use restarts to continue? (see the sketch below
these questions)
- can you provide your input file and the log file from a (short)
simulation that completes?
- what kind of (strong) parallel scaling do you see?
- is your system homogeneous/dense or sparse with voids?
- how did you determine that multi-node calculations are not scaling
well? did you see where the time is lost?
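
as a minimal sketch of the chunked-restart approach mentioned above
(assuming the "lammps" python module from your own build; file names
and the chunk length are placeholders), each job would read the restart
written by the previous one, run a shorter stretch, and write a new
restart for the next job:

    from lammps import lammps

    lmp = lammps()
    # read_restart restores the box, atoms, and most styles; fixes and
    # (for some pair styles) coefficients must be re-issued afterwards
    lmp.command("read_restart chunk_prev.restart")
    lmp.file("in.fixes")        # placeholder: re-define fixes, thermo, etc.
    lmp.command("run 500000")
    lmp.command("write_restart chunk_next.restart")
    lmp.close()

if the error still shows up with short, chained jobs like this, that
would suggest it is not a simple accumulation effect within one long
process.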

Axel.

Hi Axel,

1. lammps-12Dec18. Running on Linux, with my LAMMPS scripts using the Python interface (Python 3.6.5)
2. GNU g++ compiler. Running on HPC with "Each of the 525 Lenovo nx360 m5 compute nodes has two 14 core 2.4 GHz Intel E5-2680 v4 (Broadwell) CPUs, and 128 GiB of RAM."
3. I compiled it myself but quite a long time ago so can't fully remember the process. I've attached the -help output.
4. The bus error happens after a long wall time. If I run two simulations of the same system (with tiny randomised differences), they will crash at a similar, but not identical, timestep. All simulations seem to eventually hit this error if left long enough (i.e. if I left something to run for an infinite number of timesteps, the error would eventually occur)
5. I currently start simulations from a restart file after "evolving" my cell system to a stable state. Unfortunately I can’t seem to split up the subsequent experiment simulations any further, as without first re-equilibrating the system it instantly crashes (normally with a non-numeric pressure or lost bond/atoms error)
6. I've attached an example python input script and output log file for a (in my case very short) simulation
7. The system scales well when increasing the number of CPUs available on a single node, but very poorly when increasing the number of nodes. I've attached a graph of my speed-up testing. The graph shows the total wall time against cell size (think of D^M as just the total number of particles in the system) for 1-4 nodes of 28 cores. As you can see, at the smaller cell sizes there's next to no speedup (it actually slows down in some cases), and the speedup only increases marginally at the larger sizes.
8/9. It’s a (roughly) homogeneous system, for which I use "fix balance all balance 5000 1.0 shift xyz 20 1.0 weight group 2 water 0.5 bilayer 6.0" to rebalance the processor subdomains accordingly. I believe the issue with scaling comes from the custom pair-potential file I use. Testing on boxes of pure Lennard-Jones water (i.e. without the need for this potential), the system seems to scale much better.

Thanks,
Paul

help.txt (18.6 KB)

slurm.rbctube_D50AA9m12_N55978W1_@…9330… (166 KB)

rbctube_D50AA9m12_N55978W1_@…9331… (33.3 KB)

time.png

Hi Axel,

1. lammps-12Dec18. Running on Linux, with my LAMMPS scripts using the Python interface (Python 3.6.5)

you may consider updating to a more recent LAMMPS version, although
the chances that whatever is causing your problem has already been
addressed are limited. a newer version would also include more
information useful for debugging in the help message.

2. GNU g++ compiler. Running on HPC with "Each of the 525 Lenovo nx360 m5 compute nodes has two 14 core 2.4 GHz Intel E5-2680 v4 (Broadwell) CPUs, and 128 GiB of RAM."

what would be interesting to know is the interconnect hardware that
carries the communication between those nodes. that would have a
significant impact on the multi-node performance.

3. I compiled it myself but quite a long time ago so can't fully remember the process. I've attached the -help output.
4. The bus error happens after a long wall time. If I run two simulations of the same system (with tiny randomised differences), they will crash at a similar, but not identical, timestep. All simulations seem to eventually hit this error if left long enough (i.e. if I left something to run for an infinite number of timesteps, the error would eventually occur)

this hints at some kind of memory leak or memory corruption issue
caused by the migration of atoms between MPI ranks.
unfortunately, the way you run your simulations through the python
interface makes it *extremely* difficult to track down any issues.
this is further complicated by the use of a custom pair style. we run
tests on the code that is included with the LAMMPS distribution to
minimize the risk of memory leaks and related errors, but that is - of
course - impossible for external code. we are also limited by the code
coverage of the tests we have. increasing that coverage systematically
(we are currently at about 25%) is an ongoing goal, but it can take
many years to get near completion, and for some parts of the code it
will be next to impossible.

if you can create input decks that run without the python interface,
you should check whether you can reproduce the bus error (or something
similar) with them. that would make it easier to debug things. then it
would be imperative to determine whether the issue is tied to the
custom pair style or can be reproduced without it. based on those
findings, it would be easier to narrow down where to look to identify
and fix the memory issue.
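
as a minimal sketch (assuming the "lammps" python module; the file
names are placeholders, and depending on the custom pair style you may
have to re-issue pair_coeff settings after reading the restart), you
could use python once to convert your equilibrated restart into a plain
data file, which a standalone, python-free input deck can then load
with read_data:

    from lammps import lammps

    lmp = lammps()
    # restore the equilibrated state your production runs start from
    lmp.command("read_restart equilibrated.restart")
    # re-issue pair_coeff/bond_coeff here if the restart does not
    # restore them for the custom style
    # write a portable text data file; "nocoeff" skips the coefficient
    # sections in case the custom pair style cannot write them
    lmp.command("write_data standalone.data nocoeff")
    lmp.close()

if a plain input deck that starts from read_data still triggers the
error, the python layer can be ruled out as the cause.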
recent versions of the GNU and Clang compilers support memory-checking
instrumentation with low performance impact (unlike valgrind's memcheck
tool) via the -fsanitize=address flag and linking with the
corresponding runtime library. compiling LAMMPS with a compatible
compiler and debug info and then running the thus instrumented LAMMPS
binary might help to identify memory issues and provide some
information about their location.
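
one complication with your setup is that the instrumented liblammps is
loaded as a shared library into a regular (uninstrumented) python
executable, in which case the AddressSanitizer runtime usually has to
be preloaded. here is a minimal sketch of one way to handle that
(assuming linux, gcc, and a placeholder input script name):

    import os, subprocess, sys

    # the ASan runtime must be loaded before anything else, so re-exec
    # the python interpreter with LD_PRELOAD set if it is not already
    if "LD_PRELOAD" not in os.environ:
        asan = subprocess.check_output(
            ["gcc", "-print-file-name=libasan.so"],
            universal_newlines=True).strip()
        env = dict(os.environ, LD_PRELOAD=asan)
        os.execve(sys.executable, [sys.executable] + sys.argv, env)

    from lammps import lammps   # now picks up the instrumented library

    lmp = lammps()
    lmp.file("in.cells")        # placeholder for your usual driver script
    lmp.close()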

another thing you can do to monitor memory usage during a run is to
periodically execute the "info memory" command. it gives you a measure
of how much memory is being managed by the malloc library on MPI rank
0. this may fluctuate and take some time to plateau, but it should not
keep increasing much once a plateau is reached.
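
a minimal sketch (again assuming the "lammps" python module; the input
file name and chunk length are placeholders) of splitting the run into
chunks and issuing "info memory" between them, so any steady growth
becomes visible:

    from lammps import lammps

    lmp = lammps()
    lmp.file("in.cells")              # placeholder for your setup script

    chunk = 100000                    # timesteps per chunk
    for i in range(100):
        lmp.command("run %d" % chunk)
        lmp.command("info memory")    # report the current memory usage
    lmp.close()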

5. I currently start simulations from a restart file after "evolving" my cell system to a stable state. Unfortunately I can’t seem to split up the subsequent experiment simulations any further, as without first re-equilibrating the system it instantly crashes (normally with a non-numeric pressure or lost bond/atoms error)

that would indicate that something is not restarting correctly. the
most suspicious piece would be the custom code that is not part of
LAMMPS.

6. I've attached an example python input script and output log file for a (in my case very short) simulation

sadly this is missing the post-run information that would be helpful
to identify where in the LAMMPS code the time is spent and what degree
of load imbalance, if any, you are confronted with.
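
to get that information, the final run in the python script should
keep the default "post yes" setting and the LAMMPS instance should be
closed cleanly, e.g. (a minimal sketch with a placeholder input file
name):

    from lammps import lammps

    lmp = lammps()
    lmp.file("in.cells")               # placeholder setup script
    # with the default "post yes", the Pair/Neigh/Comm/... timing
    # breakdown is printed at the end of the run
    lmp.command("run 100000 post yes")
    lmp.close()                        # also prints the total wall time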

the srun errors at the end are also a bit worrisome. they point toward
some issue or incompatibility with the batch system.

7. The system scales well when increasing the number of CPUs available on a single node, but very poorly when increasing the number of nodes. I've attached a graph of my speed-up testing. The graph shows the total wall time against cell size (think of D^M as just the total number of particles in the system) for 1-4 nodes of 28 cores. As you can see, at the smaller cell sizes there's next to no speedup (it actually slows down in some cases), and the speedup only increases marginally at the larger sizes.

scaling performance is mostly independent of whether you use a single
node or multiple nodes, unless the interconnect between the nodes is
not of a low-latency, high-bandwidth kind (e.g. plain ethernet rather
than a dedicated HPC interconnect), or the MPI library has not been
properly compiled/installed/configured to take advantage of that
hardware.

8/9. It’s a (roughly) homogeneous system, for which I use "fix balance all balance 5000 1.0 shift xyz 20 1.0 weight group 2 water 0.5 bilayer 6.0" to rebalance the processor subdomains accordingly. I believe the issue with scaling comes from the custom pair-potential file I use. Testing on boxes of pure Lennard-Jones water (i.e. without the need for this potential), the system seems to scale much better.

well, that is something that we cannot help you with. for that you
have to debug the custom code or contact its author and ask for
assistance.

axel.