[lammps-users] Collective abort of all ranks at the start of the simulation

Dear all,

I am writing regarding a collective abort of all ranks at the start of a simulation. My simulation is a granular simulation and proceeds in two steps. The first step compresses the granules with the built-in Hertzian contact law and lets the system equilibrate. After equilibration, the second step takes that converged state and equilibrates it again using a custom-written contact law. By equilibrating I mean letting the system evolve on its own, without external forces, until it reaches a potential energy minimum. This two-step procedure works well for a system of around 700 granules. When I change the system to 5000 granules, the Hertzian equilibration step still works fine, but as soon as I switch to the custom-written contact law for the second step, the simulation fails at the first time step with the following message:
caused collective abort of all ranks exit status of rank 1: killed by signal 11.
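
In outline, the relevant part of the input script looks like this (the custom pair style name "gran/custom" and all coefficient values and run lengths are just placeholders for illustration, not the exact values from my script):

    # step 1: compress and relax with the built-in Hertzian granular contact law
    pair_style   gran/hertz/history 1.0e5 NULL 50.0 NULL 0.5 1
    pair_coeff   * *
    run          500000

    # step 2: restart from the converged state and relax with the custom contact law
    # ("gran/custom" stands in for the in-house pair style)
    pair_style   gran/custom 1.0e5 50.0 0.5
    pair_coeff   * *
    run          500000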

Can someone please give a hint as to what might be happening? Why does it work for the small system but not the larger one? I am using the 16 Feb 2016 version of LAMMPS with an MPI-compiled executable.

Thanks in advance!

Best,
Aved

You have asked the same kind of question before, and the response is the same as well.
In summary:

  • it is not possible to determine the specific cause from such a general description. There are many possible reasons for a segmentation fault: it can be a bug in the code, or it can be bad input.
  • to learn more about what is causing this, the very minimum information would be a stack trace of the failure. That requires compiling LAMMPS properly and using debugging tools; an example is given in the manual: https://docs.lammps.org/Errors_debug.html (a minimal sketch follows this list).
  • it is strongly recommended to upgrade LAMMPS to the latest version. None of the LAMMPS developers has any interest in fixing problems with a 5-year-old version of LAMMPS. Over the last few years we have spent considerable effort on using a variety of tools to identify possible bugs, and we have implemented various tests and features that allow confirming that changes to the code do not create new issues (at least not in the parts of the code covered by automated testing).
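
For reference, with the legacy make-based build used by the 16 Feb 2016 version, that workflow boils down to roughly this (makefile and input file names are placeholders):

    # add -g (and preferably -O0) to CCFLAGS and LINKFLAGS in src/MAKE/Makefile.mpi,
    # then rebuild so the executable carries debug symbols
    cd src
    make clean-mpi
    make mpi

    # run the failing input under gdb and record the stack trace at the crash
    gdb --args ./lmp_mpi -in in.failing
    (gdb) run
    (gdb) backtrace full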

Axel.

Dear Axel,

Thanks for your reply. The mistakes from my previous emails were indeed resolved by appropriate changes to the input script. However, this problem seems to require something beyond that (although I might be wrong): I don’t see why my input script works with the default Hertzian potential but not with the custom-written contact law, which itself works for 600 granules but not for the current simulation of 2720 granules. Following your advice, I tried debugging with valgrind. I am not well versed with it, but I gave it a shot based on the documentation you linked, and this is what I get for the MPI version:

[akesnoff2@ssm-serv-03 new]$ valgrind mpirun -np 4 ./lmp_mpi -in jamming_3_cont3_ep_ep.spheres1
==233344== Memcheck, a memory error detector
==233344== Copyright © 2002-2017, and GNU GPL’d, by Julian Seward et al.
==233344== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==233344== Command: /home/akesnof2/LAMMPS/bin/mpirun -np 4 ./lmp_mpi -in jamming_3_cont3_ep_ep.spheres1
==233344==
LAMMPS (16 Feb 2016)
Reading data file …
triclinic box = (-1e+07 -1e+07 -1e+06) to (1e+07 1e+07 1e+06) with tilt (0 0 0)
2 by 2 by 1 MPI processor grid
reading atoms …
2720 atoms
reading velocities …
2720 velocities
Changing box …
triclinic box = (-1e+07 -1e+07 -1e+06) to (1e+07 1e+07 1e+06) with tilt (0 0 0)
2716 atoms in group granules
4 atoms in group walls
1 atoms in group tw
1 atoms in group bw
1 atoms in group lw
1 atoms in group rw
2 atoms in group stationary
2718 atoms in group non_stationary
1109 atoms in group atoms
137 atoms in group atoms2
Neighbor list info …
2 neighbor list requests
update every 1 steps, delay 100000 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 200000
ghost atom cutoff = 200000
binsize = 100000, bins = 200 200 20
Setting up Verlet run …
Unit style : si
Current step: 0
Time step : 2e-07
rank 1 in job 224 ssm-serv-03.cluster.edu_33462 caused collective abort of all ranks
exit status of rank 1: killed by signal 11

Similarly for the Serial compiled version, I get this:

[akesnoff2@ssm-serv-03 new]$ valgrind ./lmp_serial -in jamming_3_cont3_ep_ep.spheres1
==239204== Memcheck, a memory error detector
==239204== Copyright © 2002-2017, and GNU GPL’d, by Julian Seward et al.
==239204== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==239204== Command: ./lmp_serial -in jamming_3_cont3_ep_ep.spheres1
==239204==
LAMMPS (16 Feb 2016)
Reading data file …
triclinic box = (-1e+07 -1e+07 -1e+06) to (1e+07 1e+07 1e+06) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
2720 atoms
reading velocities …
2720 velocities
Changing box …
triclinic box = (-1e+07 -1e+07 -1e+06) to (1e+07 1e+07 1e+06) with tilt (0 0 0)
2716 atoms in group granules
4 atoms in group walls
1 atoms in group tw
1 atoms in group bw
1 atoms in group lw
1 atoms in group rw
2 atoms in group stationary
2718 atoms in group non_stationary
1109 atoms in group atoms
137 atoms in group atoms2
Neighbor list info …
2 neighbor list requests
update every 1 steps, delay 100000 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 200000
ghost atom cutoff = 200000
binsize = 100000, bins = 200 200 20
Setting up Verlet run …
Unit style : si
Current step: 0
Time step : 2e-07
==239204== Invalid write of size 8
==239204== at 0x5F8A8A: LAMMPS_NS::Neighbor::granular_bin_no_newton(LAMMPS_NS::NeighList*) (neigh_gran.cpp:396)
==239204== by 0x6490B3: LAMMPS_NS::Neighbor::build(int) (neighbor.cpp:1598)
==239204== by 0x4E2881: LAMMPS_NS::Verlet::setup() (verlet.cpp:117)
==239204== by 0x6A411F: LAMMPS_NS::Run::command(int, char**) (run.cpp:170)
==239204== by 0x477E15: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:723)
==239204== by 0x476488: LAMMPS_NS::Input::execute_command() (input.cpp:706)
==239204== by 0x476F41: LAMMPS_NS::Input::file() (input.cpp:243)
==239204== by 0x402932: main (main.cpp:31)
==239204== Address 0x0 is not stack’d, malloc’d or (recently) free’d
==239204==
==239204==
==239204== Process terminating with default action of signal 11 (SIGSEGV)
==239204== Access not within mapped region at address 0x0
==239204== at 0x5F8A8A: LAMMPS_NS::Neighbor::granular_bin_no_newton(LAMMPS_NS::NeighList*) (neigh_gran.cpp:396)
==239204== by 0x6490B3: LAMMPS_NS::Neighbor::build(int) (neighbor.cpp:1598)
==239204== by 0x4E2881: LAMMPS_NS::Verlet::setup() (verlet.cpp:117)
==239204== by 0x6A411F: LAMMPS_NS::Run::command(int, char**) (run.cpp:170)
==239204== by 0x477E15: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:723)
==239204== by 0x476488: LAMMPS_NS::Input::execute_command() (input.cpp:706)
==239204== by 0x476F41: LAMMPS_NS::Input::file() (input.cpp:243)
==239204== by 0x402932: main (main.cpp:31)
==239204== If you believe this happened as a result of a stack
==239204== overflow in your program’s main thread (unlikely but
==239204== possible), you can try to increase the size of the
==239204== main thread stack using the --main-stacksize= flag.
==239204== The main thread stack size used in this run was 8388608.
==239204==
==239204== HEAP SUMMARY:
==239204== in use at exit: 31,507,245 bytes in 890 blocks
==239204== total heap usage: 1,166 allocs, 276 frees, 34,100,453 bytes allocated
==239204==
==239204== LEAK SUMMARY:
==239204== definitely lost: 0 bytes in 0 blocks
==239204== indirectly lost: 0 bytes in 0 blocks
==239204== possibly lost: 0 bytes in 0 blocks
==239204== still reachable: 31,507,245 bytes in 890 blocks
==239204== of which reachable via heuristic:
==239204== stdstring : 7,062 bytes in 204 blocks
==239204== newarray : 320 bytes in 5 blocks
==239204== suppressed: 0 bytes in 0 blocks
==239204== Rerun with --leak-check=full to see details of leaked memory
==239204==
==239204== For lists of detected and suppressed errors, rerun with: -s
==239204== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

I’m working on moving to a newer LAMMPS, but since we rely on custom-written files from past members of my lab, and your previous emails suggest that porting those features won’t be trivial, I have been struggling a bit with the process. My apologies.
Do you happen to see anything glaringly wrong in the error messages above, in particular for the MPI version?

Thanks as always!

Best,
Aved

The error from the serial run indicates a serious problem in the neighbor list build: the code tries to write through a pointer that was never allocated (the invalid access is at address 0x0).
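
To see exactly which pointer that is, you can rerun the serial executable under gdb and inspect the crashing frame; a quick sketch (gdb should stop at the same line valgrind reported):

    gdb ./lmp_serial
    (gdb) run -in jamming_3_cont3_ep_ep.spheres1
    # on the SIGSEGV, look at the innermost frame and its local variables
    (gdb) frame 0
    (gdb) info locals        # a pointer printing as 0x0 is the likely culprit
    (gdb) backtrace

As an aside, in your MPI run valgrind was attached to the mpirun launcher rather than to the LAMMPS processes, which is why that run shows no valgrind report at all; to instrument each rank you would use something like: mpirun -np 4 valgrind --log-file=vg.%p.log ./lmp_mpi -in jamming_3_cont3_ep_ep.spheres1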

If you create a version of your input that can run with an unmodified LAMMPS version, I will have a look at what the problem could be.

Otherwise you are on your own. No LAMMPS developer has the time to track down a possible bug in a 5-year-old version (chances are it has already been fixed), let alone in a modified version of it (so there is also the chance that the issue does not exist in the unmodified code), or to construct an input deck that would identify the possible problem with the procedure you are implementing in your input, or any mistake in it.

While LAMMPS is open source and people are free to make their own modifications and keep them to themselves instead of sharing them with the community, there is a “price” associated with that: you either have to keep up with development (and the core code changes continuously as we try to improve it, accommodate the new features and ideas that people want to add, and take advantage of how the C++ programming language evolves), or you are stuck in time with an old version, disconnected from all the bugfixes and improvements in the upstream version and from help from the core developers.

Axel.