Segmentation fault using Granular pair style

Hello everyone,

I’m running LAMMPS version 21 Nov 2023 on Ubuntu 22.04. I’m compressing a system with fix npt and then holding it at a target pressure with a second fix npt. However, the simulation crashes every time in the commands between removing the first NPT fix and defining the second one.

I’m writing here for help analysing the stack trace from the debugger. It appears to be an issue with the atom type indices in the transfer_history() function of the granular pair style, and I would like more insight into what this stack trace means.

Thread 1 "lmp_omp1" received signal SIGSEGV, Segmentation fault.
0x00005555563239f3 in LAMMPS_NS::PairGranular::transfer_history (this=0x555559b127a0, source=0x55555b2b6060, target=0x55555a22a3a0, itype=<optimized out>, jtype=2) at ../pair_granular.cpp:836
836       class GranularModel* model = models_list[types_indices[itype][jtype]];
(gdb) where
#0  0x00005555563239f3 in LAMMPS_NS::PairGranular::transfer_history (this=0x555559b127a0, source=0x55555b2b6060, 
    target=0x55555a22a3a0, itype=<optimized out>, jtype=2) at ../pair_granular.cpp:836
#1  0x0000555555d7a087 in LAMMPS_NS::FixNeighHistory::pre_exchange_newton (this=0x555559b4fe10)
    at ../fix_neigh_history.cpp:435
#2  0x0000555555d78df0 in LAMMPS_NS::FixNeighHistory::pre_exchange (this=0x555559b4fe10)
    at ../fix_neigh_history.cpp:232
#3  LAMMPS_NS::FixNeighHistory::pre_exchange (this=0x555559b4fe10) at ../fix_neigh_history.cpp:227
#4  LAMMPS_NS::FixNeighHistory::write_restart (this=0x555559b4fe10, fp=0x555559b27470)
    at ../fix_neigh_history.cpp:872
#5  0x000055555585dc9c in LAMMPS_NS::Modify::write_restart (this=0x5555597f9ec0, fp=0x555559b27470)
    at ../modify.cpp:1467
#6  0x00005555559da916 in LAMMPS_NS::WriteRestart::write (this=0x5555599a5a30, file="Post_compress.restart")
    at ../write_restart.cpp:241
#7  0x00005555559dbabb in LAMMPS_NS::WriteRestart::command (this=0x5555599a5a30, narg=1, arg=0x555559b412d0)
    at ../write_restart.cpp:113
#8  0x00005555557faf4c in LAMMPS_NS::Input::execute_command (this=0x555559705fb0) at ../input.cpp:868
#9  0x00005555557fb927 in LAMMPS_NS::Input::file (this=0x555559705fb0) at ../input.cpp:313
#10 0x00005555557e9721 in main (argc=<optimized out>, argv=<optimized out>) at ../main.cpp:77

Any insight would be appreciated, thanks in advance!

There is not much more that can be said. The stack trace tells you exactly which command fails and the line in the source code. So you need to inspect the variables and arrays accessed on that line and determine from the source code and your settings whether it works as expected.

For any advice beyond that you need to provide a suitable simple test case, information about how you configured and compiled LAMMPS, what platform you are running it on and with which command line you start it.
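
Most of the build information can be captured from the help output of the executable itself, which (in recent LAMMPS versions) lists the version, OS/compiler, active compile-time flags, and installed packages:

lmp -h

(substitute the name of your binary, e.g. lmp_omp1) and paste the relevant parts of that output into your post.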

Hello @akohlmey,

Thank you for your reply.

I’ve built LAMMPS using makefile.omp with the OPENMP, GRANULAR, and EXTRA-* packages enabled (and a few others that aren’t used here, so this isn’t a comprehensive list).

I’m running it as mpiexec -n 6 lmp -in inp_jkr.lmp -log crash_test.log
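
(For what it’s worth, the same OpenMP settings could equivalently be requested on the command line instead of via the package/suffix lines in the script below:

mpiexec -n 6 lmp -pk omp 1 -sf omp -in inp_jkr.lmp -log crash_test.log

but I set them in the input file as shown.)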

While trying to work around the above issue, I tried ‘neighbor multi’ due to the large difference in my particle sizes. In doing so, I ran into a problem where the multi neighbor style does not work when used with omp acceleration.

Here’s a simplified representation of my input file with which I face this issue:

units           micro
package omp 1
suffix omp
dimension       3
atom_style      hybrid sphere dipole
boundary        p p p

neighbor 0.3 multi
atom_modify     map array
comm_modify     vel yes

region          box block -3 3 -3 3 -15 15 units box
create_box      2 box
create_atoms 1 random 386480 34875 NULL units box
create_atoms 2 random 39452 79285 NULL units box

set type 1 diameter 0.1 density 0.5
set type 2 diameter 0.03  density 16.00


neigh_modify     every 1000 delay 5000 check yes one 60000 page 600000

pair_style     hybrid/overlay lj/sf/dipole/sf  0.2 granular

pair_coeff      1 1 lj/sf/dipole/sf 0.04 0.08 0.1
pair_coeff      2 2 lj/sf/dipole/sf 0.083 0.2575 0.1575

pair_coeff      1 2 lj/sf/dipole/sf 0.0476235235991626 0.10875 0.12875

pair_coeff    1 1 granular jkr 300 300 0.3 7 tangential mindlin_rescale/force NULL 1 0.5
pair_coeff    2 2 granular jkr 1000 550 0.27 5 tangential mindlin_rescale/force NULL 1 0.1

variable tim_step equal 0.000005
timestep        ${tim_step}


min_style cg
minimize 1e-150 1e-150 5000000 10000000
write_data cood.data
write_restart test.res
restart 10000 slurry.*.restart

compute mytemp all temp/sphere
compute mypress all pressure mytemp
compute ke_rot all erotate/sphere

velocity all create 300.0 406659 mom yes rot yes dist gaussian temp mytemp loop geom


variable tdamp equal 100*${tim_step}
variable pdamp equal 1000*${tim_step}

fix NPT all npt/sphere temp 300.0 300.0 ${tdamp} iso 0.0 20.0 ${pdamp}
fix_modify NPT temp mytemp press mypress
run 140000
unfix NPT

Running this input under gdb yields the following stack trace.

Thread 1 "lmp" received signal SIGSEGV, Segmentation fault.
0x00005555558c6ed7 in _ZN9LAMMPS_NS27NPairHalfSizeMultiNewtonOmp5buildEPNS_9NeighListE._omp_fn.0(void) () at ../npair_half_size_multi_newton_omp.cpp:182
182       for (j = js; j >= 0; j = bins[j]) {
(gdb) where
#0  0x00005555558c6ed7 in _ZN9LAMMPS_NS27NPairHalfSizeMultiNewtonOmp5buildEPNS_9NeighListE._omp_fn.0(void) ()
    at ../npair_half_size_multi_newton_omp.cpp:182
#1  0x00007ffff5a93a16 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00005555558c6886 in LAMMPS_NS::NPairHalfSizeMultiNewtonOmp::build (this=0x555559cf40b0, list=0x555559cf3470)
    at ../npair_half_size_multi_newton_omp.cpp:53
#3  0x00005555558857cf in LAMMPS_NS::Neighbor::build (this=0x5555599852b0, topoflag=1) at ../neighbor.cpp:2472
#4  0x000055555584f0a1 in LAMMPS_NS::Min::setup (this=0x5555597399b0, flag=1) at ../min.cpp:271
#5  0x000055555585013f in LAMMPS_NS::Minimize::command (this=0x5555599a9c80, narg=<optimized out>,
    arg=0x555559b4fce0) at ../minimize.cpp:59
#6  0x00005555557faf4c in LAMMPS_NS::Input::execute_command (this=0x555559705dd0) at ../input.cpp:868
#7  0x00005555557fb927 in LAMMPS_NS::Input::file (this=0x555559705dd0) at ../input.cpp:313
#8  0x00005555557e9721 in main (argc=<optimized out>, argv=<optimized out>) at ../main.cpp:77
(gdb) print j
$1 = <optimized out>
(gdb) print js
$2 = <optimized out>
(gdb) print bins
value has been optimized out
(gdb) print bins[j]
value has been optimized out
(gdb)

This issue does not occur when I run without omp acceleration. This is just something I wanted to bring to your attention.

Regards,
Sourab

This is a completely different issue.

This is too large a box/system for easy debugging.

Let’s not pile additional issues on top of the existing issue but rather try to have as few “complications” and address them one at a time.

You may want to consider using the “overlap” keyword to create_atoms to avoid close contacts right away rather than depending on the minimization to resolve them.
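
As a rough sketch of what I mean, using the create_atoms lines and diameters from your script (the overlap distances and the maxtry limit here are illustrative values you would need to adapt):

create_atoms 1 random 386480 34875 NULL overlap 0.1 maxtry 100 units box
create_atoms 2 random 39452 79285 NULL overlap 0.03 maxtry 100 units box

Note that with overlap rejection, random insertion may place fewer atoms than requested at this density, so check the reported counts.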

These neigh_modify settings are just crazy and make no sense at all. Why not stick with the (conservative) defaults first and see whether additional changes are needed or helpful later?
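
The conservative route is to simply delete that neigh_modify line, which (if I remember the current defaults correctly) amounts to:

neigh_modify every 1 delay 0 check yes one 2000 page 100000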

Does it run with “bin” for you? It does for me.
Do you see the code spending excessive amounts of time in the neighbor list builds? I don’t. And that makes perfect sense: your “small” particles make up only about 10% of your total system, so the total speedup from optimizing the neighbor list build would be rather small. Have a look at Amdahl’s Law for how much speedup you can theoretically achieve when you can only improve a small part of your calculation.
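
To put a number on it, Amdahl’s law gives the overall speedup as

S = 1 / ((1 - p) + p/s)

where p is the fraction of the work you can accelerate and s is the speedup of that part. With p ≈ 0.1 for the small-particle neighboring, even an infinitely fast neighbor build (s → ∞) yields at most S = 1/0.9 ≈ 1.11, i.e. about 11% overall.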

In summary, before looking at the details of crashes, you first need to build an input deck that works and gives meaningful results. There is no point in debugging something that is a hodge-podge of meaningful input mixed with settings that make no sense or premature performance tweaks. Before it is worth testing and debugging, you need to provide a solid baseline.

Hello @akohlmey, I’ve been experimenting with the simplified script above and with my original script.
I’ve made the box smaller and introduced the overlap keyword to reduce the burden on the minimization.

You’re right about the neigh_modify settings. I’d progressively increased the one and page sizes because my simulations kept crashing with the error "ERROR on proc 11: Neighbor history overflow, boost neigh_modify one (../fix_neigh_history_omp.cpp:277)". However, after further testing it seems that the crash occurs only when I try to write a restart file with the write_restart command.
I’m still working on reproducing this issue with my simplified system to see how this can be narrowed down further.

I was trying multi since it is suggested for systems with a significant difference in particle sizes. I thought these sizes were significantly different; it seems that I was wrong. However, I see that a significant chunk of my time is spent in communication (about 25% with 8 MPI tasks). This seems unnaturally high for a system with 40000 particles. Do you have any recommendations for improving my scaling efficiency?

Meanwhile, I’ll continue working on simplifying this issue further and post a final script once I narrow it down.

Thank you for taking the time to analyze my script.

You didn’t pay attention to what I wrote. They are significantly different, but “multi” neighbor lists are only significantly beneficial when the smaller kind of particle is in a significant majority. Please study the most recent LAMMPS paper to learn about neighbor list stencils and so on, and study the LAMMPS manual. You are simplifying the situation too much. Yes, you could speed up the search for neighbors of the small particles a bit, but since those are only 10% of your system, there is little potential for overall speedup. As I mentioned before, have a look at Amdahl’s Law. While it is stated in terms of speedup through parallelization, it applies equally to speedup through algorithmic improvements. And Amdahl’s Law does not account for overhead: the multi neighbor list style has additional overhead compared to bin, so it is not even obvious that there would be an improvement at all.

If this happens over time, then chances are, there is something wrong with your model and the system is collapsing. You can confirm this easily with visualization.
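
For example (the output file name and interval here are arbitrary choices), a dump command along the lines of

dump viz all custom 1000 traj.lammpstrj id type x y z radius

produces a trajectory you can load into OVITO or VMD to see whether particles clump together or the box collapses.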

My recommendation is to worry about correctness first and performance later. You are obviously infected by a sickness that is called “premature optimization™”.