Segmentation fault with Hydrogen bonds in ReaxFF simulation

Hello,

I’m simulating a system of 418497 atoms (C, H, O, Si) with the ReaxFF force field in LAMMPS.

However, with the default pair_style reax/c keyword settings we get a segmentation fault…

This problem happens only when hydrogen bonds are taken into account. If I turn off hydrogen bonds by setting hbond_cutoff = 0.0, everything runs smoothly.

However, since hydrogen bonds do not take much computational effort, I would like to keep them so that the simulation captures the full physics… According to the LAMMPS documentation, increasing safezone, mincap, and minhbonds can help avoid this memory problem.

But I don’t know how to set safezone, mincap, and minhbonds properly. Could anyone with experience help me choose suitable values for these parameters?
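For reference, my understanding from the pair_style reax/c documentation is that these keywords are appended to the pair_style line, e.g. (the numbers below are only guesses of mine to illustrate the syntax, not recommended values, and lmp_control / ffield.reax are placeholders for my own control and force-field files):

# increase the memory reserves above the defaults (1.2 / 50 / 25, if I read the documentation correctly)
pair_style reax/c lmp_control safezone 1.6 mincap 100 minhbonds 50
pair_coeff * * ffield.reax C H O Si

What I don’t know is how far above the defaults one should go before the extra memory overhead becomes a problem itself.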

Please find the log file attached below:

Switching to atp/3.8.1.
Switching to cray-libsci/20.09.1.
Switching to cray-mpich/7.7.16.
Switching to craype/2.7.3.
Switching to gcc/9.3.0.
Switching to modules/3.2.11.4.
Switching to perftools-base/20.10.0.
Switching to pmi/5.0.17.
LAMMPS (29 Oct 2020)
Reading data file …
orthogonal box = (-27.114037 -27.114037 90.681449) to (27.114037 27.114037 1717.5237)
2 by 3 by 64 MPI processor grid
reading atoms …
418497 atoms
reading velocities …
418497 velocities
read_data CPU = 0.828 seconds
WARNING: Changed valency_val to valency_boc for X (…/reaxc_ffield.cpp:315)
1 atoms in group leftend
1 atoms in group rightend
2 atoms in group ends
418495 atoms in group mobile
377 atoms in group Si
376 atoms in group O
Neighbor list info …
update every 1 steps, delay 0 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 10 10 272
2 neighbor lists, perpetual/occasional/extra = 2 0 0
(1) pair reax/c, perpetual
attributes: half, newton off, ghost
pair build: half/bin/newtoff/ghost
stencil: half/ghost/bin/3d/newtoff
bin: standard
(2) fix qeq/reax, perpetual, copy from (1)
attributes: half, newton off, ghost
pair build: copy
stencil: none
bin: none
Setting up Verlet run …
Unit style : real
Current step : 0
Time step : 0.1
Per MPI rank memory allocation (min/avg/max) = 14.30 | 196.7 | 229.4 Mbytes
Step Temp Press
0 299.85684 1400.3017
100 300.31343 946.60227
200 300.15235 1193.3041
300 300.61709 1197.7735
400 300.51174 1139.3772
500 300.15202 1300.5926
600 300.01009 1255.0846
700 299.94328 1294.3292
800 299.68773 1438.4328
900 300.37089 1179.9571
1000 299.89691 1148.9351
1100 300.406 1141.2664
1200 299.97559 1119.1885
1300 300.0801 1156.9803
1400 299.64831 1244.9206
1500 299.66684 931.84998
1600 300.05785 1138.7361
1700 300.00166 868.96266
1800 299.93156 1345.1016
1900 299.61989 954.07765
2000 299.56349 1138.134
2100 299.47682 1512.8995
2200 299.63595 1144.6486
2300 299.1849 1151.1576
2400 299.55125 1185.6015
2500 299.08828 1220.9978
2600 299.33538 1145.6643
2700 299.42813 1047.8522
2800 299.64344 1191.9957
2900 299.0358 1006.7449
3000 299.53771 1123.1697
3100 299.14231 1054.6382
3200 299.85462 1287.5144
3300 299.54528 976.68343
3400 299.90895 1102.6466
3500 299.82367 1107.1733
3600 299.7679 916.07419
3700 299.52757 1109.2563
3800 299.70418 985.02675
3900 299.9915 1253.0415
4000 299.7047 1110.0561
4100 299.99289 1470.641
4200 299.82057 1023.1005
4300 299.84273 1276.7899
4400 299.53671 1176.1674
4500 299.64436 1208.5341
4600 299.68949 1059.2215
4700 300.1913 1378.6916
4800 299.55159 1224.6685
4900 299.86485 791.08323
5000 299.32549 1427.6871
5100 299.61656 898.04491
5200 300.02969 1300.9473
5300 300.34324 1006.5084
5400 300.21975 1209.4758
5500 300.24578 1067.4637
5600 300.07319 1369.5069
5700 300.44473 1234.9089
5800 299.85523 1140.9252
5900 300.06583 1066.7465
6000 300.45539 1255.8554
6100 300.22232 1215.8111
6200 300.19014 821.76173
6300 300.10501 1158.3549
6400 299.82213 1246.6867
6500 300.06389 1007.5353
6600 300.28497 1132.2325
6700 300.30042 922.22323
6800 300.04256 1149.966
6900 299.99464 1128.0622
7000 299.94716 1280.9031
7100 300.09931 1244.0356
7200 299.84195 1324.9132
7300 299.96295 1228.6205
7400 300.06434 1232.2774
7500 299.77306 1099.8563
7600 299.67707 1216.0967
7700 299.53759 890.09544
srun: error: nid07488: task 77: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=33673796.0
slurmstepd: error: *** STEP 33673796.0 ON nid07482 CANCELLED AT 2021-09-06T16:59:50 ***
7800 299.76941 srun: error: nid07494: tasks 144-155: Terminated

I am not sure whether you are on the right path with respect to tracking down the cause.

What is rather suspicious is the box geometry and, consequently, the decomposition into MPI ranks. Can you provide more information about the system you are simulating, in particular whether it contains vacuum regions? Perhaps also attach a (small) snapshot image and the input file.

Please also note that the ReaxFF code has seen some refactoring, both in the KOKKOS version (some time ago) and in the regular version (rather recently), which reduces the risk of bad memory accesses.

That still won’t help with the basic assumption in the ReaxFF code base that the number of interactions per atom does not change too much and that there are no empty regions.

Thanks a lot for your reply. I really appreciate your help.

I will double check the MPI grid and the empty spaces in the box.

Unfortunately, the newest versions are not yet available on our computing platform… and I may not be able to compile the newest version myself because my access to the platform will expire soon. Still, I trust the refactoring in the new versions; in the meantime I will try to figure out suitable settings myself.

It may not help if there is a fundamental problem where the number of hydrogen bonds changes a lot during the course of a simulation. That currently causes issues with all of the ReaxFF variants in LAMMPS (just to different degrees). It is aggravated if you have a large number of processors and some vacuum, because the relative changes are largest in those regions.

The simulation that fails is actually part of a long marathon… and at the beginning I used an NPT run to equilibrate the system to its minimum volume (hopefully that eliminates the vacuum…).

This simulation is about the stretching/extension of a polymer chain. Perhaps the conformational changes of the polymer make the number of hydrogen bonds vary a lot…

I agree that more processors increase the risk of the memory issue. In my case, if I reduce the number of CPUs, the segfault is delayed…

A possible way to reduce that risk further (and to speed up your calculation a little, if you are lucky) would be to use the balance or fix balance command, which shifts the per-processor sub-box divisions so that each MPI rank holds a similar number of atoms: fix balance command — LAMMPS documentation
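As a minimal sketch (the interval and thresholds below are only illustrative and would need tuning for your system; since your box is very long in z, shifting the cuts along z is probably what matters most):

# one-time rebalance before the run, then dynamic rebalancing every 1000 steps
balance 1.1 shift z 10 1.05
fix lb all balance 1000 1.05 shift z 10 1.05

With a shift-style balance the 2 by 3 by 64 processor grid stays the same; only the positions of the cutting planes between sub-boxes are adjusted.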