Random segmentation faults

Hello,

I am running LAMMPS (29 Aug 2024) on a cluster under Rocky Linux 8.10. Most of the time it works well, but for some calculations I get a segmentation fault. I looked at this page (7.4. Debugging crashes — LAMMPS documentation) and tried some of the things suggested there, and I also searched the topics here on segmentation faults, but I can’t figure out what the issue is exactly.

My issue is that some of my calculations stop at the same point with a segmentation fault error message, regardless of the node I run on and the number of CPUs I use. For instance:

[node09:64850:0:64850] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xef9bdbd0)
==== backtrace (tid: 64850) ====
0 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/libucs.so.0(ucs_handle_error+0x2fd) [0xa36484d]
1 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/libucs.so.0(+0x2fa3f) [0xa364a3f]
2 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/libucs.so.0(+0x2fc0a) [0xa364c0a]
3 /lib64/libc.so.6(+0x4e5b0) [0x99b95b0]
4 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(+0x1178c03) [0x8a28c03]
5 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN6ReaxFF14Compute_ForcesEPNS_11reax_systemEPNS_14control_paramsEPNS_15simulation_dataEPNS_7storageEPPNS_9reax_listE+0x31e) [0x8a29b1e]
6 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS10PairReaxFF7computeEii+0x12b) [0x8a1c84b]
7 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS6Verlet3runEi+0x22d) [0x837d9bd]
8 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xe0c) [0x83056ac]
9 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0x8c8) [0x812b118]
10 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x192) [0x812bfa2]
11 lmp(main+0x51) [0x10a281]
12 /lib64/libc.so.6(__libc_start_main+0xe5) [0x99a57e5]
13 lmp(+0x2303) [0x10a303]
=================================

I tried to use valgrind, but I have a hard time reading its output. If someone could help me identify where the issue comes from, that would be great. I thought it could be a RAM issue, but it also happens with pretty small systems, and the RAM usage I observed seems quite low compared to the available RAM.

Here is a link with the input, output, submission file, and valgrind output of one failing calculation, since I can’t upload files as a new user: https://user.fm/files/v2-69b60c25a7167b684535aec6d9b82d34/msci-segfault-lammps.tar.gz

Best,
Hugo Geindre.

Your tar file is missing the data file, so it is not possible to reproduce your issue independently.

However, from what I see, you are most likely suffering from a general problem of the standard implementation of ReaxFF in LAMMPS. It uses a heuristic to guess how much memory to allocate for bonded interactions and hydrogen bonds, based on the current density. That guess will fail when you compress your system: the number of those interactions grows, and eventually you get out-of-bounds memory accesses because the allocated buffers are too small.

To some degree, this can be mitigated by increasing the safezone parameter of the pair_style command. However, it is generally recommended to compile the KOKKOS package version of ReaxFF (you don’t need a GPU for that; it also works with MPI and OpenMP parallelization), as it has a different, more robust memory management approach.
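As a rough sketch (the force-field file name, element list, and chosen values below are placeholders, not taken from your input), the over-allocation factors can be raised in the pair style line like this:

```
# Hypothetical example: raise ReaxFF buffer over-allocation factors.
# safezone scales the size of the bonded/hydrogen-bond buffers;
# mincap sets a minimum capacity for those lists.
pair_style reaxff NULL safezone 1.6 mincap 100
pair_coeff * * ffield.reax C H O N   # placeholder force field and elements
```

With a KOKKOS-enabled binary, the same input can then be run on CPUs with the command-line switches that activate the KOKKOS styles, for example something like `mpirun -np 4 lmp -k on -sf kk -in in.lammps` (adjust to your launcher and build).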

If that is not sufficient, you can break your simulation into multiple chunks, so that the internal array buffers are re-initialized at the start of each chunk.
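A minimal sketch of that chunking idea, using restart files so that each chunk starts a fresh LAMMPS process (file names and step counts are placeholders):

```
# in.chunk — hypothetical continuation input; run it repeatedly,
# each time in a new lmp process, so ReaxFF buffers are re-allocated
# for the current (possibly compressed) state of the system.
read_restart   restart.latest
pair_style     reaxff NULL safezone 1.6
pair_coeff     * * ffield.reax C H O N   # placeholder force field and elements
fix            qeq all qeq/reaxff 1 0.0 10.0 1.0e-6 reaxff
run            50000
write_restart  restart.latest
```

Note that fixes and pair settings are not stored in the restart file for ReaxFF, so the continuation input has to re-specify them, as sketched above.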

You may also consider upgrading to the most recent version of LAMMPS for additional improvements and bug fixes (unrelated to your specific issue at hand).

Thanks a lot. I updated the tar file with the .data file and the force-field parameters. I will try the different solutions you proposed.

I tried increasing safezone, and it is working for now for what I want to do. I will keep the other solutions in mind in case I run into other issues, but for now it’s all good. Thanks a lot for your help!