Hello,
I am running LAMMPS (29 Aug 2024) on a cluster under Rocky Linux 8.10. Most of the time it works well, but some calculations end with a segmentation fault. I read the "Debugging crashes" page (section 7.4 of the LAMMPS documentation) and tried some of its suggestions, and I also read earlier topics here on segmentation faults, but I cannot figure out what exactly the issue is.
For the affected calculations, the run always stops at the same point with a segmentation fault, whichever node I run on and however many CPUs I use. For instance:
[node09:64850:0:64850] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xef9bdbd0)
==== backtrace (tid: 64850) ====
0 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/libucs.so.0(ucs_handle_error+0x2fd) [0xa36484d]
1 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/libucs.so.0(+0x2fa3f) [0xa364a3f]
2 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/libucs.so.0(+0x2fc0a) [0xa364c0a]
3 /lib64/libc.so.6(+0x4e5b0) [0x99b95b0]
4 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(+0x1178c03) [0x8a28c03]
5 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN6ReaxFF14Compute_ForcesEPNS_11reax_systemEPNS_14control_paramsEPNS_15simulation_dataEPNS_7storageEPPNS_9reax_listE+0x31e) [0x8a29b1e]
6 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS10PairReaxFF7computeEii+0x12b) [0x8a1c84b]
7 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS6Verlet3runEi+0x22d) [0x837d9bd]
8 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xe0c) [0x83056ac]
9 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0x8c8) [0x812b118]
10 /home/hgeindre/miniconda3/envs/lammps-env/bin/…/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x192) [0x812bfa2]
11 lmp(main+0x51) [0x10a281]
12 /lib64/libc.so.6(__libc_start_main+0xe5) [0x99a57e5]
13 lmp(+0x2303) [0x10a303]
=================================
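In case it helps when reading the trace, here is a minimal sketch that pulls the named frames out of a backtrace in this format. The function name `named_frames` and the `/path/to/lib` prefix in the sample are placeholders I made up for illustration, and the regular expression only assumes the `index path(symbol+0xoffset) [0xaddress]` layout shown above. Note that frame 4, where the crash actually happens, has no symbol name, so the innermost named frame is `Compute_Forces` in the `ReaxFF` namespace.

```python
import re

# Matches frames of the form "<index> <path>(<symbol>+0x<offset>) [0x<address>]".
# Frames without a symbol name, e.g. "(+0x2fa3f)", are skipped on purpose.
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\((\w+)\+0x([0-9a-fA-F]+)\)\s+\[0x[0-9a-fA-F]+\]"
)

def named_frames(backtrace_text):
    """Return (index, library, symbol) for every frame that carries a symbol name."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            index, path, symbol, _offset = match.groups()
            # Keep only the file name of the library or executable.
            frames.append((int(index), path.rsplit("/", 1)[-1], symbol))
    return frames

# Shortened sample; "/path/to/lib" is a placeholder, not the real install path.
sample = """\
4 /path/to/lib/liblammps.so.0(+0x1178c03) [0x8a28c03]
6 /path/to/lib/liblammps.so.0(_ZN9LAMMPS_NS10PairReaxFF7computeEii+0x12b) [0x8a1c84b]
7 /path/to/lib/liblammps.so.0(_ZN9LAMMPS_NS6Verlet3runEi+0x22d) [0x837d9bd]
11 lmp(main+0x51) [0x10a281]
"""

for frame in named_frames(sample):
    print(frame)
```

The symbols stay in their mangled C++ form here; a demangler such as `c++filt` can turn them into readable signatures.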
I tried valgrind, but I have a hard time reading its output. It would be great if someone could help me identify where the issue comes from. I first thought it could be a RAM issue, but the crash also happens with fairly small systems, and the RAM usage I observed is quite low compared to the available RAM.
Here is a link to an archive with the input, output, submission file, and valgrind output of one failing calculation, since I cannot upload files as a new user: https://user.fm/files/v2-69b60c25a7167b684535aec6d9b82d34/msci-segfault-lammps.tar.gz
Best,
Hugo Geindre.