[lammps-users] Problem with pair_style quip

Anup_Pandey · July 20, 2021, 12:22am

Hi,

I am trying to use the ‘quip’ pair_style and have built LAMMPS with the ML-QUIP package. I consistently get a “Segmentation fault” (output below) when testing with the example files. Could someone help me with this? Thank you.

akohlmey · July 20, 2021, 11:02am

Can you try to get a stack trace of the segfault? The manual has instructions for how to do this.

The code has not been updated in years so there could have been a change in the API.

akohlmey · July 20, 2021, 3:25pm

I just made a test using the latest source from the QUIP git repo and the input examples work for me.
However, when running with valgrind, I noticed that the QUIP code requires a large amount of stack space, which may lead to errors if the stack limit is not set high enough.

So please check the output of: ulimit -s
and try to increase that value (e.g. with: ulimit -s unlimited)
and see if that changes things.
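
The check above could be scripted roughly as follows. This is only a sketch: it reports the current soft stack limit and tries to raise it, falling back to the hard limit if "unlimited" is not permitted on your system.

```shell
# Report the current soft stack limit of this shell.
current=$(ulimit -s)
echo "soft stack limit: ${current}"

if [ "${current}" != "unlimited" ]; then
    # Try "unlimited" first; if that is refused, try the hard limit instead.
    ulimit -s unlimited 2>/dev/null \
        || ulimit -s "$(ulimit -H -s)" 2>/dev/null \
        || echo "could not raise the stack limit" >&2
fi
echo "stack limit now: $(ulimit -s)"
```

The setting only applies to the current shell and the processes started from it, so on a cluster it most likely has to be repeated inside the batch job script before launching LAMMPS.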

If that doesn’t resolve the situation, please describe in detail how you compiled and configured QUIP and LAMMPS.

Axel.

Anup_Pandey · July 20, 2021, 3:54pm

Hi Axel,

The output of ulimit -s is unlimited. I used the following steps to compile LAMMPS with QUIP:

$ cd QUIP

$ export QUIP_DIR=/path/to/QUIP

$ export QUIP_ARCH=linux_x86_64_gfortran

$ make config

$ make libquip

cd lammps

mkdir build; cd build

cmake3 -C ../cmake/presets/most.cmake -D PKG_ML-QUIP=yes -D BUILD_MPI=no -D QUIP_LIBRARY=/misc/projects/meso4d/PANDEY/Lammps-build/QUIP/build/linux_x86_64_gfortran/libquip.a ../cmake/

cmake3 --build .

Thank you.

akohlmey · July 20, 2021, 4:11pm

What platform are you compiling on?
Perhaps your Fortran compiler is not “modern” enough to correctly compile the QUIP library.
As mentioned before, without a stack trace (and for that you should compile QUIP with the -g flag added) it is difficult to tell remotely where the segmentation fault is happening.

From what I can tell on my machine, the LAMMPS side of the code looks correct and thus the segfault is likely happening from inside the QUIP library (the stack trace can confirm this), which then makes it an issue to discuss with the QUIP library developers. They can probably also assist in telling you the minimum requirement and how to debug this further.

Axel.

p.s.: https://docs.lammps.org/Errors_debug.html#using-the-gdb-debugger-to-get-a-stack-trace and https://docs.lammps.org/Errors_debug.html#using-valgrind-to-get-a-stack-trace
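
As a rough sketch of the gdb route described on the first page: the helper below assembles a non-interactive gdb call that runs LAMMPS and prints a backtrace when it stops. The function name lmp_backtrace, the DRY_RUN switch, and the default paths ./lmp and in.gap are placeholders chosen for illustration, not anything taken from the manual.

```shell
# Run LAMMPS under gdb in batch mode and print a backtrace after the crash.
# $1: path to the LAMMPS executable (placeholder default: ./lmp)
# $2: input file (placeholder default: in.gap)
# With DRY_RUN=1 the command is only printed, not executed.
lmp_backtrace() {
    lmp=${1:-./lmp}
    input=${2:-in.gap}
    # Rebuild the positional parameters as the full gdb command line.
    set -- gdb -batch -ex run -ex bt --args "$lmp" -in "$input"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 lmp_backtrace ./lmp in.sw` prints the command so it can be checked first, and `lmp_backtrace ./lmp in.sw` runs it. The backtrace is only useful if both LAMMPS and QUIP were compiled with -g.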

Anup_Pandey · July 20, 2021, 5:24pm

Hi Axel,

I am compiling on a Linux cluster. GAP/QUIP runs well on the cluster, and so does LAMMPS if I use it with another pair_style. I think the problem is with linking the QUIP library to LAMMPS (something is missing). I have attached the error output from a valgrind run.

Thank you for your support.

crash-30408.txt (30.2 KB)

akohlmey · July 20, 2021, 6:16pm

Which version of LAMMPS is this with?
I cannot match the source lines in the valgrind output with the current version of LAMMPS.

Anup_Pandey · July 20, 2021, 6:22pm

It’s LAMMPS (29 Oct 2020).

Thanks,

akohlmey · July 20, 2021, 7:06pm

Sorry, but that still does not match. Your valgrind output reports:

==30408== Invalid write of size 1
==30408== at 0x4C2D203: strcpy (vg_replace_strmem.c:513)
==30408== by 0xDC44E0: LAMMPS_NS::PairQUIP::coeff(int, char**) (pair_quip.cpp:233)
==30408== by 0x414345: LAMMPS_NS::Input::pair_coeff() (input.cpp:1686)
==30408== by 0x41A4BC: LAMMPS_NS::Input::execute_command() (input.cpp:743)
==30408== by 0x41A83E: LAMMPS_NS::Input::file() (input.cpp:263)
==30408== by 0x40E272: main (main.cpp:64)
==30408== Address 0x93f247f is 0 bytes after a block of size 15 alloc'd
==30408== at 0x4C2AC38: operator new[](unsigned long) (vg_replace_malloc.c:433)
==30408== by 0xDC44CC: LAMMPS_NS::PairQUIP::coeff(int, char**) (pair_quip.cpp:232)
==30408== by 0x414345: LAMMPS_NS::Input::pair_coeff() (input.cpp:1686)
==30408== by 0x41A4BC: LAMMPS_NS::Input::execute_command() (input.cpp:743)
==30408== by 0x41A83E: LAMMPS_NS::Input::file() (input.cpp:263)
==30408== by 0x40E272: main (main.cpp:64)
==30408==
==30408== Invalid write of size 1
==30408== at 0x4C2D203: strcpy (vg_replace_strmem.c:513)
==30408== by 0xDC450D: LAMMPS_NS::PairQUIP::coeff(int, char**) (pair_quip.cpp:237)
==30408== by 0x414345: LAMMPS_NS::Input::pair_coeff() (input.cpp:1686)
==30408== by 0x41A4BC: LAMMPS_NS::Input::execute_command() (input.cpp:743)
==30408== by 0x41A83E: LAMMPS_NS::Input::file() (input.cpp:263)
==30408== by 0x40E272: main (main.cpp:64)
==30408== Address 0x93f2300 is 0 bytes after a block of size 48 alloc'd
==30408== at 0x4C2AC38: operator new[](unsigned long) (vg_replace_malloc.c:433)
==30408== by 0xDC44F9: LAMMPS_NS::PairQUIP::coeff(int, char**) (pair_quip.cpp:236)
==30408== by 0x414345: LAMMPS_NS::Input::pair_coeff() (input.cpp:1686)
==30408== by 0x41A4BC: LAMMPS_NS::Input::execute_command() (input.cpp:743)
==30408== by 0x41A83E: LAMMPS_NS::Input::file() (input.cpp:263)
==30408== by 0x40E272: main (main.cpp:64)

This obviously refers to the following block of code, but you can see that the line numbers don’t match.

266 n_quip_file = strlen(arg[2]);
267 quip_file = new char[n_quip_file+1];
268 strcpy(quip_file,arg[2]);
269
270 n_quip_string = strlen(arg[3]);
271 quip_string = new char[n_quip_string+1];
272 strcpy(quip_string,arg[3]);

Also, the quoted code is correct, whereas your valgrind output (a 1-byte write by strcpy landing 0 bytes past the end of the allocated block) suggests that the code you ran still has a bug that was fixed in commit 17aff29fe2e70bbb68d44b0b92ecb105124bdd86 in July 2017.

axel.

Anup_Pandey · July 20, 2021, 9:38pm

Axel,
Here is the output again:

LAMMPS (29 Oct 2020)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
using 1 OpenMP thread(s) per MPI task
Reading data files …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (10.968476 10.968476 10.968476)
1 by 1 by 1 MPI processor grid
reading atoms …
64 atoms
read_data CPU = 0.001 seconds
Neighbor list info …
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 4.3
ghost atom cutoff = 4.3
binsize = 2.15, bins = 6 6 6
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair quip, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Segmentation fault (core dumped)

I tried recompiling LAMMPS from git, but for some reason the ML-QUIP package gets excluded during the build. Were you able to build LAMMPS with QUIP from the git stable branch?

Again, the steps I have adopted are as follows:

$ git clone --recursive https://github.com/libAtoms/QUIP.git QUIP

$ cd QUIP

$ export QUIP_DIR=/path/to/QUIP

$ export QUIP_ARCH=linux_x86_64_gfortran

$ make config

$ make libquip

git clone -b stable https://github.com/lammps/lammps.git mylammps #cloning stable version

cd lammps # change to the LAMMPS distribution directory

mkdir build; cd build # create and use a build directory

cmake3 -C ../cmake/presets/most.cmake -D PKG_ML-QUIP=yes -D BUILD_MPI=no -D QUIP_LIBRARY=/misc/projects/meso4d/PANDEY/Lammps-build/QUIP/build/linux_x86_64_gfortran/libquip.a ../cmake/

cmake3 --build .

akohlmey · July 20, 2021, 10:02pm

Yes, it is possible to compile LAMMPS from the stable branch of the git repo with QUIP enabled. And yes, I was able to do so. And yes, it runs fine for me.
However, since the packages were renamed recently, you have to use the old flag, -DPKG_USER-QUIP=yes, instead of -DPKG_ML-QUIP=yes.
When running with valgrind I have to add the --max-stackframe=2169448 flag due to the large stack use of the QUIP library.
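
One way to avoid guessing the package flag is to look at which package directory the checked-out source actually contains. A minimal sketch, assuming the package sources sit under src/ as USER-QUIP (before the rename) or ML-QUIP (after it); the function name quip_pkg_flag is made up for illustration:

```shell
# Print the CMake package flag that matches the given LAMMPS source tree.
# $1: top-level LAMMPS directory (defaults to the current directory).
quip_pkg_flag() {
    src_dir=${1:-.}/src
    if [ -d "$src_dir/ML-QUIP" ]; then
        # Package name after the rename.
        echo "-DPKG_ML-QUIP=yes"
    elif [ -d "$src_dir/USER-QUIP" ]; then
        # Package name before the rename (e.g. stable_29Oct2020).
        echo "-DPKG_USER-QUIP=yes"
    else
        echo "no QUIP package directory found under $src_dir" >&2
        return 1
    fi
}
```

From inside the build directory this could be used as `cmake3 "$(quip_pkg_flag ..)" -D QUIP_LIBRARY=/path/to/libquip.a ../cmake/`, where the library path is a placeholder.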
I have run the provided examples, e.g. in.gap:

LAMMPS (29 Oct 2020)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
using 1 OpenMP thread(s) per MPI task
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (10.968476 10.968476 10.968476)
1 by 1 by 1 MPI processor grid
reading atoms …
64 atoms
read_data CPU = 0.227 seconds
Neighbor list info …
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 4.3
ghost atom cutoff = 4.3
binsize = 2.15, bins = 6 6 6
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair quip, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 3.064 | 3.064 | 3.064 Mbytes
Step Temp E_pair E_mol TotEng Press
0 0 -10412.677 0 -10412.677 -107490.01
10 173.98393 -10414.096 0 -10412.679 -91270.969
20 417.38493 -10416.08 0 -10412.681 -42816.133
30 434.34789 -10416.217 0 -10412.68 2459.83
40 423.05899 -10416.124 0 -10412.679 22936.209
Loop time of 144.601 on 1 procs for 40 steps with 64 atoms

Performance: 0.024 ns/day, 1004.171 hours/ns, 0.277 timesteps/s
99.5% CPU use with 1 MPI tasks x 1 OpenMP threads

or in.sw:
LAMMPS (29 Oct 2020)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
using 1 OpenMP thread(s) per MPI task
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (5.4310000 5.4310000 5.4310000)
1 by 1 by 1 MPI processor grid
reading atoms …
8 atoms
read_data CPU = 0.218 seconds
Neighbor list info …
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 4.2258
ghost atom cutoff = 4.2258
binsize = 2.1129, bins = 3 3 3
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair quip, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 3.059 | 3.059 | 3.059 Mbytes
Step Temp E_pair E_mol TotEng Press
0 10 -34.68 0 -34.670952 32.206289
10 4.5659178 -34.675073 0 -34.670942 46.253731
20 1.606683 -34.672391 0 -34.670937 44.736892
30 6.7007748 -34.677011 0 -34.670948 16.403049
40 5.682757 -34.676087 0 -34.670945 18.696408
50 2.2140716 -34.672942 0 -34.670939 37.592282
60 5.0475382 -34.675512 0 -34.670944 37.331666
70 7.0990979 -34.677369 0 -34.670946 40.533757
80 5.7306189 -34.676128 0 -34.670943 47.748813
90 5.0895648 -34.675549 0 -34.670944 38.092721
100 4.1070919 -34.674659 0 -34.670943 28.737864
Loop time of 50.7807 on 1 procs for 100 steps with 8 atoms

My LAMMPS configuration is:

$ ./lmp -h
Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Oct 2020
Git info (stable / stable_29Oct2020)

[…]

OS: Linux 5.12.15-300.fc34.x86_64 on x86_64

Compiler: GNU C++ 11.1.1 20210531 (Red Hat 11.1.1-3) with OpenMP 4.5
C++ standard: C++11
MPI v3.1: MPICH Version: 3.4.1
MPICH Release date: Fri Jan 22 14:17:48 CST 2021
MPICH ABI: 13:10:1

…skipping 1 line
Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit

Installed packages:

USER-QUIP

And for reference the valgrind output:

==264057== Memcheck, a memory error detector
==264057== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==264057== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==264057== Command: /home/akohlmey/compile/lammps-quip/build/lmp -in in.sw
==264057==
==264057==
==264057== HEAP SUMMARY:
==264057== in use at exit: 867,165 bytes in 897 blocks
==264057== total heap usage: 500,250 allocs, 499,353 frees, 15,924,949,385 bytes allocated
==264057==
==264057== 1 bytes in 1 blocks are still reachable in loss record 1 of 52
==264057== at 0x484086F: malloc (vg_replace_malloc.c:380)
==264057== by 0x84F2C4: __extendable_str_module_MOD_extendable_str_concat (ExtendableStr.f95:276)
==264057== by 0x85B7ED: __dictionary_module_MOD_add_entry (Dictionary.f95:2573)
==264057== by 0x85FF2F: __dictionary_module_MOD_dictionary_add_array_i (Dictionary.f95:1702)
==264057== by 0x7F37EC: __atoms_types_module_MOD_atoms_add_property_int (Atoms_types.f95:778)
==264057== by 0x81B4FA: __atoms_module_MOD_atoms_initialise (Atoms.f95:420)
==264057== by 0x7A0B09: quip_lammps_wrapper (quip_lammps_wrapper.f95:83)
==264057== by 0x785B37: LAMMPS_NS::PairQUIP::compute(int, int) (pair_quip.cpp:155)
==264057== by 0x5325E1: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:134)
==264057== by 0x4EEA78: LAMMPS_NS::Run::command(int, char**) (run.cpp:178)
==264057== by 0x44BBB6: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:791)
==264057== by 0x449CED: LAMMPS_NS::Input::execute_command() (input.cpp:774)

Anup_Pandey · July 20, 2021, 10:44pm

Bingo! I built from the git stable version and it worked! Thanks for your help, Axel.

Just to summarize, the following are the steps that I used.

#Step 1:

git clone --recursive https://github.com/libAtoms/QUIP.git QUIP

$ cd QUIP

$ export QUIP_DIR=/path/to/QUIP

$ export QUIP_ARCH=linux_x86_64_gfortran

$ make config

$ make libquip

#Step 2:

git clone -b stable https://github.com/lammps/lammps.git mylammps

cd mylammps # change to the LAMMPS distribution directory

mkdir build; cd build # create and use a build directory

cmake3 -C ../cmake/presets/most.cmake -D PKG_USER-QUIP=yes -D BUILD_MPI=no -D QUIP_LIBRARY=/path/to/QUIP/Library/libquip.a ../cmake/

cmake3 --build .

P.S. I had the issue with the source downloaded directly from https://www.lammps.org/download.html rather than from git.

Thanks,

akohlmey · July 21, 2021, 12:33am

You still have not provided a convincing explanation for the reported segfaults and valgrind outputs.

The source you download from www.lammps.org and what is in the git repo are identical. So that does not explain anything.
Moreover, since you apparently were not aware of the renaming of packages in the 2 July 2021 version, how would it be possible to have an older version of LAMMPS compiled with QUIP included? Thus, again, you are providing inconsistent information here.

As I already have shown, the only way I can think of to get the reported segfault and valgrind output is with a much older LAMMPS version. Also, the LAMMPS interface in the quip library has not changed for the last 4 years. So any LAMMPS version since about August 2017 and any QUIP library version since then should be API compatible with each other and thus not lead to segfaults elsewhere for the example input (since those are used for testing).

It is very important to know how it is possible to get those segfaults so that their cause can be addressed.
At the moment the only explanation is that you didn’t provide the correct information, which will put you on the “naughty” list.

Axel.

Anup_Pandey · July 21, 2021, 12:54am

I don’t have an explanation for the segmentation fault that occurred when I compiled LAMMPS from the directly downloaded tar source (and not from git). It’s strange that LAMMPS (29 Oct 2020) compiled with the old package name and then produced a segfault. I provided you with all the details, and you pointed out the updated name. Now can you please explain what I did wrong to be put on a “naughty” list?

I can try compiling the old src with the new USER package name and do the debugging.