Two Questions about QUIP in LAMMPS

Hi,
I installed QUIP following the steps on its official website exactly, but I encounter the following problems during simulation. Below is the input file I used; the XML file is from Deringer’s model (aC_GAP_data_main\training\full_das\carbon.xml).

units          metal
dimension      3
atom_style     atomic
boundary       p p p

# create region including Fixed layer, Constant temperature layer, Newton layer
# whole system
region         box block  -0.446 31.7  -0.446 31.7 -38 80 units box 
create_box     1 box 
lattice        diamond    3.56683

# for cutting1 tool
region         bulkn   block 0 31.4 0 31.4 -34 -10 units box

# create atoms
create_atoms   1  region bulkn  units box

# set mass for all atoms
mass           1  12.0110 

neighbor       2.0 bin     # 2.0 Angstrom skin (metal units)
neigh_modify   every 1 delay 5 check yes

# add interacions
pair_style	quip
pair_coeff	* * carbon.xml "IP GAP label=GAP_2016_10_18_60_23_11_22_108" 1

timestep       0.0005

fix            1 all   nve

run            10

Problem 1:
When I run with these files, something goes wrong. The first screenshot shows the error message from single-core operation, the second shows the error message from multi-core operation, and the third shows a single-core run checked with Valgrind.
The correctness of the input file has been verified many times with the XML file from LAMMPS’s examples and CEDIP. Further testing revealed that the offending command was on line 9 of the input file. When I replaced that line with “region box block -0.446 10 -0.446 10 -38 80 units box”, everything is OK.

Problem 2:
The simulation speed is too slow to satisfy my requirements.
Each step takes three to five minutes (about 4000 atoms), and four cores require as much wall time as a single core. I don’t know what is wrong with my QUIP installation or my parallel setup.
I tested several XML files and found that running in parallel never improved the speed, even though my installation steps follow the official website exactly. What might be the problem?

Best regards,
LZQ

To be able to comment on your compilation and on how you run LAMMPS, you have to provide exact details of the steps you took: your LAMMPS version, compiler, OS environment, build commands, and the command lines you used. Stating “exactly as on the website” is not sufficient, since those instructions are not exact and need to be adapted; that is where people make different choices, and those choices have an effect.

I cannot reproduce the failure you are reporting with the current stable version of LAMMPS (29 Sep 2021). It was compiled on a Fedora Linux 35 machine (GCC 11.2, CMake 3.22.1, MPICH 3.4.1), with automatic download of the QUIP libraries, using:

cmake -S cmake -B build-quip -DPKG_ML-QUIP=on 
cd build-quip/
make -j8

It would be helpful to obtain a stack trace to see where the failure happens; see the section “11.3. Debugging crashes” in the LAMMPS documentation.

The valgrind-detected memory access issue you are seeing is due to the QUIP library code making large memory allocations on the stack, and thus using more stack space than valgrind reserves by default. You left out the earlier section of the valgrind output where it comments on that. If you run valgrind with e.g. --max-stackframe=2560000 added, those invalid memory access reports should go away.
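For reference, such an invocation might look like the following sketch, where “lmp” and “in.carbon” are placeholder names for your LAMMPS binary and input file:

```shell
# Allow larger stack frames so QUIP's big stack allocations are not
# misreported as invalid accesses; binary and input names are placeholders.
valgrind --max-stackframe=2560000 ./lmp -in in.carbon
```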

If your default shell settings also use a rather restrictive stack limit, it may be worth executing the command ulimit -s unlimited before running LAMMPS with QUIP to lift the stack limitation (if permitted). That may avoid issues with the stack requirements of the QUIP library.
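Concretely, that sequence could look like this sketch (the binary and input names are placeholders, and raising the soft limit only works up to the hard limit your administrator has set):

```shell
# Lift the soft stack limit for this shell session, then run LAMMPS with QUIP.
ulimit -s unlimited
mpirun -np 4 ./lmp -in in.carbon
```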

GAP is documented to be very accurate but also rather slow, sometimes 1-2 orders of magnitude slower than other machine-learning potentials for only a minor gain in accuracy. Please see, e.g., the talks from the Virtual LAMMPS Workshop and Symposium (August 10-13, 2021).

If you want faster execution, you will have to trade some accuracy for better performance. Recent versions of LAMMPS ship with five different ML pair styles, plus ML-IAP, a framework for assembling your own potential from the available components. There are also multiple external ML packages; e.g. DeepMD can be loaded as a plugin.

Your input has a load-balancing issue. LAMMPS uses a domain decomposition of the simulation-cell volume and thus by default assumes that the volume is homogeneously filled with atoms, which is not the case in your input. With 4 MPI ranks, LAMMPS will by default create 4 subdomains in the z-direction, but the cell is mostly empty in that direction, so most of the subdomains will contain no atoms. This can be avoided by telling LAMMPS not to subdivide the z-direction: add the line processors * * 1 to the input before creating the box. For 4 MPI ranks, that results in a 2 x 2 x 1 subdivision instead of a 1 x 1 x 4 grid. On my machine, that gives a 3x speedup for 4 MPI ranks over 1 MPI rank.
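Applied to the input deck above, the only change is one extra line before create_box; a sketch of the relevant excerpt:

```
units          metal
dimension      3
atom_style     atomic
boundary       p p p

processors     * * 1    # do not subdivide the sparsely filled z-direction

region         box block  -0.446 31.7  -0.446 31.7 -38 80 units box
create_box     1 box
```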

Loop time of 237.613 on 4 procs for 10 steps with 4374 atoms
Loop time of 728.189 on 1 procs for 10 steps with 4374 atoms

This is on an Intel quad core i5 CPU with 2.4 GHz clock.

I’m sorry for not describing the problem clearly.
LAMMPS is installed on Ubuntu 20.04 in a VMware virtual machine. The host CPU is an AMD Ryzen 5 3600 (6-core) processor. The MPI version is MPICH 3.3.2, and the FFTW version is 3.3.8. Following are my compiler settings for LAMMPS.

# compiler/linker settings
# specify flags and libraries needed for your compiler

CC =		 g++ # mpicxx
CCFLAGS =	-g -O3 
SHFLAGS =	-fPIC
DEPFLAGS =	-M

LINK =		icpc
LINKFLAGS =	-g -O3
LIB = 
SIZE =		size

ARCHIVE =	ar
ARFLAGS =	-rc

SHLIBFLAGS =	-shared

The QUIP architecture is linux_x86_64_gfortran. The QUIP installation steps were as follows.

export QUIP_ARCH=linux_x86_64_gfortran
export QUIPPY_INSTALL_OPTS=--user  # omit for a system-wide installation
make config  # select "y" for GAP and TurboGAP (https://turbogap.fi/wiki/index.php/Installation) functionality
make 
make libquip

After linking QUIP into LAMMPS, I can run the QUIP examples shipped with LAMMPS. And then the problems described above appear.

Best regards,
LZQ

I have nothing to add to my suggestions. None of the additional information you provide helps to confirm that there is a problem:

  • the valgrind report is bogus, because valgrind defaults to a 2 MB stack and the QUIP library requires more. Mind you, the reported access issues are inside the QUIP library code, not LAMMPS.
  • I don’t see an issue with your input with the latest LAMMPS source version.
  • the parallel performance issue is explained by load imbalance.
  • the slow overall performance of GAP is explained by the talk and corresponding publication I pointed out.

If you need additional assistance you will have to first refute my assessments.

I took your advice on valgrind and tested with --max-stackframe=2560000 added. The following error occurred on a single-core run; it seems to be a problem with gfortran.


It suddenly occurred to me that there was a “cannot find -lgfortran” error when I typed make mpi after installing QUIP. This was because there was no libgfortran file under /usr/lib, so I made a symlink from /usr/lib/x86_64-linux-gnu/9/libgfortran.so into /usr/lib. I wonder if this operation is legitimate?
I think you’re right about the efficiency of parallel computing, but the problem is that even running on a single core fails.
Best regards,
LZQ

This sounds like you didn’t follow the compilation instructions correctly.

gfortran didn’t seem to have any problems when I reinstalled the dependencies, but the “cannot find -lgfortran” error still occurs when I type make mpi. I didn’t modify any makefiles.


Should I reinstall the system to fix this problem?

Best regards,
LZQ

This has nothing to do with the OS installation and everything with you not paying sufficient attention to the documentation and my suggestions.
You have asked for advice, you have been given plenty. I have demonstrated and confirmed that the QUIP package and your provided input deck works without a hitch if LAMMPS is compiled correctly. The rest is up to you.

Thanks for your patience.

Because of a mistake on my part, the whole system crashed, so I had to reinstall everything. To prevent further problems, I tried to follow the instructions given in the manual for this installation. The versions I used are LAMMPS (29 Sep 2021), MPICH (3.4.3), and GCC (9.3.0).

Everything was fine until the installation was complete, but the final test still failed. Fortunately, this error message is different from the previous one. The following information shows the results of single-core valgrind debugging, multi-core valgrind debugging, and single-core gdb debugging.

An obvious error message appears only during single-core valgrind debugging, but I do not understand what it means.

Thanks again for your reply.
Best regards,
LZQ

The serial valgrind report is unrelated and obviously comes from an MPI support library.
Running valgrind on mpirun is pointless, since you don’t want to check the mpirun program itself.

I have nothing else to add to what I previously wrote. After so many posts there is still no evidence proving that you did the entire configuration and compilation correctly. My patience has run out.

Have a nice day.

I’m sorry I asked you questions that you thought were pointless.

But surprisingly, after simply restarting the computer, these errors never appeared again. Although the parallel speedup is not great, that may be related to other settings, and it no longer seems to matter.

I really appreciate your help.
Best regards,
LZQ