Combining gpu pair_style

Hello all,

I keep getting seg faults when reading a restart file, while having no problem reading data files. I am combining a pure pair style “pair_style lj/cut/coul/long/gpu 10.0 10.0” with hybrid styles for the bonded interactions, such as dihedrals.

Would appreciate any hints on why it suddenly started happening.

mpirun -np 18 /scratch/mylammps/build/lmp_gpu -in run.in.nvt

LAMMPS (5 Jun 2019)
using 1 OpenMP thread(s) per MPI task
Reading restart file …
restart file = 5 Jun 2019, LAMMPS = 5 Jun 2019
restoring atom style full from restart
orthogonal box = (0 0 0) to (80 80 80)
3 by 2 by 3 MPI processor grid
restoring pair style lj/cut/coul/long from restart
restoring bond style hybrid from restart
restoring angle style hybrid from restart
restoring dihedral style hybrid from restart
restoring improper style hybrid from restart
18350 atoms
12345 bonds
6610 angles
565 dihedrals
10 impropers
Finding 1-2 1-3 1-4 neighbors …
special bond factors lj: 0 0 0.5
special bond factors coul: 0 0 0.5
4 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
12 = max # of 1-4 neighbors
14 = max # of special neighbors
special bonds CPU = 0.00136442 secs
read_restart CPU = 0.0171756 secs
[ift0130251:02960] *** Process received signal ***
[ift0130251:02960] Signal: Segmentation fault (11)
[ift0130251:02960] Signal code: Address not mapped (1)
[ift0130251:02960] Failing at address: 0x564d3d53f000
[ift0130251:02974] *** Process received signal ***

Tatiana Kuznetsova

Department of Physics and Technology

University of Bergen Norway

what exactly makes this a “sudden” change?

without more detailed information and a simple way to reproduce what you are doing, it is impossible to make any specific statements about potential problems.

axel.

Hi,

Here's a small deck that reproduces my problem even running on a single processor.
Running with -in system.in minimizes the system of five molecules (70 atoms each) and creates a data file that is read
in run.in.equil, where a 10,000-step run writes two running restart files, restart.1 and restart.2, plus final restart and data files.

As seen in the logs, only a restart from a data file has proven successful; each of the three binary restart files produces the seg fault error.

Regards

Tatiana Kuznetsova

deck_PLAEO7PLA.tar.gz (74.8 KB)

thanks for the suitable and instructive input deck, but you have still not explained what made this a “sudden” change.
axel.

I used to be able to run from restart files :)

TK

with the same force styles? and with which version of LAMMPS?

does it depend on using the GPU? you mention using a GPU, but in your logs and command lines there is no indication that you are using a GPU (only a possibly GPU capable binary, but that doesn’t use the GPU unless instructed to do so).

axel.

No, it's the first time I have tried combining a non-hybrid pair_style with hybrid styles for bonded interactions (but I can't find any warnings against that in either the LAMMPS or the Moltemplate documentation).

My executable *is* GPU-capable, and my ultimate goal is to offload non-bonded pair interactions to my videocard via specifying "pair_style lj/cut/coul/long/gpu 10.0 10.0" directly instead of a suffix, and invoking the GPU support by "-pk gpu 1". Using a non-GPU plain pair_style was meant as a first step towards this goal.

Regards

TK

Hi,

I see here https://lammps.sandia.gov/threads/msg17524.html that reading from restart files had a very similar seg fault problem back in 2011, albeit in the case of writing per-MPI-task restarts.

Cheers

TK

then why do you put it in your subject line? you are pointing people into a completely wrong direction.
as far as i can tell, the problem has nothing to do with GPUs or the pair style in the first place. …and it shouldn’t. restarting is a capability that is handled by each style independently and also by the base class of any GPU enabled class.

some quick tests that i just did point to a possible general bug in the dihedral style table.
if you add the following line after read_restart, it will recreate and reinitialize the dihedral sub-styles and the segfault goes away.

dihedral_style hybrid opls fourier table spline 400

axel.

Hi,

I see here https://lammps.sandia.gov/threads/msg17524.html that reading from restart files had a very similar seg fault problem back in 2011, albeit in the case of writing per-MPI-task restarts.

that is a completely different issue, since the segmentation fault happens in a completely different part of LAMMPS. that issue has long been resolved.

axel.

Thanks! Reinitializing the hybrids did the trick!

Going to test switching to GPU acceleration for nonbonded interactions.

Cheers

TK

then why do you put it in your subject line? you are pointing people into a completely wrong direction.
as far as i can tell, the problem has *nothing* to do with GPUs or the pair style in the first place. ...and it shouldn't. restarting is a capability that is handled by each style independently and also by the base class of any GPU enabled class.

some quick tests that i just did point to a possible general bug in the dihedral style table.

It is possible. I never use restart files (I hate them), and I
probably did not test that feature very carefully when I wrote this
dihedral style.

if you add the following line after read_restart, it will recreate and reinitialize the dihedral sub-styles and the segfault goes away.
dihedral_style hybrid opls fourier table spline 400

I will take a quick look at it now, running it with gdb. If I can
resolve it quickly, I'll get back to you today. If not, I'll put it
on my to-do list.
(But my to-do list is pretty long at the moment. I will work on this
sometime in the next couple weeks.)

Thanks for the bug report and many thanks Axel for diagnosing it. I
am taking a look at it now...

Andrew

actually, the real problem is how dihedral style hybrid handles restarts. it essentially will create the dihedral styles, but has no provisions to store and restore the settings of the individual sub-styles. thus, for the dihedral style table, the settings are missing, in particular the number of table points. that triggers access to storage that has not been allocated to the proper dimensions.
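
The mechanism described above can be sketched with a toy model. This is not LAMMPS code: ToySubStyle, ToyHybrid, npoints and set_points are hypothetical names, and the sketch assumes the hybrid records only which sub-styles exist. It shows a per-sub-style setting falling back to its default after a restart round trip until the style command is issued again.

```cpp
// Toy model only, NOT the LAMMPS source; every name here is hypothetical.
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct ToySubStyle {
    std::string name;
    int npoints = 0;  // a per-sub-style setting, e.g. the number of table points
};

struct ToyHybrid {
    std::vector<ToySubStyle> styles;

    // Saves only the sub-style names (assumption): no per-sub-style settings.
    void write_restart(std::ostream &out) const {
        out << styles.size() << '\n';
        for (const auto &s : styles) out << s.name << '\n';
    }

    // Recreates each sub-style by name; its settings stay at their defaults.
    void read_restart(std::istream &in) {
        std::size_t n = 0;
        in >> n;
        styles.assign(n, ToySubStyle());
        for (auto &s : styles) in >> s.name;
    }

    // Stand-in for re-issuing the style command with explicit settings.
    void set_points(const std::string &name, int npoints) {
        assert(npoints > 0);
        for (auto &s : styles)
            if (s.name == name) s.npoints = npoints;
    }
};
```

In this sketch, a ToyHybrid holding a "table" sub-style with 400 points comes back from read_restart() with npoints equal to 0; calling set_points("table", 400) again plays the role of repeating the dihedral_style line after read_restart.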

this is non-trivial but solvable. i am almost there.

axel.

This might be a bigger bug than just dihedral_table.cpp.
It might also be a bug in bond_table.cpp and angle_table.cpp as well as
dihedral_table.cpp, but I'm too lazy to write a test case for
bond_table and angle_table.
I'm taking a look at this and will give you details soon.

One thing that should be done:
The documentation for "bond_style_table.txt", "angle_style_table.txt",
and "dihedral_style_table.txt" should be updated to warn users that,
after reading a restart file, they must include a bond_coeff,
angle_coeff, or dihedral_coeff command
in order to read in the table data. The restart files do not contain
this information. Here is an excerpt from pair_table.txt:

"This pair style writes the settings for the “pair_style table”
command to binary restart files, so a pair_style command does not need
to specified in an input script that reads a restart file. However,
the coefficient information is not stored in the restart file, since
it is tabulated in the potential files. Thus, pair_coeff commands do
need to be specified in the restart input script."

Other observations made so far:
(BORING DETAILS. FEEL FREE TO SKIP.)
When I wrote dihedral_table.cpp, I just copied the code from
angle_table.cpp and prayed that it works. Whatever problems
dihedral_table.cpp has, it probably shares with bond_table.cpp and
angle_table.cpp. In all 3 cases, the write_restart() function only saves
two integers in the restart file: tabstyle and tablength. The
read_restart() function reads these two integers and invokes allocate(),
which (for some reason) never gets around to invoking
"spline_table(tb)". That's a problem in this case, because that
function, spline_table(tb), takes care of allocating the energy and
force tables tb->e2 and tb->f2. LAMMPS is crashing in
dihedral_table.cpp:1319 because tb->e2 and tb->f2 are currently NULL.
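
To make the failure mode concrete, here is a rough toy sketch, not the actual LAMMPS source. ToyTable, ToyTableStyle, build_spline, ready and round_trip are hypothetical names, and empty vectors stand in for the NULL tb->e2 and tb->f2 pointers. It only assumes what is described above: the restart stores two integers, and the spline arrays are built in a separate step that the restart path never reaches.

```cpp
// Toy model only, NOT the LAMMPS source; every name here is hypothetical.
#include <cassert>
#include <cstddef>
#include <sstream>
#include <vector>

struct ToyTable {
    std::vector<double> e2;  // energy spline data (empty = never built)
    std::vector<double> f2;  // force spline data (empty = never built)
};

struct ToyTableStyle {
    int tabstyle = 0;
    int tablength = 0;
    ToyTable table;

    // Mirrors the described write_restart(): only two integers are saved.
    void write_restart(std::ostream &out) const {
        out.write(reinterpret_cast<const char *>(&tabstyle), sizeof(int));
        out.write(reinterpret_cast<const char *>(&tablength), sizeof(int));
    }

    // Mirrors the described read_restart(): restores the two integers
    // but never builds the spline arrays.
    void read_restart(std::istream &in) {
        in.read(reinterpret_cast<char *>(&tabstyle), sizeof(int));
        in.read(reinterpret_cast<char *>(&tablength), sizeof(int));
    }

    // Stand-in for the coeff step that reads the table and builds the splines.
    void build_spline() {
        assert(tablength > 0);
        table.e2.assign(static_cast<std::size_t>(tablength), 0.0);
        table.f2.assign(static_cast<std::size_t>(tablength), 0.0);
    }

    // True only when both spline arrays match the stored table length.
    bool ready() const {
        const std::size_t n = static_cast<std::size_t>(tablength);
        return tablength > 0 && table.e2.size() == n && table.f2.size() == n;
    }
};

// Writes a style to an in-memory "restart file" and reads it back.
inline ToyTableStyle round_trip(const ToyTableStyle &src) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    src.write_restart(buf);
    ToyTableStyle restored;
    restored.read_restart(buf);
    return restored;
}
```

In this toy, a style that went through round_trip() still reports tablength 400 but ready() is false until build_spline() is called again, which is the analogue of having to repeat the *_coeff commands after read_restart.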

Cheers
-andrew