Problem during minimization

Hi lammps users,
I want to simulate a system composed of 47000 atoms.
During minimization, LAMMPS exits without a reason.

there *has* to be a reason.

Log file is this:

LAMMPS (11 Jan 2012)

whoa there! LAMMPS version 11 Jan 2012? that is almost from the stone
age. we are close to the end of 2014. nobody will look into anything
or try to fix anything for a version of LAMMPS that is *that* old.

Scanning data file ...
  4 = max bonds/atom

[...]

  17 = max angles/atom
  39 = max dihedrals/atom
  1 = max impropers/atom
Reading data file ...
  orthogonal box = (0 0 0) to (73.41 42.438 157.192)
  2 by 2 by 4 MPI processor grid
  46849 atoms
  25241 bonds
  12961 angles
  590 dihedrals
  7 impropers
Finding 1-2 1-3 1-4 neighbors ...
  4 = max # of 1-2 neighbors
  11 = max # of 1-3 neighbors
  15 = max # of special neighbors
0 atoms in group TIP3P
250 atoms in group PCE_mol
250 atoms in group atom_print
0 atoms in group PCE_atom
1 atoms in group PCE_atom
3 atoms in group PCE_atom
5 atoms in group PCE_atom
7 atoms in group PCE_atom
9 atoms in group PCE_atom
11 atoms in group PCE_atom
12 atoms in group PCE_atom
13 atoms in group PCE_atom
14 atoms in group PCE_atom
16 atoms in group PCE_atom
18 atoms in group PCE_atom
20 atoms in group PCE_atom
22 atoms in group PCE_atom
24 atoms in group PCE_atom
26 atoms in group PCE_atom
28 atoms in group PCE_atom
30 atoms in group PCE_atom
32 atoms in group PCE_atom
34 atoms in group PCE_atom
36 atoms in group PCE_atom
38 atoms in group PCE_atom
40 atoms in group PCE_atom
42 atoms in group PCE_atom
44 atoms in group PCE_atom
46 atoms in group PCE_atom
48 atoms in group PCE_atom
50 atoms in group PCE_atom
52 atoms in group PCE_atom
54 atoms in group PCE_atom
56 atoms in group PCE_atom
58 atoms in group PCE_atom
60 atoms in group PCE_atom
61 atoms in group PCE_atom
WARNING: Resetting reneighboring criteria during minimization (min.cpp:167)
PPPM initialization ...
  G vector = 0.323231
  grid = 72 40 150
  stencil order = 6
  RMS precision = 3.20644e-06
  using double precision FFTs
  brick FFT buffer size/proc = 52245 28800 23220

this looks pretty big. are you sure that you have enough RAM for that
on the GPUs? remember that you need 8x the space (8 MPI ranks sharing
each GPU) *and* you offload the pair style and neighbor list as well.

--------------------------------------------------------------------------
- Using GPGPU acceleration for pppm:
- with 8 proc(s) per device.

hmm... 8 MPI tasks per GPU. that is a *lot*, particularly when you
offload both Pair and Kspace to the GPU. that is very likely to give you
bad performance, if not a "GPU deceleration". better to run only
Pair on the GPU and Kspace on the CPU. this way you make use of the
many CPU cores, while a single GPU is shared between 8 MPI tasks anyway
(so the possible "amount of acceleration" from the GPU is already cut
down to 1/8th).
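
for example, a rough sketch of that split in an input script looks like the
following (the pair style, cutoffs and the package line are placeholders, and
the exact arguments of the package gpu command differ between LAMMPS versions,
so check the documentation of the version you actually run):

  # run only the pair style on the GPU and keep PPPM on the CPU
  newton       off                               # GPU pair styles need newton pair off
  package      gpu 1                             # enable the GPU package (newer syntax)
  pair_style   lj/charmm/coul/long/gpu 8.0 10.0  # explicit /gpu suffix -> pair on the GPU
  kspace_style pppm 1.0e-4                       # no /gpu suffix -> kspace stays on the CPU

note that starting LAMMPS with the -sf gpu command line switch would also turn
pppm into pppm/gpu, so spelling out the /gpu pair style explicitly, as above,
is the easy way to keep kspace on the CPU.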

--------------------------------------------------------------------------
GPU 0: Tesla M2050, 448 cores, 2.1/2.6 GB, 1.1 GHZ (Single Precision)

hmm... you ask for kspace to be computed with high precision and then
have the GPU support compiled with all single precision. that doesn't
make much sense. you cannot even represent the level of accuracy that
you ask for in single precision floating point.
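
for reference, a minimal sketch of where that is configured (based on the GPU
library build files i have seen; names may differ in other versions): the
precision is selected at compile time via the CUDA_PRECISION variable in
lib/gpu/Makefile.*, and the GPU library plus LAMMPS then have to be rebuilt,
e.g.

  # CUDA_PRECISION = -D_SINGLE_SINGLE    (all single precision)
  # CUDA_PRECISION = -D_SINGLE_DOUBLE    (mixed precision)
  CUDA_PRECISION = -D_DOUBLE_DOUBLE      # all double precision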

GPU 1: Tesla M2050, 448 cores, 2.1/2.6 GB, 1.1 GHZ (Single Precision)
--------------------------------------------------------------------------

[...]

Memory usage per processor = 48.6385 Mbytes
Step TotEng PotEng KinEng E_vdwl E_coul Temp Press Volume
       0          inf          inf            0          inf          inf            0         -nan     489711.8

I use a CPU/GPU cluster for my simulations. The error log file is this:
[curie4:04376] *** Process received signal ***
[curie4:04376] Signal: Segmentation fault (11)
[curie4:04376] Signal code: Address not mapped (1)
[curie4:04376] Failing at address: 0xfffffffe640fc048

a segmentation fault can mean anything. does this input work with GPU only?
what amount of RAM does it require in total?

you definitely want to try with a current version of LAMMPS.

[...]

--------------------------------------------------------------------------

could you give me a suggestion please?

there are lots of them above, most of which are quite obvious.

axel.

If your time=0 energies and forces are INF, then the minimization is not
going to be able to do anything. That typically means you have a bad initial
config. I suggest you try w/out GPUs initially and see if the CPU version
also gives a bad initial config, and if so, fix that first (i.e. your data
file is a bad config).

Steve
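
For example, a minimal CPU-only sanity check along those lines is to evaluate
the starting energies and forces with a zero-step run before minimizing; this
is a fragment to adapt into the existing input (with all GPU-related
package/suffix settings left out), not a complete script:

  # inspect energies and forces of the initial configuration on the CPU
  thermo_style custom step pe fmax fnorm press
  run 0 post no          # evaluate energies/forces without timestepping

If pe or fmax are already inf/nan here, the problem is in the data file
(e.g. overlapping atoms or bad topology), not in the GPU setup.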

If a system blows up then sometimes an illegal bin index is computed for the neighbour lists, and once LAMMPS tries to write there it causes a segfault. So you should fix the initial system.

On a side note: Is there a reason why LAMMPS does not check if bin indices are valid? Is it for performance?

such a system is beyond repair. so why bother?

almost always such problems are caused by running unphysical setups
with a variable cell; an overflow (Inf) in the pressure then leads to
illegal binning factors, and there is nothing you can do about it but
correct the input and run without variable cell dimensions until the
system has somewhat relaxed. even if high potential energies cannot be
avoided, it is highly advisable to initially run without variable cell
dimensions, as the initial high energy will cause a massive expansion
that would take extremely long to recover from. so by simply
terminating, LAMMPS does save people some time. ;-)

Sure, I agree that crashing is the only way to deal with such a problem, but emitting a "nice" error (along the same lines as the "lost atoms" error) rather than just segfaulting would make it more obvious why the crash happens. I was wondering whether or not it was a deliberate design decision to skip this error checking during neighbour list building.

i would not say that it is a deliberate choice, unless you consider
the fact that developers prefer to work on issues that are more
subtle.

there is, in fact, a performance concern, so we want to avoid inserting
checks and tests into the inner parts of (nested) loops or into inline
functions, unless it cannot be avoided. some additional checks on valid
box data have been added to the various kspace styles this summer. in
principle, similar tests could be added to the variable cell modules
and commands. feel free to dig in and submit a patch. i am certain
that steve will appreciate anything that makes LAMMPS behave in a more
user-friendly way.
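
To illustrate the failure mode being discussed, here is a standalone toy
program (not actual LAMMPS source code): once a coordinate has overflowed to
inf or to some huge value, the computed bin index is garbage, and the later
array write is what segfaults; the if-test shows the kind of cheap validity
check that would turn this into a readable error instead.

  // toy example, not LAMMPS code: a blown-up coordinate produces a
  // garbage bin index, and writing to that bin is what segfaults.
  #include <cmath>
  #include <cstdio>
  #include <vector>

  int main() {
    const double boxlo = 0.0, binsize = 2.0;
    const int nbinx = 32;
    std::vector<int> binhead(nbinx, -1);

    // after a pressure/energy overflow a coordinate can become inf or huge
    double x = INFINITY;

    // converting inf (or a huge value) to int yields a garbage index
    int ib = static_cast<int>((x - boxlo) / binsize);

    // the kind of cheap per-atom validity test under discussion
    if (!std::isfinite(x) || ib < 0 || ib >= nbinx) {
      std::fprintf(stderr, "ERROR: atom coordinate outside the bin range "
                           "(has the system blown up?)\n");
      return 1;
    }

    binhead[ib] = 0;   // without the check this write can be out of bounds
    return 0;
  }

Whether such a test is cheap enough to live inside the innermost binning loops
is exactly the performance trade-off mentioned above.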

Can you post a simple example script where this happens? Is it only
on 1 proc or also in parallel?

thanks,
Steve