Seg fault while using fix ave/time

Dear users and developers,

  I ran into a segmentation fault while attempting to time-average a vector
quantity from a compute rdf. The error message is:

[ipe05:23873] *** Process received signal ***
[ipe05:23873] Signal: Segmentation fault (11)
[ipe05:23873] Signal code: Address not mapped (1)
[ipe05:23873] Failing at address: 0x108fcc740
[ipe05:23873] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf880)
[0x7fe94140e880]
[ipe05:23873] [ 1] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x7a343a]
[ipe05:23873] [ 2] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x7da9d9]
[ipe05:23873] [ 3] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x5218a4]
[ipe05:23873] [ 4] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x5f257a]
[ipe05:23873] [ 5] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x734454]
[ipe05:23873] [ 6] /home/vieira/lammps_arch20/src/lmp_openmpi() [0xdd5fdf]
[ipe05:23873] [ 7] /home/vieira/lammps_arch20/src/lmp_openmpi() [0xd9f636]
[ipe05:23873] [ 8] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x705ad6]
[ipe05:23873] [ 9] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x70391f]
[ipe05:23873] [10] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x70446e]
[ipe05:23873] [11] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x410276]
[ipe05:23873] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)
[0x7fe941077b45]
[ipe05:23873] [13] /home/vieira/lammps_arch20/src/lmp_openmpi() [0x41ee74]
[ipe05:23873] *** End of error message ***

I read a restart file and issued the following commands to perform the RDF calculation:

compute gr all rdf 1000 * * 1 1 2 2 2 3
fix grmed all ave/time 10 500 5000 c_gr[1] c_gr[2] c_gr[3] &
    c_gr[4] c_gr[5] c_gr[6] c_gr[7] c_gr[8] c_gr[9] mode vector file gr.0800

  If I comment out the fix ave/time, the script runs fine. If I try to average
a scalar quantity, it runs fine too.
  I browsed the mailing-list archives and found a similar problem with
averaging compute rdf. Was that case ever solved?

  What seems to be the problem in this case?
  Thank you for your attention, and let me know if you need any additional
information.

  Best,
  Luis

Dear users and developers,

I ran into a segmentation fault while attempting to time-average a vector
quantity from a compute rdf. The error message is:

This is not very helpful. Which version of LAMMPS are you using?

Unless you can produce a couple of very simple input files that allow someone else to reproduce this, it is not likely to get looked into.

OK, attached are an input script, the restart file and the output. The
script requires kspace and gpu packages.

The output clearly shows that the fix ave/time is failing somehow, and
that nothing seems to be wrong with the restart file.

What could be wrong here?

Best,
Luis

cool.001Kps.75000000 (865 KB)

in.prop (1.34 KB)

out.0800 (5.74 KB)

OK, attached are an input script, the restart file and the output. The
script requires kspace and gpu packages.

luis,

this doesn't really help. restart files are in general not portable
across machines and LAMMPS versions, so we'd either also need an input
script to generate the restart, or you have to try and see whether the
issue can be reproduced when using a data file.

please make an effort to reduce the requirements. the simplest
approach would be to add the commands that you suspect of not working
correctly to one of the existing LAMMPS example inputs, e.g. in.melt,
and see if you can reproduce the crash. if the issue is what you claim
it is, the specific system should not make a difference; otherwise you
have a hint that the system may have some impact. then you should try
to gradually build up the input from something simpler that works
correctly until you identify the step that breaks it.

these are all simple, common-sense procedures for debugging a program
that is suspected to have a bug. you don't just dump what you have on
somebody and let that person do all the legwork for you. :wink:

axel.
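
As an illustration of the suggestion above, grafting the suspect commands onto an existing example could look like the sketch below (hypothetical: in.melt has a single atom type, and the rdf bin count and ave/time intervals here are illustrative, not taken from the failing input):

```
# examples/melt/in.melt with the suspect rdf + ave/time commands appended
units        lj
atom_style   atomic
lattice      fcc 0.8442
region       box block 0 10 0 10 0 10
create_box   1 box
create_atoms 1 box
mass         1 1.0
velocity     all create 3.0 87287 loop geom
pair_style   lj/cut 2.5
pair_coeff   1 1 1.0 1.0 2.5
neighbor     0.3 bin
neigh_modify every 20 delay 0 check no
fix          1 all nve

# with no type pairs listed, compute rdf tallies a single g(r);
# its global array then has 3 columns: r, g(r), coord(r)
compute gr all rdf 100
fix grmed all ave/time 10 5 50 c_gr[1] c_gr[2] c_gr[3] mode vector file gr.melt

run          250
```

If this small input also crashes, the problem is in the rdf/ave-time path; if not, that is a hint the trigger is elsewhere in the original input.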

You are absolutely right. I will start by using the latest version and
rerunning my job (it took 20 days :frowning: ). I would not have imagined
that two-month-old code could still be buggy in something as common as time
averages.

Thanks anyway, Axel!

Luis

You are absolutely right. I will start by using the latest version and
rerunning my job (it took 20 days :frowning: ). I would not have imagined that two-month-old

that is about the worst option to try. first try to insert the
ave/time stuff into some small/fast input.

code could still be buggy in aspects such as time averages.

people use time averages a *lot*, and have been doing so for a long
time. so it is unlikely that there is a problem in the code unless it
is an extremely rare corner case.

that is why it is so important to break everything down into small
pieces, look at each piece, and then put it back together piece by
piece to locate the "trigger".

axel.

please also note that an input for debugging an issue doesn't have to
produce a physically meaningful simulation. it only needs to reproduce
the issue, as quickly and easily as possible.

axel.

Hi, everyone!

Attached are fast input scripts that reproduce my problem:

1- run in.LS2 (about 50 min on GPU)
2- run in.cool using any of the liq.T2000* restart files generated at step 1 (1 min)

extra packages required: GPU and KSPACE

  The error occurs right at the beginning of the RDF average calculation
in step 2.

  Any help on this would be greatly appreciated.

  Cheers,
  Luis

in.LS2 (3.56 KB)

in.cool (1.72 KB)

Hi, everyone!

Attached are fast input scripts that reproduce my problem

1- run in.LS2 (about 50 min on GPU)

do you seriously expect somebody to wait an entire hour (using a GPU
on top of that) just to generate a restart file?
that is not likely to happen.
that is not likely going to happen.

why don't you try to insert the failing rdf and ave/time commands into
one of the existing, fast inputs from the examples directories?

2- run in.cool using any of the liq.T2000* restart files generated at step 1 (1 min)

extra packages required: GPU and KSPACE

  The error occurs right at the beginning of the RDF average calculation
in step 2.

if you assume that the problem is in the rdf and ave/time
calculation, then there is no reason to think it needs to be run
with your exact system.

you also didn't say whether this requires running one or more instances of LAMMPS.

  Any help on this would be greatly appreciated.

you have to make it easier to help you.

axel.

Hi, everyone!

Attached are fast input scripts that reproduce my problem

1- run in.LS2 (about 50 min on GPU)
2- run in.cool using any of the liq.T2000* restart files generated at step 1 (1 min)

extra packages required: GPU and KSPACE

  The error occurs right at the beginning of the RDF average calculation
in step 2.

  Any help on this would be greatly appreciated.

please try running your second input without gpu support (comment out
the package gpu command and don't use -sf gpu),
and then also try running it with: package gpu force 0 0 1.0

do both of these work?

axel.

Hi, everyone!

Attached are fast input scripts that reproduce my problem

1- run in.LS2 (about 50 min on GPU)

do you seriously expect somebody to wait an entire hour (using a GPU
on top of that) only for generating a restart file?
that is not likely going to happen.

why don't you try to insert the failing rdf and ave/time command into
one of the existing and fast inputs from the example directories.

I assumed 50 min was no big deal. But that's fine; here is a shortened
script 1 that runs 10x faster.

2- run in.cool using any of the liq.T2000* restart files generated at step 1 (1 min)

extra packages required: GPU and KSPACE

  The error occurs right at the beginning of the RDF average calculation
in step 2.

if you assume that the problem is in the rdf and ave/time
calculation, then there is no reason to think it needs to be run
with your exact system.

I didn't assume this; I am telling you where the program stopped. I
don't know the reason, but it may be my system. I want to rule that
out first.

you also didn't say whether this requires running one or more
instances of LAMMPS.

one instance

  Any help on this would be greatly appreciated.

you have to make it easier to help you.

OK, made it easier. Same error here.

in.LS2 (3.56 KB)

[...]

I assumed 50 min was no big deal. But that's fine; here is a shortened
script 1 that runs 10x faster.

50 min is a *very* big deal. even 5 min can be a big deal. when
debugging, sometimes you need to run the same command many times to
narrow down the source of the problem, so a few minutes can quickly
become many hours. requiring a GPU is a big deal as well. most LAMMPS
developers do not use a GPU or build LAMMPS regularly with GPU
support, so it would first require logging into a machine that has a
suitable GPU and configuring and building a LAMMPS executable with GPU
support included. that also significantly raises the barrier for
somebody to be willing to look into this.

[...]

if you assume that the problem is in the rdf and ave/time
calculation, then there is no reason to think it needs to be run
with your exact system.

I didn't assume this; I am telling you where the program stopped. I
don't know the reason, but it may be my system. I want to rule that
out first.

this is definitely the wrong way to approach this. you should do
exactly the opposite, i.e. what i have already recommended twice: try
to reproduce the issue with some *other* input by adding the rdf and
ave/time calculations to it.

also, since you do use GPU support, you should *always* run a test
without GPU support to determine whether this has an impact.

specifically, the package gpu documentation warns about
incompatibilities with building neighbor lists on the GPU.
that is why i suggested trying: package gpu force 0 0 1.0
instead of: package gpu force/neigh 0 0 1.0

axel.

Yes, it worked with force instead of force/neigh

I'll look into the neighbor list build incompatibility issue.

Thanks!

Luis

???

once i knew all the details that were missing from your original post,
it was a fairly simple thing.
your g(r) calculation requires a neighbor list. the compute rdf
command enqueues a request for one. however, this is an "occasional"
neighbor list, which depends on a previous "regular" build to produce
a binning of atoms into subdomains. that is normally not a big issue,
since you do have such a request scheduled in most simulations. using
a GPU pair style is an exception, since that can offload the neighbor
list build to the GPU (which is often faster, as it avoids having to
transfer a lot of data across the bus). under these circumstances the
binning tables are missing in host memory, and thus you get a
segmentation fault where they are expected to be present.

i am not sure what the best approach is at this point: printing a
warning/error to remind people not to offload the neighbor list
creation, or implementing a workaround. would you mind trying out the
following modification to neighbor.cpp,
and then comparing the performance of package gpu force/neigh with
this change against package gpu force WITHOUT this change?

thanks,
      axel.

diff --git a/src/neighbor.cpp b/src/neighbor.cpp
index 28c0513..3d010ab 100644
--- a/src/neighbor.cpp
+++ b/src/neighbor.cpp
@@ -660,11 +660,13 @@ void Neighbor::init()
     // anyghostlist = 1 if any non-occasional list stores neighbors of ghosts
 
     anyghostlist = 0;
+    int anybuild = 0;
     for (i = 0; i < nrequest; i++) {
       if (lists[i]) {
         lists[i]->buildflag = 1;
         if (pair_build[i] == NULL) lists[i]->buildflag = 0;
         if (requests[i]->occasional) lists[i]->buildflag = 0;
+        if (lists[i]->buildflag) anybuild = 1;
 
         lists[i]->growflag = 1;
         if (requests[i]->copy) lists[i]->growflag = 0;
@@ -679,6 +681,17 @@ void Neighbor::init()
       } else init_list_flags1_kokkos(i);
     }
 
+    // no request has the buildflag set. set it on the first request,
+    // so we have a usable binning for any occasional neighbor lists
+    if (!anybuild) {
+      for (i = 0; i < nrequest; i++) {
+        if (lists[i]) {
+          lists[i]->buildflag = 1;
+          break;
+        }
+      }
+    }

Yes, it worked with force instead of force/neigh

I'll look into the neighbor list build incompatibility issue.

???

[...]

i am not sure what the best approach is at this point: printing a
warning/error to remind people not to offload the neighbor list
creation, or implementing a workaround. would you mind trying out the
following modification to neighbor.cpp,
and then comparing the performance of package gpu force/neigh with
this change against package gpu force WITHOUT this change?

Not at all. I'll do it when I have the time, ok?

Thanks again.
Luis

Dear lammps developers,

  I have two questions:

1- Since the latest version of LAMMPS (11 Sep 2014) I cannot set the
option "neigh no", either with the package command or with the -pk
command-line option. I circumvented the problem by changing the default
option in the code. I tried

-pk gpu 1 neigh no -sf gpu

LAMMPS gave me this message:

LAMMPS (11 Sep 2014)
ERROR: Illegal package gpu command (../fix_gpu.cpp:110)

2- Running the same system on the GPU and on the CPU gives me two
completely different answers for the per-atom potential energy. My
system uses pair_style born/coul/wolf, a tabulated bond interaction,
and neighbor lists built on the CPU (because of the bonds). The
dynamics and the per-atom forces and stresses are comparable, though.
What could be the problem?

  Thanks for the attention!

  Luis

The “neigh no” issue is a typo bug in fix_gpu.cpp.

Look for [iarg]+1 in the constructor and change it to [iarg+1].
I will post a patch.

I’ll let Trung comment on per-atom energy.

Steve

Hi Luis,

can you send a minimal input deck that reproduces issue #2?

Best,
-Trung

OK, here are the input files

- in.minimal (script)
- data.unrelaxed (data file)
- table_bond.dat (tabulated bond interaction)

To test it, place all the files in the same directory and run with the
MANYBODY, MOLECULE and GPU packages installed. The script generates a
.cfg file (AtomEye). Just run it with and without the GPU (1 CPU only).

Thanks for your help, Trung! Contact me if you need anything, ok?

Luis

data.unrelaxed.gz (679 KB)

in.minimal (2.15 KB)

table_bond.dat (40.3 KB)

Hi Luis,

the system you sent is still too big for debugging (37982 atoms, 18991
bonds), and it looks to me like the atoms are initialized on a
crystalline lattice.

It'd be helpful if you could send a much smaller system (i.e. one or
two unit cells), with a few atoms and bonds, that reproduces the
discrepancy between CPU and GPU runs.

Best,
-Trung