If I comment out the fix ave/time line, the script runs OK. If I try to average a
scalar quantity, it runs OK too.
I browsed the mailing-list archives and found a similar problem with
averaging compute rdf. Was that case ever solved?
What seems to be the problem in this case?
Thank you for the attention and let me know if you need any additional
information.
OK, attached are an input script, the restart file, and the output. The
script requires the KSPACE and GPU packages.
luis,
this doesn't really help. restart files are in general not portable
across machines and LAMMPS versions.
so we'd either also need an input script to generate the restart, or
you have to try and see if the issue can be reproduced when using a
data file.
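one way to take the restart file out of the equation is to switch to a data file. a minimal sketch, assuming the first run currently ends with a write_restart command (the liq.* file names here are just placeholders):

```
# in the first input, instead of:   write_restart liq.restart
write_data    liq.data

# in the second input, instead of:  read_restart liq.restart
read_data     liq.data
```

unlike a binary restart, a data file is plain text, so it stays readable across machines and LAMMPS versions.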
please make an effort to reduce the requirements. the simplest
approach would be to add the commands that you suspect of not
working correctly to one of the existing LAMMPS example inputs, e.g.
in.melt, and see if you can reproduce the crash. if the issue is what you
claim it is, it should not make a difference; otherwise you have a
hint that your system may have some impact. then you should try to gradually
build the input from something simpler that works correctly until you
identify the step that breaks it.
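for instance, appended to the stock examples/melt/in.melt before its run command. the compute/fix IDs, bin count, and output file name are made up for illustration, and the fix ave/time arguments assume the currently documented syntax:

```
# g(r) with 100 bins, time-averaged from 10 samples every 100 steps
compute myrdf all rdf 100
fix     avg   all ave/time 10 10 100 c_myrdf[*] file tmp.rdf mode vector
```

if this crashes the same way in in.melt, the problem is unlikely to be specific to your system.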
these are all simple, common-sense procedures for debugging a program
that is suspected to have a bug. you don't just dump what you have on
somebody and let that person do all the legwork for you.
You are absolutely right. I will start by using the latest version and
rerun my job (it took 20 days). I would not have imagined that 2-month-old
code could still be buggy in aspects such as time averages.
that is about the worst option to try. first try to insert the
ave/time stuff into some small/fast input.
people are using time averages a *lot* and have been doing so for a
long time. so it is unlikely that there is a problem in the code unless
it is an extremely rare corner case.
that is why it is so very important to break everything down into small
pieces, look at each piece, and then put it back together piece by
piece to locate the "trigger".
please also note that an input for debugging an issue doesn't have to
produce a physically meaningful simulation. it only needs to reproduce
the issue, as fast and easily as possible.
Attached are fast input scripts that reproduce my problem
1- run in.LS2 (about 50 min on GPU)
do you seriously expect somebody to wait an entire hour (using a GPU
on top of that) only to generate a restart file?
that is not likely going to happen.
why don't you try to insert the failing rdf and ave/time commands into
one of the existing fast inputs from the examples directories?
2- run in.cool using any of the liq.T2000* restart files generated at step
1 (1 min)
extra packages required: GPU and KSPACE
The error occurs right at the beginning of the RDF average calculation
in step 2.
if you assume that the problem is in the rdf and ave/time
calculation, then there is no reason to think it needs to be run
with your exact system.
you also didn't say whether this requires running one or more instances of LAMMPS.
Any help on this would be greatly appreciated.
please try running your second input without gpu support (comment out
the package gpu command and don't use -sf gpu),
and then also try running it with: package gpu force 0 0 1.0
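i.e. three variants of the same input, differing only in the package line (a sketch, assuming the script currently uses package gpu force/neigh together with the -sf gpu suffix; pick exactly one variant per run):

```
# variant 1: no GPU at all -- comment the package line out, drop -sf gpu
# package gpu force/neigh 0 0 1.0

# variant 2: forces on the GPU, neighbor lists built on the host
# package gpu force 0 0 1.0

# variant 3: original behavior, neighbor build offloaded to the GPU
# package gpu force/neigh 0 0 1.0
```

if variant 2 runs correctly while variant 3 crashes, the neighbor list offloading is the culprit.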
I assumed 50 min was no big deal. But that's fine; here is a shortened
script 1 that runs 10x faster.
I didn't assume this; I am telling you where the program stopped. I don't
know the reason, but it may be my system. I want to rule that out
first.
50min is a *very* big deal. even 5min can be a big deal. when
debugging, you sometimes need to run the same command many times to
narrow down the source of the problem, so a few minutes can quickly
become many hours. requiring a GPU is a big deal as well. most LAMMPS
developers do not use a GPU or build LAMMPS regularly with GPU
support. so it would first require logging into a machine that has a
suitable GPU and configuring and building a LAMMPS executable with GPU
support included. that also significantly raises the barrier for somebody
to be willing to look into this.
this is definitely the wrong way to approach this. you should do
exactly the opposite, i.e. what i have already recommended twice:
try to reproduce the issue with some *other* input by
adding the rdf and ave/time calculations to it.
also, since you do use GPU support, you should *always* run a test
without GPU support to determine whether this has an impact.
the package gpu documentation specifically warns about
incompatibilities with building neighbor lists on the GPU.
that is why i suggested trying: package gpu force 0 0 1.0
instead of: package gpu force/neigh 0 0 1.0
once i knew all the details that were missing from your original post,
it turned out to be a fairly simple thing.
your g(r) calculation requires a neighbor list. the compute rdf
command enqueues a request for one. however, this is an "occasional"
neighbor list, which depends on a previous "regular" build to produce a
binning of atoms into subdomains. that is normally not a big issue,
since you have such a request scheduled in most simulations. using
a GPU pair style is an exception, since that can offload the neighbor
list build to the GPU (which is often faster, as it avoids having to
transfer a lot of data across the bus). under these circumstances the
binning tables are missing in host memory, and thus you get a
segmentation fault where they are expected to be present.
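in other words, the failing combination looks roughly like this (the pair style, cutoff, and IDs are placeholders; any GPU pair style with an offloaded neighbor build, plus an occasional-list consumer such as compute rdf, should trigger it):

```
package    gpu force/neigh 0 0 1.0   # neighbor build offloaded to the GPU
pair_style lj/cut/gpu 2.5            # GPU pair style -> no "regular" host build

compute    myrdf all rdf 100         # requests an "occasional" host neighbor list
fix        avg   all ave/time 10 10 100 c_myrdf[*] file tmp.rdf mode vector
# -> the occasional build finds no host binning tables: segmentation fault
```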
i am not sure what the best approach is at this point: printing a
warning/error to remind people not to offload the neighbor list
creation, or implementing a workaround. would you mind trying out the
following modification to neighbor.cpp?
and then compare the performance of package gpu force/neigh plus
this change against package gpu force WITHOUT this change.
thanks,
axel.
diff --git a/src/neighbor.cpp b/src/neighbor.cpp
index 28c0513..3d010ab 100644
--- a/src/neighbor.cpp
+++ b/src/neighbor.cpp
@@ -660,10 +660,23 @@ void Neighbor::init()
   // anyghostlist = 1 if any non-occasional list stores neighbors of ghosts

   anyghostlist = 0;
+  int anybuild = 0;
   for (i = 0; i < nrequest; i++) {
     if (lists[i]) {
       lists[i]->buildflag = 1;
       if (pair_build[i] == NULL) lists[i]->buildflag = 0;
       if (requests[i]->occasional) lists[i]->buildflag = 0;
+      if (lists[i]->buildflag) anybuild = 1;
     }
   }
+
+  // no request has the buildflag set. set it on the first request,
+  // so we have a usable binning for any occasional neighbor lists
+  if (!anybuild) {
+    for (i = 0; i < nrequest; i++) {
+      if (lists[i]) {
+        lists[i]->buildflag = 1;
+        break;
+      }
+    }
+  }
I'll look into the neighbor list build incompatibility issue.
???
1- Since the latest version of LAMMPS (11-Sep-14) I cannot set the option
"neigh no" either with the package command or with the -pk command-line
option. I circumvented the problem by setting the default option in the
code. I tried
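for reference, with the syntax introduced in the 11Sep2014 release the neighbor-build setting became a keyword of the package command. a sketch based on the current documentation (adjust the GPU count to your machine):

```
# in the input script:
package gpu 1 neigh no           # use 1 GPU, build neighbor lists on the host

# or equivalently on the command line:
# lmp -sf gpu -pk gpu 1 neigh no -in in.cool
```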
2- Running the same system on the GPU and on the CPU gives me two completely
different answers for the per-atom potential energy. My system uses pair_style
born/coul/wolf, a tabulated bond interaction, and a neighbor list built on the
CPU (because of the bonds). Dynamics and per-atom forces and stresses are
comparable, though. What seems to be the problem?
To test it, place all the files in the same directory and run with the
MANYBODY, MOLECULE and GPU packages. The script generates a .cfg file (AtomEye).
Just run with and without GPU (1 CPU only).
Thanks for your help, Trung! Contact me if you need anything, ok?
the system you sent is still big for debugging (37982 atoms, 18991 bonds), and it looks to me like the atoms are initialized on a crystalline lattice.
It'd be helpful if you could send a much smaller system (i.e. one or two unit cells) with a few atoms and bonds that reproduces the discrepancy between the CPU and GPU runs.