[lammps-users] seg fault from restart file

Hello everyone. I am getting the following errors when I attempt to continue a run using the read_restart command:

LAMMPS (5 Nov 2010)
Reading restart file …
orthogonal box = (-22.8631 -18 -24.75) to (22.8631 18 24.75)
1 by 1 by 2 processor grid
[node151:20301] *** Process received signal ***
[node151:20301] Signal: Segmentation fault (11)
[node151:20301] Signal code: Address not mapped (1)
[node151:20301] Failing at address: 0xa4
[node151:20301] [ 0] /lib64/libpthread.so.0 [0x2aeb9992fb10]
[node151:20301] [ 1] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS4Atom9map_clearEv+0x177) [0x46c6f7]
[node151:20301] [ 2] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS9Irregular13migrate_atomsEv+0x2cf) [0x61b94f]
[node151:20301] [ 3] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS11ReadRestart7commandEiPPc+0x7bb) [0x6e1fab]
[node151:20301] [ 4] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input15execute_commandEv+0xd3d) [0x616b9d]
[node151:20301] [ 5] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input4fileEv+0x34a) [0x617dca]
[node151:20301] [ 6] /home/cforrey/bin/lmp_openmpi(main+0x4a) [0x621e1a]
[node151:20301] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aeb99b59994]
[node151:20301] [ 8] /home/cforrey/bin/lmp_openmpi(__gxx_personality_v0+0x301) [0x45d009]
[node151:20301] *** End of error message ***
[node151:20300] *** Process received signal ***
[node151:20300] Signal: Segmentation fault (11)
[node151:20300] Signal code: Address not mapped (1)
[node151:20300] Failing at address: 0x33994
[node151:20300] [ 0] /lib64/libpthread.so.0 [0x2b30934f9b10]
[node151:20300] [ 1] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS4Atom9map_clearEv+0x177) [0x46c6f7]
[node151:20300] [ 2] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS9Irregular13migrate_atomsEv+0x2cf) [0x61b94f]
[node151:20300] [ 3] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS11ReadRestart7commandEiPPc+0x7bb) [0x6e1fab]
[node151:20300] [ 4] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input15execute_commandEv+0xd3d) [0x616b9d]
[node151:20300] [ 5] /home/cforrey/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input4fileEv+0x34a) [0x617dca]
[node151:20300] [ 6] /home/cforrey/bin/lmp_openmpi(main+0x4a) [0x621e1a]
[node151:20300] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b3093723994]
[node151:20300] [ 8] /home/cforrey/bin/lmp_openmpi(__gxx_personality_v0+0x301) [0x45d009]
[node151:20300] *** End of error message ***

continue.base.1 (786 Bytes)

continue.0.1 (4.23 MB)

continue.1.1 (4.23 MB)

It's possible there was some bug with the 5Nov2010 version
and its restart files. Or you may be reading the restart
file with a different version of the code than the one that
wrote it, which can be a problem if the restart file format
has changed, as it occasionally does.

What I would do is try to use the tools/restart2data tool
in the 5Nov2010 version to convert the restart file
to a data file. If that works, then you should be able
to run with a read_data command.
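
For reference, the workflow would look roughly like this
(file names are placeholders, and the exact build step may
differ; the tool's usage is described in the tools section
of the manual):

cd tools/restart2data
make                        # build the converter that ships with the tarball
./restart2data my_restart_file my_data_file

and then in the continuation input script, instead of
read_restart:

read_data my_data_file

A data file may not carry everything a restart file does,
so some settings may need to be re-specified in the input
script.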

Steve

Steve, I was using the 5Nov2010 version both to create the restart files and to read them, so it wasn't a version inconsistency causing the problem. I just built 9Jan2011 and went through the process again (i.e., doing a preliminary run to create restart files and then trying to restart from them). Once again, a seg fault ensued:

[cforrey@…2309… CONT]$ ./start.sh
LAMMPS (9 Jan 2011)
Reading restart file …
orthogonal box = (-22.8631 -18 -24.75) to (22.8631 18 24.75)
4 by 3 by 4 processor grid
[node155:18501] *** Process received signal ***
[node155:18501] Signal: Segmentation fault (11)
[node155:18501] Signal code: Address not mapped (1)
[node155:18501] Failing at address: 0x28b54
[node155:18501] [ 0] /lib64/libpthread.so.0 [0x2affad943b10]
[node155:18501] [ 1] /home/cforrey/bin/lmp_9JAN11(_ZN9LAMMPS_NS4Atom9map_clearEv+0x178) [0x4707d8]
[node155:18501] [ 2] /home/cforrey/bin/lmp_9JAN11(_ZN9LAMMPS_NS9Irregular13migrate_atomsEv+0x2cf) [0x62843f]
[node155:18501] [ 3] /home/cforrey/bin/lmp_9JAN11(_ZN9LAMMPS_NS11ReadRestart7commandEiPPc+0x7c7) [0x7152f7]
[node155:18501] [ 4] /home/cforrey/bin/lmp_9JAN11(_ZN9LAMMPS_NS5Input15execute_commandEv+0xd3d) [0x62329d]
[node155:18501] [ 5] /home/cforrey/bin/lmp_9JAN11(_ZN9LAMMPS_NS5Input4fileEv+0x34d) [0x6248bd]
[node155:18501] [ 6] /home/cforrey/bin/lmp_9JAN11(main+0x4a) [0x62e93a]
[node155:18501] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2affadb6d994]
[node155:18501] [ 8] /home/cforrey/bin/lmp_9JAN11(__gxx_personality_v0+0x2f9) [0x461129]
[node155:18501] *** End of error message ***

chris,

can you send me the files (lammps inputs and data file,
best as a gzipped tar archive) that i can use to repeat
what you are doing and try to reproduce the segfault?
the segfault seems to originate in a part of the lammps code
that i recently got to know intimately, while trying to track down
some other bug (that actually happened at a totally different location).

in any case, this code was added not so long ago, but
it is also prone to crashing due to MPI programming bugs in
other parts of LAMMPS (or external packages). i am currently
testing code to help track those kinds of bugs down, so
your problem might be a good second test case for me.

thanks,
    axel.

Axel, files attached as .tar.gz. I have left the input script set to run for 100,000 steps, so that everything is exactly the way I produced the error. Nonetheless, you may of course find it convenient to shorten the run. If the system size is too large, etc., I would be happy to come up with a minimal system capable of producing the error. Hopefully it will be convenient to work with the files in their current form. Greatly obliged. Cheers,
Chris

LAMMPS_SEGFAULT.tar.gz (894 KB)

hi chris,

thanks for the files. i have been able to reproduce the issue
that you are seeing, and it looks like a different issue from
the one i am working on.

nevertheless, in order for you to be able to continue
your work, here is the simple workaround:
don't use per MPI task restarts, but single-file restarts.
for a system as small as yours, where you write
your restarts rather infrequently, that should not make a
significant difference in performance.

i.e. you should use

restart 25000 restart.1 restart.2

and then you can read them in with

read_restart restart.1

that should alleviate the immediate problem.
of course, the per mpi-task restart feature needs
to be made to work, too.
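
for completeness, a sketch of how the two input scripts fit
together (step counts and file names are placeholders, and if i
remember the syntax right, the per-task variant is the one where
the restart file name contains a '%' character):

# original run: write alternating single-file restarts
restart 25000 restart.1 restart.2
run 100000

# continuation run: read whichever file was written last, then keep going
read_restart restart.1
run 100000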

cheers,
   axel.

hi again,

if i apply the following change, the per-MPI task restart
appears to be working as well. i still need to double-check the hash
table variant, but from reading through the code in atom.cpp
this just seems to be the correct change.

steve,

i think this is a small enough change that you can apply it manually, right?

axel.

diff --git a/src/atom.cpp b/src/atom.cpp
index 16f2174..cbf9841 100644
--- a/src/atom.cpp
+++ b/src/atom.cpp
@@ -464,10 +464,12 @@ void Atom::map_init()
 void Atom::map_clear()
 {
   if (map_style == 1) {
+    if (!map_tag_max) map_init();
     int nall = nlocal + nghost;
     for (int i = 0; i < nall; i++) map_array[tag[i]] = -1;
 
   } else {
+    if (!map_nhash) map_init();
     int previous,global,ibucket,index;
     int nall = nlocal + nghost;
     for (int i = 0; i < nall; i++) {

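to show what the guard buys us, here is a simplified standalone
sketch of the pattern (placeholder names and sizes, not the actual
lammps classes): as far as i can tell, on read_restart the irregular
atom migration calls map_clear() before the tag-to-index map has been
set up, so the loops above poke into memory that was never allocated,
which matches the "address not mapped" segfaults in the backtraces.
lazily calling map_init() first avoids that.

#include <cstdio>
#include <vector>

// simplified stand-in for the array-style atom map in atom.cpp
// (placeholder names and sizes -- not the real LAMMPS classes)
struct TagMap {
  int map_tag_max = 0;           // 0 => map storage was never set up
  std::vector<int> map_array;    // global tag -> local index, -1 means "not here"
  std::vector<int> tag;          // global IDs of local + ghost atoms

  void map_init() {
    map_tag_max = 100;                     // placeholder; the real code derives this
    map_array.assign(map_tag_max + 1, -1);
  }

  void map_clear() {
    // the guard the patch adds: without it, the loop below would index
    // into map_array before it was ever allocated -> segfault
    if (!map_tag_max) map_init();
    for (int t : tag) map_array[t] = -1;
  }
};

int main() {
  TagMap m;
  m.tag = {3, 7, 42};
  m.map_clear();   // safe even though map_init() was never called explicitly
  std::printf("map entries after lazy init: %zu\n", m.map_array.size());
  return 0;
}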
Axel, that's a very interesting idea. I hadn't realized that you could keep all the binary data from an MPI run in a single file and then continue from that single file with another MPI run. The system I am working with is a small testing system; I will scale it up to a couple million particles. Do you have a guess as to how large a system the single-file restart trick can be applied to?

Steve, by the way, restart2data does indeed successfully convert the restart files to a text data file, so there are numerous workarounds for this problem.

Thanks,
Chris

> Axel, that's a very interesting idea. I hadn't realized that you could keep
> all the binary data from an MPI run in a single file and then continue from
> that single file with another MPI run. The system I am working with is a small

that is the way restarts were normally done until
somebody implemented the per-mpi-task restart.

> testing system; I will scale it up to a couple million particles. Do you
> have a guess as to how large a system the single-file restart trick can
> be applied to?

technically, there is no limit. practically, it depends on how often
you need to write out the restart, how large your system is, how
many processors you use and particularly how fast/slow the
underlying file system is.

i'd say under normal circumstances, using the single file restart
is the more convenient option and it comes at little to no cost
(compared to, say, multi-gigabyte car-parrinello MD restarts that
can take tens of minutes to write on thousands of processors).
thus i always use single file restarts, even for millions of particles.
but the jobs that i run don't need to restart a lot and the machines
can communicate well.

cheers,
   axel.

hi again,

> if i apply the following change, the per-MPI task restart
> appears to be working as well. i still need to double-check the hash
> table variant, but from reading through the code in atom.cpp
> this just seems to be the correct change.

confirmed. when using hash maps with per-mpi-task restarts,
the same type of segfault happens, and the suggested change
works for me.

cheers,
    axel.

Posted an 11Jan11 patch for this - thanks.
I always use single restart files, except for
very big problems. They should be fast to
write out, which you can monitor by the time
spent in I/O at the end of the run.

Steve