Issues with dump command and PRD

Greetings all,
I am fairly new to LAMMPS. In the past, I have been able to compile and run simple test cases in serial and in parallel. However, I have noticed that there seems to be a problem running a PRD simulation in conjunction with the "dump" command.

While testing the vacancy-diffusion-in-Si example provided in the folder ~/lammps-14May12/examples/prd/, I encounter an MPI crash, and a core.* file is generated, if I choose a partition like "-partition 4x2".

The only difference between my input and the example is that I uncomment the dump command:

dump events all custom 1 dump.prd id type x y z (line 83 of in.prd)

I find that the simulation runs fine, printing out the events, when I comment out the dump command, or when I set up the replicas with each replica using just one process; e.g., -partition 4x1 or -partition 8x1 both work fine.
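
For reference, the invocations I am comparing look roughly like the following (the MPI launcher and executable names are just placeholders for my build; note that -np has to match the total processor count implied by -partition):

mpirun -np 8 ./lmp_mpi -in in.prd -partition 4x2   # crashes with SIGSEGV once the dump is enabled
mpirun -np 4 ./lmp_mpi -in in.prd -partition 4x1   # runs fine
mpirun -np 8 ./lmp_mpi -in in.prd -partition 8x1   # runs fine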

This makes me suspect that the dump command has some issue with PRD runs. Typical error messages contain:
… Rank 2, Process 16333 received signal SIGSEGV(11)

MPI_COMM_WORLD rank 2 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11

Has anyone else encountered similar issues? It would be a great help if someone could clarify what is going wrong and advise me on how it can be addressed.

Thanks in advance for your time and help,
-Srujan

P.S.: I have tested this on SUSE Linux with SGI MPT 1.25 and Intel 11.1.xx compilers. The same executable works fine for normal runs, without any crashes.

I can reproduce the issue, and thanks to using OpenMPI I get a somewhat more reasonable error message:

[fermi:19045] *** An error occurred in MPI_Bcast
[fermi:19045] *** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0
[fermi:19045] *** MPI_ERR_TRUNCATE: message truncated
[fermi:19045] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

This looks like there could be some inconsistent MPI calls in the code.

This needs some parallel debugging.

axel.

I have been spending some time today looking into this and - for somebody who has never looked at this part of the code before and has never run production PRD calculations - it is not exactly straightforward to debug.

It is not directly the dump, but something related to it, that is causing the problems (the forces actually seem to get corrupted somehow).

What I can say at the moment is that you cannot have partitions with more than one MPI task.

The only workaround that I can suggest at the moment is to install the USER-OMP package and compile with OpenMP support. In fact, for the PRD example that even seems to be faster:

Here is the run without the dump and all-MPI:

mpirun -x OMP_NUM_THREADS=1 -np 8 \
    ~/compile/lammps-icms/src/lmp_openmpi-omp \
    -log none -in in.prd -partition 4x2 -echo screen
LAMMPS (14 Jun 2012-ICMS)
Running on 4 partitions of processors
Setting up PRD ...
Step CPU Clock Event Correlated Coincident Replica
100 0.000 0 0 0 0 0
200 0.572 400 1 0 4 1
700 2.385 2100 2 0 2 3
900 3.257 2600 3 0 1 3
1400 4.705 4300 4 0 1 2
1500 4.949 4400 5 1 1 2
1800 5.862 5300 6 0 2 3
2100 6.784 6200 7 0 1 3
Loop time of 6.78701 on 8 procs for 2000 steps with 511 atoms

And here is the same run with OpenMP parallelization for each replica instead:

mpirun -x OMP_NUM_THREADS=2 -np 4 \
    ~/compile/lammps-icms/src/lmp_openmpi-omp \
    -log none -in in.prd -partition 4x1 -echo screen -sf omp
LAMMPS (14 Jun 2012-ICMS)
Running on 4 partitions of processors
Setting up PRD ...
Step CPU Clock Event Correlated Coincident Replica
100 0.000 0 0 0 0 0
200 0.407 400 1 0 4 1
700 1.776 2100 2 0 2 3
900 2.445 2600 3 0 1 3
1400 3.808 4300 4 0 1 2
1500 4.044 4400 5 1 1 2
1800 4.925 5300 6 0 2 3
2100 5.817 6200 7 0 1 3
Loop time of 5.81973 on 4 procs for 2000 steps with 511 atoms

5.8 seconds is certainly faster than 6.8 seconds...

Perhaps somebody who knows more about the PRD code can look into it and solve the real problem.

cheers,
     axel.
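
(For anyone following along, here is a minimal sketch of the build steps Axel refers to, assuming a 2012-era LAMMPS source tree and a machine makefile whose compiler flags enable OpenMP, e.g. -fopenmp; the "openmpi" target name is only an example.)

cd lammps-14May12/src
make yes-user-omp     # enable the USER-OMP package
make openmpi          # rebuild; produces an executable such as lmp_openmpi

The per-replica runs are then launched with -sf omp and OMP_NUM_THREADS set, as in the second command above.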

Thank you, Axel, for taking the time to look at this. I will try the OpenMP route.

I observe that when the code crashes with -partition 4x2, it produces NaNs (not a number) in the log files (temperature and other quantities) for replicas > 0; this is probably a consequence of the MPI errors. Replica 0 seems to be fine, though.

Also, I noticed that the restart command doesn't seem to work with the prd command. The manual says the restart command can be used to generate files containing information from all replicas. The dump and restart issues are probably related.

Thanks,
-Srujan

I posted a 26Jun patch for this - please see if it now
works as you expect.

Steve

Thank you, Steve.
I tested the dump along with the prd command using the latest version of LAMMPS (i.e., 26 June). The dump works nicely with -partition 8x1, 4x2, etc.

Glad to see this fixed so promptly! Kudos!

Here is the output from a 4x2 partition (with dump):

LAMMPS (26 Jun 2012)
Running on 4 partitions of processors
Setting up PRD …
Step CPU Clock Event Correlated Coincident Replica
100 0.000 0 0 0 0 0
200 0.335 400 1 0 4 3
400 0.872 900 2 0 1 1
600 1.406 1400 3 0 2 0
700 1.601 1500 4 1 1 0
800 1.787 1600 5 1 1 0
1000 2.331 2100 6 0 1 2
1400 3.236 3400 7 0 1 0
1500 3.424 3500 8 1 1 0
...
8300 19.111 23800 31 1 1 1
8600 19.834 24700 32 0 2 1
8700 20.028 24800 33 1 1 1
9000 20.743 25700 34 0 1 1
9400 21.650 27000 35 0 1 1
9600 22.185 27500 36 0 1 0
9800 22.713 28000 37 0 1 1
10100 23.433 28900 38 0 1 2
Loop time of 23.435 on 8 procs for 10000 steps with 511 atoms

and from an 8x1 partition (with dump):

LAMMPS (26 Jun 2012)
Running on 8 partitions of processors
Setting up PRD …
Step CPU Clock Event Correlated Coincident Replica
100 0.000 0 0 0 0 0
200 0.622 800 1 0 7 5
600 2.334 3300 2 0 1 3
700 2.691 3400 3 1 1 3
1000 4.032 5100 4 0 1 3
1200 5.049 6000 5 0 2 3
1500 6.406 7700 6 0 1 2
...
8600 40.066 41400 39 0 3 5
8800 41.085 42300 40 0 1 3
9100 42.449 44000 41 0 2 4
9200 42.814 44100 42 1 1 4
9500 44.162 45800 43 0 2 7
9700 45.160 46700 44 0 7 6
9900 46.169 47600 45 0 1 0
10100 47.164 48500 46 0 4 4
Loop time of 47.1684 on 8 procs for 10000 steps with 511 atoms

The events are slightly different, but I guess that is expected due to the randomness.

However, I am wondering whether the restart command is supposed to produce restart.* files when used along with the PRD command. I am not able to see them being generated.

Thanks,
-Srujan

> The events are slightly different, but I guess that is expected due to the randomness.

yes

> However, I am wondering whether the restart command is supposed to produce
> restart.* files when used along with the PRD command. I am not able to see
> them being generated.

Works for me. Note this info on the prd doc page:

The restart frequency specified in the restart command is interpreted
differently when performing a PRD run. It does not mean the timestep
interval between restart files. Instead it means an event interval for
uncorrelated events. Thus a frequency of 1 means write a restart file
every time an uncorrelated event occurs. A frequency of 10 means write
a restart file every 10th uncorrelated event.

Steve
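
(As a minimal illustration of the above, the input-script side is just a restart line placed before the prd command; the filename here is arbitrary.)

# in in.prd, somewhere before the prd command:
restart 1 restart.prd     # with PRD, a frequency of 1 means: write a restart file after every uncorrelated event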

Steve,
Thanks for the note. The restart files are indeed generated once the frequency is set to a low number (e.g., 1). I was also able to use these files to kick-start a simulation again.

Best,
-Srujan