program exits when one of the partitions ends

Dear LAMMPS developers and users,

Greetings. I am running a case with multiple replicas by using "-partition" on the command line and a world-style "variable" in the input script. However, it appears that the whole program terminates when one of the replicas ends. My job script reads as follows:

#!/bin/bash
#BSUB -J spce-10700
#BSUB -e err
#BSUB -o out
#BSUB -q 768cpu
#BSUB -n 576

source /pkg/chem/lammps/setlammps

cat $LSB_DJOB_HOSTFILE > ./hostlist

echo "Your LAMMPS job starts at Date"

mpirun_rsh -np 576 -hostfile ./hostlist IPATH_UNIT=0 ~/lammps/src/lmp_linux -sf opt -p 576x1 -in in.spce -log spce.out

wait

echo "Your LAMMPS job completed at Date "

Do I need to add anything to the script or the input file to invoke the synchronization? Thank you so much for your attention and help.

LC Liu

PS: I am using the 2013/4/23 version. My apologies for not updating to the latest version; the compilation fails, and I am working with tech support on it now.

it appears that the whole program terminates when one of the replicas ends

If you wrote your input script correctly, that should not happen b/c LAMMPS
will not exit. I suggest you read section 6.4 of the manual and experiment
with simple scripts on your own box (running MPI) before trying 576 partitions
on a big machine.

For example, the script below works fine if I run:
mpirun -np 5 lmp_g++ -p 5x1 -in in.lj
LAMMPS does not exit until the biggest/slowest simulation
has finished.
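
(Steve's actual in.lj script is not preserved in this archive. The sketch below is a minimal input in the same spirit: the standard LJ melt settings plus a world-style variable, here named nsteps purely for illustration, so that each of the 5 partitions runs a different number of steps and finishes at a different time.)

# 3d Lennard-Jones melt on 5 partitions, each with a different run length
# launch with: mpirun -np 5 lmp_g++ -p 5x1 -in in.lj

variable        nsteps world 100 200 300 400 500   # one value per partition

units           lj
atom_style      atomic

lattice         fcc 0.8442
region          box block 0 10 0 10 0 10
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 3.0 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    every 20 delay 0 check no

fix             1 all nve

thermo          50
run             ${nsteps}   # fastest partition stops first; LAMMPS exits only after the slowest

Each partition writes its own log.lammps.N file, so it is easy to check afterwards that every partition reached the end of its run.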

Steve

Hi, Steve,

Thank you so much for the detailed comment. The reason for so many partitions is that I was running a tempering case before, and now I am ready for single runs in order to collect their dynamics. I tried this with 24 partitions, but the symptom persists (meaning, I checked the output files, and only one replica finished the job and printed out the message below; the others did not make it to the end).

I would like to monitor how the replicas invoke MPI_Barrier (timer->barrier_stop in LAMMPS)... So, could you help me locate the piece of code I should be looking at? Many thanks for your help.

LC Liu

Loop time of 390.639 on 1 procs for 10000 steps with 1536 atoms

Pair  time (%) = 216.626 (55.4543)
Bond  time (%) = 2.42502 (0.620784)
Kspce time (%) = 59.2307 (15.1625)
Neigh time (%) = 1.70731 (0.437055)
Comm  time (%) = 1.36675 (0.349876)
Outpt time (%) = 0.001333 (0.000341235)
Other time (%) = 109.282 (27.9751)

FFT time (% of Kspce) = 2.92624 (4.94042)
FFT Gflps 3d (1d only) = 1.87401 3.78948

Nlocal: 1536 ave 1536 max 1536 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 8529 ave 8529 max 8529 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 408572 ave 408572 max 408572 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 408572
Ave neighs/atom = 265.997
Ave special neighs/atom = 2
Neighbor list builds = 79
Dangerous builds = 0


that is not how this will work. as steve said, LAMMPS is supposed to
work correctly, if your input is correct. there should be no need to
hack the LAMMPS code until you can prove there is a bug in LAMMPS and
that your input is correct.

the easiest way to do this is to produce a minimal input using, say, a
variant of the melt example that would reproduce the problematic
behavior and post it here.

axel.

I would like to monitor how the replicas invoke MPI_Barrier (timer->barrier_stop
in LAMMPS)... So, could you help me locate the piece of code I should be looking at?

There is no such barrier. If the partitions are running independently, they
never communicate. Do your other 23 log files show any output?
I suggest again that you do what Axel and I both suggested. Start
simple with something that works (like what I posted), then build up
to your problem so you can see what breaks.

Steve

Hi, Steve and Axel,

Thank you so much for the help. Indeed the problem was not caused by the replicas. First, the melt script runs great, so there is no problem in my LAMMPS build. Second, I checked my own script. The last two commands are:

run 10000
write_restart ./restart/${tint}.${snap}

Somehow, after the fastest replica writes the restart file, it terminates and might issue an error message or something (?) to the other replicas, so they abort execution.

So, for now, I have just commented out the write_restart command. I will check how this happens later.
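
(A minimal sketch, not the original in.spce: one way to guarantee that each partition ends up with its own restart file is to build the file name from a world-style variable. The variable name widx and the 4-partition values are purely illustrative.)

# tail of an input script run with -p 4x1 (names illustrative)
variable        widx world 0 1 2 3                  # one index per partition

run             10000
write_restart   ./restart/final.p${widx}.restart    # a distinct file per partition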

Thank you again for the attention.

LC Liu


Somehow, after the fastest replica writes the restart file, it terminates and might issue an error message or something (?) to the other replicas, so they abort execution.

I don’t think that’s possible. Are you getting full log files from all the replicas,
just no final restart files? If so, can all replicas write to the directory,
and are you sure they are all writing to different files?

Steve

Hi, Steve,

I tried to play with my script, but this phenomenon persists.

  1. Yes, they are writing to different files.
  2. The other log files are not complete, except the first one, which also writes the restart file. So I get only one restart file and one full log file (the rest are partially done).
  3. I added a line to print a message after the write_restart command, and it does not appear in any log file, meaning the job stops before reaching the print line (see the sketch after this list).
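
(A sketch of the check described in point 3, reusing the illustrative widx variable from the note above: the print line only shows up in a partition's log if write_restart actually returned on that partition.)

run             10000
write_restart   ./restart/final.p${widx}.restart
print           "write_restart completed on partition ${widx}"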

Meanwhile, I added write_restart at the bottom of your melt script, and it works fine. So there must be something wrong in my script; I just can't dig it out yet.

Thank you so much for your help. I think, for now, I will just comment out the write_restart line…

LC Liu


here is a suggestion for debugging such issues:

- pick a multi-processor workstation (6-8 CPU cores should do) and try
to run there with just 3-4 partitions of 2 processors.
- compile a parallel LAMMPS executable with debug info included (use
the -g flag for compiling and linking)
- create file "batch-gdb" that has two lines:

directory /path/to/lammps/sources
run

- make sure you have "xterm" installed (the default X11 one, not the
beefed up replacements that come with linux distributions)
- make sure you have gdb installed
- run your executable like this:
mpirun -np 6 xterm -e gdb -x batch-gdb --args
/path/to/lammps/lmp_linux -sf opt -p 3x2 -in in.spce -log spce.out

this will pop up 6 terminals, run the debugger in each of them, and run
lammps inside it. your executable will drop back to the debugger prompt
at the end of the run and show you either that it finished successfully
or why it failed.

this is a bit convoluted, but considering what insane amounts of money
a license for a "proper" parallel debugger costs, it is extremely
effective. i found the main problem with parallel debugging is that
one needs it so rarely that one always has to relearn it almost from
scratch. turning this into quasi-serial debugging is thus a decent
compromise.

hope this helps,
    axel.

Hi, Axel,

Thanks for the direction. I will see what the debugger tells me.

LC Liu

On 2013/5/31 at 3:23 PM, "Axel Kohlmeyer" <[email protected]> wrote: