[lammps-users] parallel jobs stopped due to possible processors mapping issue

Hi, everyone,

Recently I came across a strange problem: one of my simulations will only run on 4 processors - whenever I try to run it on any other number of processors (except for 1), it returns an error like: "rank 3 in job 24 compute-1-4.local_48018 caused collective abort of all ranks exit status of rank 3: killed by signal 11"
This simulation has a cell size of 180x100x180 angstroms. The top part of the cell is a conical diamond indenter (about 20 angstroms in height), so that region of the cell is not fully occupied by atoms.
I got the above error when I tried to run it on 8 processors. I thought it could be a processor-mapping problem, so I tried to modify the mapping of the processors using the "processors" command. It did not work - either an error like the one above or "bad grid of processors" came back.
Can anybody help me out?

Much appreciated,

> Hi, everyone,
> Recently I came across a strange problem: one of my simulations will
> only run on 4 processors - whenever I try to run it on any other number
> of processors (except for 1), it returns an error like: "rank 3 in
> job 24 compute-1-4.local_48018 caused collective abort of all ranks exit
> status of rank 3: killed by signal 11"

sounds a lot like a bad input or overly aggressive parameters.

> This simulation has a cell size of 180x100x180 angstroms. The top part of
> the cell is a conical diamond indenter (about 20 angstroms in height),
> so that region of the cell is not fully occupied by atoms.
> I got the above error when I tried to run it on 8 processors. I thought it
> could be a processor-mapping problem, so I tried to modify the mapping of
> the processors using the "processors" command. It did not work - either an
> error like the one above or "bad grid of processors" came back.
> Can anybody help me out?

"bad grid of processors" means a wrong input where the number of
processors requested in the lammps script does not match the total
number of processors available to MPI.
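
for example (hypothetical numbers, just to illustrate the rule): the
product of the three requested grid dimensions has to equal the number
of MPI tasks the job is started with:

# ok only when the job is launched with exactly 8 MPI tasks, since 2 x 2 x 2 = 8
processors 2 2 2
# this would be a "bad grid" on 8 tasks, since 2 x 1 x 2 = 4
#processors 2 1 2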

the signal 11 or segmentation faults are typically just a secondary
symptom caused by MPICH (assuming that you are using MPICH), which
hides the real reason for the failure.

can you run the input with just one processor?

cheers,
   axel.

If you are saying LAMMPS is crashing on some number of procs,
then it could be a bug. LAMMPS tries hard to print an error message
before it dies (in which case it is not a bug, but a detected error condition).
Are you seeing any other error message?

Steve

Hi, Steve and Axel,
I think I understand why I was getting "bad grid of processors" - I remember making a silly mistake like running with 8 processors while using the command "processors 2 1 2" (a 2x1x2 grid only accounts for 4 processors). Thank you for pointing that out for me.

I am using MPICH v2 for parallel computing. And yes, I tried running it with only one processor, and it was fine (both with MPICH and without).

The following is the message printed on the screen when I run it on 8 processors:

LAMMPS (5 Jun 2010)
Reading data file …
orthogonal box = (-87.3235 -39.9476 -86.9681) to (87.3235 62 86.9681)
2 by 2 by 2 processor grid
225074 atoms
143706 atoms in group inner
50247 atoms in group fixed
1777 atoms in group indenter
145483 atoms in group output
29344 atoms in group outter
WARNING: Resetting reneighboring criteria during minimization
Setting up minimization …
rank 3 in job 30 compute-1-4.local_48018 caused collective abort of all ranks
exit status of rank 3: killed by signal 9
rank 2 in job 30 compute-1-4.local_48018 caused collective abort of all ranks
exit status of rank 2: killed by signal 9

And the following is my input script:

# 3d indenter simulation of CuNi indentation (MM), large indentation depth, step size 0.1 A
# energy tolerance 1e-12

units metal
dimension 3
boundary s s s

atom_style atomic
neighbor 0.3 bin
neigh_modify delay 5

# create geometry

read_data input.atoms_windent.cuni

mass 3 12.011
# potentials

pair_style hybrid eam/alloy/opt morse/opt 4.
pair_coeff * * eam/alloy/opt CuNiNbH.eam.alloy Ni Cu NULL
pair_coeff 3 3 morse/opt 0.100 1.7 0.22 3.
pair_coeff 1*2 3 morse/opt 0.1 1.5 4 3.55

# define groups

region 1 block -80.0 80.0 -22.5 40. -80.0 80.0 units box
region 2 block INF INF INF -22.5 INF INF units box
region 3 block INF INF 43. INF INF INF units box
region 4 block -80.0 80.0 -22.5 INF -80.0 80.0 units box
group inner region 1
group fixed region 2
group indenter region 3
group output region 4
group outter subtract all inner fixed indenter

# define compute and make y flexible boundaries

compute pot inner pe/atom
compute disl inner centro/atom fcc

compute load inner group/group indenter
#compute strsp inner stress/atom pair
fix 1 fixed setforce 0.0 0.0 0.0
fix 2 outter setforce 0.0 NULL 0.0
fix 3 indenter setforce 0.0 0.0 0.0

# relaxation
minimize 1.0e-14 1.0e-14 10000 100000

# totally fix all boundaries and apply indenter
fix 2 outter setforce 0.0 0.0 0.0
# control outputs
thermo 0

thermo_style custom step c_load[2]

dump 1 output custom 1000000 ./dump/dump.* id x y z c_disl c_pot type
dump 2 inner custom 1000000 dump.cuni id x y z c_disl c_pot type

# going into loop now
label iloop
variable iter loop 170

unfix 3
displace_atoms indenter move 0. -0.1 0. units box
fix 3 indenter setforce 0.0 0.0 0.0

minimize 1.0e-12 1.0e-12 10000 100000

next iter
jump in.cuni_w2.indent iloop

If you are interested, I can send my input data to you.

Thank you both again for your attention,

Shuai

Hi, Steve and Axel,

My simulation (input script is posted below) runs well with one processor. However, I need to run it with a much larger cell size and total number of atoms in the very near future. In that case even four processors will not be sufficient - I expect to need at least 8.
Can you please take a look at my email below and tell me if there is anything I can do to make my simulation run on more than 4 processors?

Best,
Shuai

> Hi, Steve and Axel,
> My simulation (input script is posted below) runs well with one processor.
> However, I need to run it with a much larger cell size and total number of
> atoms in the very near future. In that case even four processors will not
> be sufficient - I expect to need at least 8.
> Can you please take a look at my email below and tell me if there is
> anything I can do to make my simulation run on more than 4 processors?

there is nothing inherent in any LAMMPS command that
limits the number of processors.

so from simply looking at your input script, it is impossible to
tell what is going wrong.

the problem with MPICH is that it typically "eats" the last few lines of
error messages when a job dies, so it is often difficult to know what
is leading up to the segmentation fault, which is in MPICH itself.
it is a mystery to me why people keep using it, since there are alternatives.
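
one partial workaround (a sketch, not a cure): make lammps flush its
screen and log output after every thermo step, so fewer lines get
swallowed when the job dies:

thermo 1                  # while debugging: thermo output every step
thermo_modify flush yes   # flush output buffers after each thermo line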

be that as it may, without being able to reproduce exactly what you
are seeing, *nobody* is going to be able to help you.

i am still hoping to get my crystal ball back from repair, but the
mechanic keeps telling me that it is broken beyond repair, so
i'll have to do without.

the most likely causes for the behavior you are describing are a bad
initial configuration, overlaps due to (small?) errors in the box size,
or neighbor-list related problems (causing lost atoms).
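
if lost atoms are the cause, the following (a minimal sketch; it only
changes the reporting and tames the minimizer, it does not fix the
underlying problem) can at least surface a warning before the crash:

thermo_modify lost warn   # warn about lost atoms instead of aborting with an error
min_modify dmax 0.05      # illustrative value: cap per-iteration atom displacement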

cheers,
    axel.

Thank you for your reply; I hope you can get your crystal ball back soon.

There is no input script/data file attached. If you are running
successfully on 1 proc, but crashing on a few procs, then post
the files for a small problem that crashes quickly.

Steve