[lammps-users] Lost atoms with MPI but not with serial?

I'm starting to submit some small test runs to a
cluster that has MPICH-GM installed, and I found
that when I run the job with MPI using 4 CPUs of the
cluster, I get lost atoms. If I run the job using
either a serial version of LAMMPS or with MPI using
just 2 CPUs of the cluster, the job runs fine. Why
the difference?

Notes:

* I uploaded the latest version of the LAMMPS source
to the cluster and compiled it myself.

* The LAMMPS runs do minimization but not MD.

James,

You typically lose atoms when they move too fast and fly across more than one processor's subdomain in a single step, since LAMMPS only communicates with a processor's direct neighbors. This can be the result of an integration (or minimization) step that is a bit too large. Atoms can't get lost this way when a processor's only neighbor is itself, which is the case in serial mode.
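If a too-large step is the cause, one thing worth trying is to cap how far the minimizer may move any atom in a single iteration, e.g. (a rough sketch; the 0.1 value is just illustrative, in distance units):

    # cap the per-iteration atom displacement during minimization
    min_modify dmax 0.1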

Pieter

There are many ways to lose atoms due to an invalid
problem setup. But if it seems well behaved on one
proc (i.e. the answer looks good and nothing moves
too far during a minimization), and it croaks in parallel,
it might be a bug. If you can reproduce it with a small,
simple system, then send an input script/config file.

Steve

Test input script and auxiliary files are attached.

test_lmp_run.tgz (4.12 KB)

--- "James J. Ramsey" <[email protected]...>
wrote:

I'm starting to submit some small test runs to a
cluster that has MPICH-GM installed, and I found
that when I run the job with MPI using 4 CPUs of the
cluster, I get lost atoms. If I run the job using
either a serial version of LAMMPS or with MPI using
just 2 CPUs of the cluster, the job runs fine. Why
the difference?

One more thing. For debugging purposes, I modified the
input script so that it would dump the atom positions
during the course of minimization. To my surprise, the
atoms were lost from the very first timestep, and by
that I mean the zeroth timestep. As far as I know,
that timestep should have the initial positions of the
atoms, before the minimizer has a chance to do
anything, and judging from the coordinates that I see,
that's exactly the case.
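Roughly, the line I added looks like this (the dump ID, interval, and filename here are illustrative, not necessarily what is in the attached script):

    # write atom coordinates at every minimization step, starting at step 0
    dump dbg all atom 1 dump.min.lammpstrj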

I ran your script. The problem is that the z box bounds in your data
file are much bigger than the atom extent. When LAMMPS shrink-wraps
the box on timestep 0, atoms get lost in parallel, b/c they need
to move across too many procs.

If you use zhi = 30, it runs fine.
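In the data file header, that means trimming the upper z bound down to roughly the top of the atom extent, e.g. (the lower bound of 0 here is just a placeholder for whatever your file already uses):

    0.0 30.0 zlo zhi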

See the info below from the read_data doc page.

Steve

If the system is non-periodic (in a dimension), then all atoms in the
data file should have coordinates (in that dimension) between the lo
and hi values. Furthermore, if running in parallel, the lo/hi values
should be just a bit smaller/larger than the min/max extent of atoms.
For example, if your atoms extend from 0 to 50, you should not specify
the box bounds as -10000 and 10000. Since LAMMPS uses the specified
box size to layout the 3d grid of processors, this will be sub-optimal
and may cause a parallel simulation to lose atoms when LAMMPS
shrink-wraps the box to the atoms.
