Moving wall simulations with MPI_Allreduce error.

Dear All,

I have run into a problem in a simulation in which a wall moves downward together with a block. My complete input file is too long, so I have narrowed the problem down to the following lines:

variable dx equal xcm(top_block,z)-60.541393 ##### displacement of top block

region hileftwall block INF -55.0 INF INF 53.0 INF units box side out move NULL NULL v_dx
region hirightwall block 55.0 INF INF INF 53.0 INF units box side out move NULL NULL v_dx

I got the error message:

[node16:4566] *** An error occurred in MPI_Allreduce
[node16:4566] *** on communicator MPI_COMM_WORLD
[node16:4566] *** MPI_ERR_TRUNCATE: message truncated
[node16:4566] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

mpirun has exited due to process rank 7 with PID 4573 on
node node16 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

[node16:04565] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node16:04565] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I have tried several ways:

  1. Running on a single CPU (serial executable) causes no problem.
  2. Both the 2014-02-01 version and the current 2014-04-05 version show this problem. I noticed someone here mentioned this problem several days ago and thought the bug might have been fixed in the most recent version, but it seems not.
  3. The movement of the wall depends on the displacement of the top block. If I instead assign a simple mathematical function, such as a ramp, this error message does not appear (a minimal example follows this list).
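
For illustration, this kind of variable does not trigger the error (the ramp endpoints here are arbitrary values chosen only for the example):

variable dx equal ramp(0.0,-10.0) ##### simple ramp instead of xcm()
region hileftwall block INF -55.0 INF INF 53.0 INF units box side out move NULL NULL v_dx
region hirightwall block 55.0 INF INF INF 53.0 INF units box side out move NULL NULL v_dx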

Any suggestions and advice are appreciated.

Please try to build a small test case (it does not have to be physically meaningful) that reproduces the issue and post it here.

Thanks,

Axel

Hi, Axel,

Thank you for your reply!

I have simplified my simulation to this input file:

# Begin

# 3d ideal wall move with semi-spherical-tip

units real
dimension 3
boundary p p p
atom_style atomic

# Set Up Simulation Box

region box block -500.0 500.0 -500.0 500.0 -500.0 500.0 units box
create_box 1 box

# Tip Section

lattice fcc 5.4051 orient x 1 -1 0 orient y 1 1 -2 orient z 1 1 1

# define regions

region sphere sphere 0.0 0.0 100.0 50.0 units box
region bottom_box block INF INF INF INF INF 100.0 units box
region fixed block INF INF INF INF 80.0 100.0 units box
region mobile block INF INF INF INF INF 80.0 units box

region tip intersect 2 sphere bottom_box
region tip-fixed intersect 2 sphere fixed
region tip-mobile intersect 2 sphere mobile

create_atoms 1 region tip

# define groups

group tip region tip
group tip-mobile region tip-mobile
group tip-fixed region tip-fixed

# Mass

mass 1 40.0

# pair potentials

pair_style eam/alloy

pair_coeff * * Au-Grochola-JCP05.eam.alloy Au

pair_style lj/cut 8.5125
pair_coeff 1 1 0.23805 3.405 8.5125

# temp controllers

compute new1 tip-mobile temp

# define variables

variable tip_fixed_xx equal xcm(tip-fixed,x)
variable tip_fixed_xy equal xcm(tip-fixed,y)
variable tip_fixed_xz equal xcm(tip-fixed,z)

variable dx equal xcm(tip-fixed,z)-90.707265 ##### displacement of top

# define walls

region leftwall block INF -50.0 INF INF 0.0 INF units box side out move NULL NULL v_dx
region rightwall block 50.0 INF INF INF 0.0 INF units box side out move NULL NULL v_dx

timestep 1.0

thermo_style custom step temp pe ke etotal press vol v_tip_fixed_xx v_tip_fixed_xy v_tip_fixed_xz v_dx

thermo 200
thermo_modify temp new1

dump 1 all xyz 200 tip.xyz

# Step 1: Normal movement

velocity tip-mobile create 300.0 482748 temp new1
velocity tip-fixed set 0.0 0.0 0.0 units box

fix 1 tip-mobile nvt temp 300.0 300.0 0.1

fix 2 tip-fixed rigid single torque 1 off off off

fix 3 tip-fixed move linear NULL NULL -0.001

fix 4 tip-mobile wall/region leftwall lj93 0.0005 6.0 10.0
fix 5 tip-mobile wall/region rightwall lj93 0.0005 6.0 10.0

run 50000
unfix 1
unfix 3
unfix 4
unfix 5

# END

I still get the error message:

[node2:24089] *** An error occurred in MPI_Allreduce
[node2:24089] *** on communicator MPI_COMM_WORLD
[node2:24089] *** MPI_ERR_TRUNCATE: message truncated
[node2:24089] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

thanks for the input. i can reproduce the behavior and confirm that
this is definitely a bug in LAMMPS. unfortunately, it is a bug that is
very difficult to track down (at least for me, currently). all i can
find out is that there is an MPI collective operation that is only
executed on part of the processors, and that corrupts the
MPI_Allreduce() later on.

hopefully steve or somebody else can have a look and sort it out.

sorry,

    axel.

in addition, there is one problem with your input: you apply fix rigid
and fix move to the same group of atoms. both fixes perform time
integration, and this usually results in all kinds of problems. if you
want your rigid body to move at constant velocity, you have to use the
velocity command with the add flag to add the move velocity in the z
direction, and then use fix setforce NULL NULL 0.0 to zero out the
forces on those atoms. this will have the same effect, but only
time-integrate the positions once.
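
for example, something along these lines. this is only a sketch: it assumes
the "add flag" above is the velocity command's sum keyword, keeps fix rigid
as the only time integrator for tip-fixed, and reuses the -0.001 value from
the original fix move line:

fix 2 tip-fixed rigid single torque 1 off off off           # only time integrator for tip-fixed
velocity tip-fixed set NULL NULL -0.001 sum yes units box   # add the constant downward velocity once
fix 3 tip-fixed setforce NULL NULL 0.0                      # zero the z forces so the velocity is not changed
# (and no "fix move" on tip-fixed anymore)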

axel

The problem with your script is that you're doing something more
complicated than we anticipated with dynamic regions and variables.

You are using fix wall/region, which needs to check whether each atom
is inside or outside the region. You've made the region dynamic with a
variable, so that the current extent of the region depends on the
center-of-mass of some group of atoms. Evaluating that variable
requires the xcm() function, which requires a global operation
(Allreduce) to compute the center of mass of the group.

So each time a single atom tries to detect whether it is in the
region, a global operation is invoked, which requires all procs to
participate. That isn't going to work in parallel. Either LAMMPS needs
to be smart enough to detect this case and issue an error, or we need
to figure out a different way to perform the calculation, like
pre-computing the xcm() and not triggering it with the
single-particle region logic.

I need to think about this a bit …

Steve

This issue should now be fixed, as of the 2May patch,
and your use of variables should now be allowed.

Steve