Using fix balance rcb with hybrid charged ellipsoids and periodic boundaries causes MPI error/deadlock

Hello,

When I run the attached script and data file (mpirun -np 4 ./lmp_mpi -in input.txt) with more than 4 MPI tasks, the simulation hangs after a few thousand timesteps.

I am using LAMMPS (12 Dec 2018) compiled with Intel 2017.4, with the ASPHERE and MOLECULE packages.

If I change the boundary conditions from “p p p” to “m m m”, it does not hang. Similarly, if I remove the line “fix bl all balance 100 1.0 rcb out tmp.balance”, it does not hang.

Running under Valgrind and pressing Ctrl-C after it hangs produces the following output:

10000 294.35176 0.56497313 0.56497313 42.992928 966.38941
10100 299.75498 0.54629798 0.54629798 43.78212 938.35873
10200 277.26356 1.0242611 1.0242611 40.49703 1044.7132
10300 290.17739 0.87087369 0.87087369 42.38322 953.16209
10400 276.88233 0.92092241 0.92092241 40.441347 968.89862
^C[[email protected]…8349…] Sending Ctrl-C to processes as requested
[[email protected]…8349…] Press Ctrl-C again to force abort
==160652==
==160652== Process terminating with default action of signal 2 (SIGINT)
==160651==
==160651== Process terminating with default action of signal 2 (SIGINT)
==160653==
==160653== Process terminating with default action of signal 2 (SIGINT)
==160654==
==160654== Process terminating with default action of signal 2 (SIGINT)
==160651== at 0x70519D5: sched_yield (in /usr/lib64/libc-2.17.so)
==160652== at 0x554DCF0: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1228)
==160654== at 0x70519D7: sched_yield (in /usr/lib64/libc-2.17.so)
==160651== by 0x554E2CE: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1281)
==160652== by 0x554DCF0: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160653== at 0x554DCF0: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1228)
==160653== by 0x554DCF0: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160651== by 0x554E2CE: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160654== by 0x554E2CE: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1281)
==160651== by 0x58ED193: PMPI_Send (send.c:166)
==160652== by 0x598582D: PMPI_Waitany (waitany.c:223)
==160654== by 0x554E2CE: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160653== by 0x598582D: PMPI_Waitany (waitany.c:223)
==160654== by 0x598582D: PMPI_Waitany (waitany.c:223)
==160651== by 0x8C10BC: LAMMPS_NS::CommTiled::forward_comm(int) (comm_tiled.cpp:518)
==160652== by 0x8C131F: LAMMPS_NS::CommTiled::forward_comm(int) (comm_tiled.cpp:529)
==160653== by 0x8C1F0D: LAMMPS_NS::CommTiled::reverse_comm() (comm_tiled.cpp:609)
==160651== by 0x76F72D: LAMMPS_NS::Verlet::run(int) (verlet.cpp:262)
==160652== by 0x76F72D: LAMMPS_NS::Verlet::run(int) (verlet.cpp:262)
==160653== by 0x76FEFC: LAMMPS_NS::Verlet::run(int) (verlet.cpp:335)
==160654== by 0x8C1F0D: LAMMPS_NS::CommTiled::reverse_comm() (comm_tiled.cpp:609)
==160651== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160652== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160653== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160654== by 0x76FEFC: LAMMPS_NS::Verlet::run(int) (verlet.cpp:335)
==160651== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160651== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160652== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160653== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160651== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160652== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160652== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160654== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160651== by 0x6FCC45: main (main.cpp:64)
==160652== by 0x6FCC45: main (main.cpp:64)
==160653== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160653== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160653== by 0x6FCC45: main (main.cpp:64)
==160654== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160654== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160654== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160654== by 0x6FCC45: main (main.cpp:64)

This seems to indicate that the processes end up stuck in the MPI_Waitany calls in comm_tiled.cpp (one of the four is in MPI_Send instead). I do not yet know enough about MPI and the LAMMPS implementation to tell whether this is actually the problem.
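
As a general illustration (not a diagnosis of this case): a blocking MPI_Waitany can wait forever when a rank has posted a receive that no other rank ever sends to. The C++ sketch below checks a set of per-rank communication plans for that kind of mismatch without using MPI. CommPlan and find_mismatches are hypothetical names made up for this illustration and do not exist in LAMMPS.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical description of one rank's communication plan:
// the ranks it will send to and the ranks it expects to receive from.
struct CommPlan {
    std::vector<int> sendto;
    std::vector<int> recvfrom;
};

// Return every (source, destination) pair whose number of planned sends
// differs from the number of posted receives. An empty result means the
// plans are consistent; any entry is a candidate for a hang.
std::vector<std::pair<int, int>> find_mismatches(const std::vector<CommPlan>& plans) {
    const int nranks = static_cast<int>(plans.size());
    std::map<std::pair<int, int>, int> balance;  // sends minus receives
    for (int me = 0; me < nranks; ++me) {
        for (int dst : plans[me].sendto) {
            assert(dst >= 0 && dst < nranks);  // plan must name a valid rank
            ++balance[{me, dst}];
        }
        for (int src : plans[me].recvfrom) {
            assert(src >= 0 && src < nranks);
            --balance[{src, me}];
        }
    }
    std::vector<std::pair<int, int>> bad;
    for (const auto& entry : balance)
        if (entry.second != 0) bad.push_back(entry.first);
    return bad;
}
```

If rank 0 expects a message from rank 1 but rank 1 plans no send to rank 0, find_mismatches reports the pair (1, 0), which is the receive that would never complete.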

Additionally at the start of the run Valgrind gives two memcpy errors, one related to fix balance:

Setting up Verlet run …
Unit style : real
Current step : 0
Time step : 40
Walltime left : 0:59:59.70
==160653== Source and destination overlap in memcpy(0xa1f4880, 0xa1f4880, 40)

==160653== at 0x4C2DFEC: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==160653== by 0x6C276F: LAMMPS_NS::RCB::compute(int, int, double**, double*, double*, double*) (rcb.cpp:540)
==160653== by 0x6CB6F1: LAMMPS_NS::Balance::bisection(int) (balance.cpp:679)
==160653== by 0x521DD6: LAMMPS_NS::FixBalance::rebalance() (fix_balance.cpp:273)
==160653== by 0x521888: LAMMPS_NS::FixBalance::setup_pre_exchange() (fix_balance.cpp:196)
==160653== by 0x57A646: LAMMPS_NS::Modify::setup_pre_exchange() (modify.cpp:319)
==160653== by 0x76E56F: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:110)
==160653== by 0x43295C: LAMMPS_NS::Run::command(int, char**) (run.cpp:178)
==160653== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160653== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160653== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160653== by 0x6FCC45: main (main.cpp:64)

The other is related to the hybrid ellipsoid atom style:

==160653== Source and destination overlap in memcpy(0x9fd75c0, 0x9fd75c0, 64)
==160653== at 0x4C2DFEC: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==160653== by 0x52DFED: LAMMPS_NS::AtomVecEllipsoid::copy_bonus(int, int) (atom_vec_ellipsoid.cpp:171)
==160653== by 0x522DC6: LAMMPS_NS::AtomVecEllipsoid::copy(int, int, int) (atom_vec_ellipsoid.cpp:148)
==160653== by 0x714E39: LAMMPS_NS::AtomVecHybrid::copy(int, int, int) (atom_vec_hybrid.cpp:191)
==160653== by 0x7F0D1D: LAMMPS_NS::Irregular::migrate_atoms(int, int, int*) (irregular.cpp:164)
==160653== by 0x521F7F: LAMMPS_NS::FixBalance::rebalance() (fix_balance.cpp:298)
==160653== by 0x521888: LAMMPS_NS::FixBalance::setup_pre_exchange() (fix_balance.cpp:196)
==160653== by 0x57A646: LAMMPS_NS::Modify::setup_pre_exchange() (modify.cpp:319)
==160653== by 0x76E56F: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:110)
==160653== by 0x43295C: LAMMPS_NS::Run::command(int, char**) (run.cpp:178)
==160653== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160653== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
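
Both warnings report the same address for source and destination, which is the pattern a self-copy produces (copying an element onto itself). The C standard leaves overlapping memcpy undefined, although an exact self-copy is usually harmless in practice, so these warnings may well be unrelated to the hang. Below is a minimal C++ sketch of the pattern and one way to guard against it. The Bonus struct and copy_element function are hypothetical illustrations, not the LAMMPS source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a block of per-particle data.
struct Bonus {
    double shape[3];
    double quat[4];
    int ilocal;
};

// Copy element i into slot j. When i == j the source and destination
// are the same address, which is what Valgrind flags for memcpy.
// Returning early skips the redundant call entirely.
void copy_element(std::vector<Bonus>& data, std::size_t i, std::size_t j) {
    assert(i < data.size() && j < data.size());
    if (i == j) return;  // self-copy: nothing to do
    std::memcpy(&data[j], &data[i], sizeof(Bonus));
}
```

An alternative with the same effect would be std::memmove, which is defined for overlapping regions; the early return simply avoids the work altogether.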

Can you please advise whether this simulation setup should work, or whether there is a known incompatibility between fix balance rcb, hybrid ellipsoids, and periodic boundary conditions?

Thank you,
Stephen Farr

data.txt (7.68 KB)

input.txt (972 Bytes)

Please do read the documentation properly, it is a basic question.

J.

So sorry I made a mistake, I was replying to another message.
My apologies.
J.

Stephen,

this is definitely a bug. the symptoms seem to change depending on the choice of compiler, MPI library, and compiler flags. it seems to come from inside the RCB load-balancing algorithm, where some data structures that keep track of what data to send to which MPI rank get corrupted.

as a workaround, i would suggest avoiding either fix balance or tiled communication. you are using a rather unusual combination of settings, so it is quite likely that this bug has simply gone unnoticed so far.

would you mind filing a bug report with the example input at the LAMMPS GitHub project: https://github.com/lammps/lammps/issues ?

thanks for providing a small and representative input deck,

axel.

Dear Axel,

Thank you for looking into this, I have submitted the bug report.

Stephen