Hello,
When running the attached script and data file (mpirun -np 4 ./lmp_mpi -in input.txt) with more than 4 MPI tasks, the simulation hangs after a few thousand timesteps.
I am using LAMMPS (12 Dec 2018) compiled with Intel 2017.4, with the ASPHERE and MOLECULE packages.
If I change the boundary conditions from “p p p” to “m m m”, it does not hang. Similarly, if I remove the line “fix bl all balance 100 1.0 rcb out tmp.balance”, it does not hang.
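For quick reference, the commands involved look roughly like the fragment below. This is only a sketch assembled from the lines quoted above, not a verbatim copy of input.txt. The comm_style line is an assumption: it is inferred from the rcb balance style and from the CommTiled frames in the traces further down.

```
# Sketch of the relevant commands only; not a verbatim copy of input.txt.
comm_style tiled     # assumed: needed by the rcb style of fix balance
boundary   p p p     # hangs; with "boundary m m m" the run does not hang
fix        bl all balance 100 1.0 rcb out tmp.balance   # removing this line also avoids the hang
```

Either change on its own (non-periodic boundaries, or no fix balance) is enough to avoid the hang.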
Running under Valgrind and pressing Ctrl-C after it hangs gives the following output (the traces from the four processes are interleaved):
10000 294.35176 0.56497313 0.56497313 42.992928 966.38941
10100 299.75498 0.54629798 0.54629798 43.78212 938.35873
10200 277.26356 1.0242611 1.0242611 40.49703 1044.7132
10300 290.17739 0.87087369 0.87087369 42.38322 953.16209
10400 276.88233 0.92092241 0.92092241 40.441347 968.89862
^C[mpiexec@…8349…] Sending Ctrl-C to processes as requested
[mpiexec@…8349…] Press Ctrl-C again to force abort
==160652==
==160652== Process terminating with default action of signal 2 (SIGINT)
==160651==
==160651== Process terminating with default action of signal 2 (SIGINT)
==160653==
==160653== Process terminating with default action of signal 2 (SIGINT)
==160654==
==160654== Process terminating with default action of signal 2 (SIGINT)
==160651== at 0x70519D5: sched_yield (in /usr/lib64/libc-2.17.so)
==160652== at 0x554DCF0: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1228)
==160654== at 0x70519D7: sched_yield (in /usr/lib64/libc-2.17.so)
==160651== by 0x554E2CE: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1281)
==160652== by 0x554DCF0: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160653== at 0x554DCF0: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1228)
==160653== by 0x554DCF0: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160651== by 0x554E2CE: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160654== by 0x554E2CE: MPID_nem_mpich_blocking_recv (mpid_nem_inline.h:1281)
==160651== by 0x58ED193: PMPI_Send (send.c:166)
==160652== by 0x598582D: PMPI_Waitany (waitany.c:223)
==160654== by 0x554E2CE: MPIDI_CH3I_Progress (ch3_progress.c:589)
==160653== by 0x598582D: PMPI_Waitany (waitany.c:223)
==160654== by 0x598582D: PMPI_Waitany (waitany.c:223)
==160651== by 0x8C10BC: LAMMPS_NS::CommTiled::forward_comm(int) (comm_tiled.cpp:518)
==160652== by 0x8C131F: LAMMPS_NS::CommTiled::forward_comm(int) (comm_tiled.cpp:529)
==160653== by 0x8C1F0D: LAMMPS_NS::CommTiled::reverse_comm() (comm_tiled.cpp:609)
==160651== by 0x76F72D: LAMMPS_NS::Verlet::run(int) (verlet.cpp:262)
==160652== by 0x76F72D: LAMMPS_NS::Verlet::run(int) (verlet.cpp:262)
==160653== by 0x76FEFC: LAMMPS_NS::Verlet::run(int) (verlet.cpp:335)
==160654== by 0x8C1F0D: LAMMPS_NS::CommTiled::reverse_comm() (comm_tiled.cpp:609)
==160651== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160652== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160653== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160654== by 0x76FEFC: LAMMPS_NS::Verlet::run(int) (verlet.cpp:335)
==160651== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160651== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160652== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160653== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160651== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160652== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160652== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160654== by 0x4329F0: LAMMPS_NS::Run::command(int, char**) (run.cpp:183)
==160651== by 0x6FCC45: main (main.cpp:64)
==160652== by 0x6FCC45: main (main.cpp:64)
==160653== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160653== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160653== by 0x6FCC45: main (main.cpp:64)
==160654== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160654== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160654== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160654== by 0x6FCC45: main (main.cpp:64)
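This seems to indicate that the processes end up stuck in the MPI_Send and MPI_Waitany calls in comm_tiled.cpp: two ranks are in CommTiled::forward_comm() (called from verlet.cpp:262) while the other two are in CommTiled::reverse_comm() (called from verlet.cpp:335). I do not yet know enough about MPI and the LAMMPS implementation to tell whether this is actually the problem.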
Additionally, at the start of the run, Valgrind reports two memcpy errors, one related to fix balance:
Setting up Verlet run …
Unit style : real
Current step : 0
Time step : 40
Walltime left : 0:59:59.70
==160653== Source and destination overlap in memcpy(0xa1f4880, 0xa1f4880, 40)
==160653== at 0x4C2DFEC: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==160653== by 0x6C276F: LAMMPS_NS::RCB::compute(int, int, double**, double*, double*, double*) (rcb.cpp:540)
==160653== by 0x6CB6F1: LAMMPS_NS::Balance::bisection(int) (balance.cpp:679)
==160653== by 0x521DD6: LAMMPS_NS::FixBalance::rebalance() (fix_balance.cpp:273)
==160653== by 0x521888: LAMMPS_NS::FixBalance::setup_pre_exchange() (fix_balance.cpp:196)
==160653== by 0x57A646: LAMMPS_NS::Modify::setup_pre_exchange() (modify.cpp:319)
==160653== by 0x76E56F: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:110)
==160653== by 0x43295C: LAMMPS_NS::Run::command(int, char**) (run.cpp:178)
==160653== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160653== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
==160653== by 0x49911D: LAMMPS_NS::Input::file() (input.cpp:243)
==160653== by 0x6FCC45: main (main.cpp:64)
The other is related to the hybrid ellipsoid atom style:
==160653== Source and destination overlap in memcpy(0x9fd75c0, 0x9fd75c0, 64)
==160653== at 0x4C2DFEC: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==160653== by 0x52DFED: LAMMPS_NS::AtomVecEllipsoid::copy_bonus(int, int) (atom_vec_ellipsoid.cpp:171)
==160653== by 0x522DC6: LAMMPS_NS::AtomVecEllipsoid::copy(int, int, int) (atom_vec_ellipsoid.cpp:148)
==160653== by 0x714E39: LAMMPS_NS::AtomVecHybrid::copy(int, int, int) (atom_vec_hybrid.cpp:191)
==160653== by 0x7F0D1D: LAMMPS_NS::Irregular::migrate_atoms(int, int, int*) (irregular.cpp:164)
==160653== by 0x521F7F: LAMMPS_NS::FixBalance::rebalance() (fix_balance.cpp:298)
==160653== by 0x521888: LAMMPS_NS::FixBalance::setup_pre_exchange() (fix_balance.cpp:196)
==160653== by 0x57A646: LAMMPS_NS::Modify::setup_pre_exchange() (modify.cpp:319)
==160653== by 0x76E56F: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:110)
==160653== by 0x43295C: LAMMPS_NS::Run::command(int, char**) (run.cpp:178)
==160653== by 0x4A4B5C: void LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>(LAMMPS_NS::LAMMPS*, int, char**) (input.cpp:873)
==160653== by 0x49C60B: LAMMPS_NS::Input::execute_command() (input.cpp:856)
Can you please advise whether this simulation setup should work, or whether there is a known incompatibility between fix balance rcb, hybrid ellipsoids, and periodic boundary conditions?
Thank you,
Stephen Farr
data.txt (7.68 KB)
input.txt (972 Bytes)