NEB segfault

Hi all,

I am trying to run a NEB simulation with LAMMPS version 2 June 2020 and it gives me a segfault. I was running the same script with previous versions without problems. The error only occurs when starting from a set of restart files, one for each replica.

I get these error messages:

[cmime-14:07829] 30 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[cmime-14:07829] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cmime-14:7838] *** An error occurred in MPI_Waitall
[cmime-14:7838] *** reported by process [115474433,3]
[cmime-14:7838] *** on communicator MPI_COMM_WORLD
[cmime-14:7838] *** MPI_ERR_TRUNCATE: message truncated
[cmime-14:7838] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cmime-14:7838] *** and potentially your MPI job)

So something with MPI_Waitall seems not to be working right.

I am attaching my input script in case it helps.

Thanks,

Enrique

in.neb (3.01 KB)

enrique,

can you provide us with the compilation settings that you are using?

best by sending the output of:

echo info config | ./lmp_mpi

or equivalent.

does this issue happen with any other version of LAMMPS, e.g. the latest stable version (3 March 2020)?

thanks,
axel.

Hi Axel,

This is the linking stage:

mpicxx -g -O3 main.o -L…/…/lib/latte/liblink -fopenmp -L. -llammps_mpi …/…/lib/latte/filelink.o -llatte -lgfortran -llapack -lblas -fopenmp -L/home/enriquem/Codes/qmd-progress/install/lib -lprogress -L/home/enriquem/Codes/BML/install/lib -lbml_fortran -lbml -o …/lmp_mpi
/usr/bin/ld: warning: libgfortran.so.3, needed by //usr/lib/liblapack.so, may conflict with libgfortran.so.4
size …/lmp_mpi

And this is the whole output I get with echo info config (not sure it adds much):

you need to do:

export OMPI_MCA_btl="tcp,self"

so that LAMMPS doesn’t try to access the infiniband device.
then you should be able to get the proper output from “info config”.
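For reference, a minimal sketch of this workaround; the mpirun invocation is only illustrative (partition layout and input file name are assumptions, adjust them to your setup):

```shell
# Restrict Open MPI to the TCP and self (loopback) byte-transfer layers,
# so it does not try to probe the InfiniBand interfaces:
export OMPI_MCA_btl="tcp,self"

# verify the setting took effect:
echo "$OMPI_MCA_btl"

# then run LAMMPS as usual, e.g. for a hypothetical 4-replica NEB run:
# mpirun -np 4 ./lmp_mpi -partition 4x1 -in in.neb
```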

axel.

This is what I get now

LAMMPS (2 Jun 2020)

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info
Printed on Mon Jun 8 14:50:23 2020

LAMMPS version: 2 Jun 2020 / 20200602
Git info: unstable / patch_2Jun2020-3-g3cf36fb / 3cf36fb754df5c427ce97ccac23244a9b9f37ba8

OS information: Linux 4.15.0-74-generic on x86_64

sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit

Compiler: GNU C++ 7.3.0 with OpenMP not enabled
C++ standard: C++14

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_SMALLBIG

Installed packages:

KSPACE LATTE MANYBODY MC MISC MOLECULE REPLICA SNAP USER-MISC USER-PHONON
USER-REACTION USER-REAXC

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info

Total wall time: 0:00:00
[cmime1:13546] 1 more process has sent help message help-oob-ud.txt / create-cq-failed
[cmime1:13546] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cmime1:13546] 1 more process has sent help message help-oob-ud.txt / no-ports-usable

Enrique

thanks a lot. that is very helpful.

since you are using git, i have one more request before digging deeper: can you try out a pending patch, to confirm whether you are seeing a bug that we fixed just yesterday? please set up a test branch and pull in the pending changes:

# create a test branch from your current branch
git checkout -b test-patch

# pull the bugfix branch
git pull https://github.com/akohlmey/lammps.git collected-bugfixes-and-updates

# resolve any merge conflicts (unlikely) and commit

and then recompile and test with these changes included. if that resolves your problem, then we have already fixed the bug you are encountering; you just need to revert to your previous branch and wait until the next patch is released (hopefully tomorrow).

if not, this would be a new bug and then i would need some simplified/minimal test input, ideally without any modifications to LAMMPS, that would allow me to quickly reproduce and test on a small quad-core desktop machine. no guarantees how long it will take to sort this out. neb is notoriously difficult to debug and my access to capable resources is currently limited.

thanks,
axel.

Hi Axel,

Unfortunately the latest patch does not seem to fix the problem. I still see the same error:

[cmime1:46311] *** An error occurred in MPI_Waitall
[cmime1:46311] *** reported by process [199688193,2]
[cmime1:46311] *** on communicator MPI_COMM_WORLD
[cmime1:46311] *** MPI_ERR_TRUNCATE: message truncated
[cmime1:46311] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cmime1:46311] *** and potentially your MPI job)
[cmime1:46304] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

And this is what I get when I run info config

LAMMPS (2 Jun 2020)

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info
Printed on Tue Jun 9 07:47:18 2020

LAMMPS version: 2 Jun 2020 / 20200602
Git info: test-patch / patch_2Jun2020-112-g4aea618 / 4aea6186aa0f696b193fb23460df6730df5107c7

OS information: Linux 4.15.0-74-generic on x86_64

sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit

Compiler: GNU C++ 7.3.0 with OpenMP not enabled
C++ standard: C++14

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_SMALLBIG

Installed packages:

KSPACE LATTE MANYBODY MC MISC MOLECULE REPLICA SNAP USER-MISC USER-PHONON
USER-REACTION USER-REAXC

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info

Total wall time: 0:00:00

I’ll send you a simplified input script as soon as I can.

Thanks a lot!

Enrique

Hi Axel,

I haven’t been able to replicate the error on a different system, so I am thinking I might have been doing something stupid (I don’t know what yet). I’ll keep trying and let you know what the stupid thing was.

Thanks

Enrique


Given that you are struggling to construct a reproducer input using only the code bundled with LAMMPS, the most likely explanation is that something is wrong or inconsistent in the external code you are linking in, which is not part of LAMMPS.
The error message you see can happen when that code sends an MPI message that LAMMPS is not ready to receive, because LAMMPS is waiting for some other message. This may be triggered by some refactoring we have done recently, since some of it changed how potential files are parsed: the reading/parsing is now done only on MPI rank 0 and the parsed data is then communicated to the other ranks. It could be that this exposes a race condition or a missing MPI_Barrier() call.

Axel.

Thanks

Enrique


Enrique Martinez Saez
Theoretical Division, T-1
Los Alamos National Laboratory
Los Alamos, NM, 87544
Ph: 505 606 2149
Fax: 505 667 8021
enriquem@…795…


From: Martinez Saez, Enrique
Sent: Tuesday, June 9, 2020 8:34:22 AM
To: Axel Kohlmeyer
Cc: LAMMPS Users Mailing List
Subject: Re: [EXTERNAL] Re: [lammps-users] NEB segfault

[…]