[lammps-users] Latest version of LAMMPS not printing dump files and thermodynamic info when run in parallel mode

Dear LAMMPS users,

I am not sure this is the right place to ask, but I'm going to try anyway, since I've been fighting this issue for a long time and cannot work out where it comes from.

I'm running some simulations on an x86_64 GNU/Linux cluster with Dell PowerEdge R6525 nodes and AMD EPYC 7702 64-core processors.

The Linux kernel version is 4.15.0-163-generic, the GCC version is 7.5.0, and the MPI version is 2021.4.

With the LAMMPS stable version from 29 Oct 2020, everything was working fine: the simulations ran, and the dump files and the thermodynamic info were printed correctly.

Recently, I switched to the version of 29 Sep 2021: with the same script as before, the simulations still run (in the sense that I can see the execution time increasing), but the dump files and the thermodynamic info are not printed, unless I run the simulation on a single core.

I double-checked multiple times: the exact same script works as intended in parallel mode with the 2020 stable version, but not with the 2021 one.

In serial mode, everything seems fine on both versions.

Actually, even when running the 2021 version in parallel mode but asking for a single core (mpirun -np 1 lmp_mpi…), everything works fine.

Does anyone have an idea of what may be causing this?

If this issue can’t be due to LAMMPS itself, then my bad and sorry for posting here.

Thank you in advance!

All the best,

Valerio

You need to provide more details.

  • are you running interactively or through a batch system?
  • is there no output at all or only after the calculation has finished?
  • are you running on a networked file system?
  • does your old LAMMPS version and the current LAMMPS executable use the exact same MPI library (vendor and version)?
  • the MPI version number is not useful without knowing the vendor. what is the output of “./lmp -h” ?
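For reference, the MPI vendor and version a LAMMPS binary was built against can be read directly from its help output (a sketch; the executable name lmp_mpi is an assumption based on this thread):

```shell
# Print LAMMPS build info and pick out the MPI library line
# (adjust the executable name to your binary)
./lmp_mpi -h | grep "MPI v"

# Also check which MPI launcher is first in PATH; its distribution
# should match the library reported above
which mpirun
mpirun --version
```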

Hi Axel,

Thanks for the reply.

You need to provide more details.

  • are you running interactively or through a batch system?

Batch system (SLURM)
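For context, a minimal SLURM batch script for this kind of run might look like the following (module name, core count, and file names are all assumptions; actual names vary per cluster):

```shell
#!/bin/bash
#SBATCH --job-name=lammps-test
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=00:30:00

# Load the MPI module that matches the library the executable was
# built with (module name is hypothetical; check "module avail")
module load openmpi

# Launch LAMMPS in parallel; "in.test" is a placeholder input script
mpirun -np "$SLURM_NTASKS" ./lmp_mpi -in in.test
```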

  • is there no output at all or only after the calculation has finished?

Currently testing; actually, it seems the simulation does not terminate at all, despite the execution time increasing.

  • are you running on a networked file system?

Yes, although I don’t know the details (sorry, if you tell me what could be relevant I can look for the related info).

  • does your old LAMMPS version and the current LAMMPS executable use the exact same MPI library (vendor and version)?
  • the MPI version number is not useful without knowing the vendor. what is the output of “./lmp -h” ?

Seems like this could actually be the issue. These are the outputs:

-2020 version

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Oct 2020
[…]
OS: Linux 4.15.0-163-generic on x86_64
Compiler: GNU C++ 7.5.0 with OpenMP not enabled
C++ standard: C++14
MPI v3.1: Open MPI v3.0.0, package: Open MPI [email protected] Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017

-2021 version

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Sep 2021 - Update 2
[…]
OS: Linux “Ubuntu 18.04.6 LTS” 4.15.0-163-generic on x86_64
Compiler: GNU C++ 7.5.0 with OpenMP not enabled
C++ standard: C++11
MPI v3.1: Intel(R) MPI Library 2021.4 for Linux* OS

It looks like the 2021 and 2020 executables are using different MPI libraries?

In both cases I did a “basic” compilation (“make mpi” after installing some packages).

So maybe I can fix the issue by specifying the path to the MPI library?
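One way to see which MPI toolchain a plain "make mpi" build will pick up is to inspect the compiler wrapper found on the PATH (a sketch; the introspection flag differs by vendor):

```shell
# Which MPI C++ wrapper is first in PATH?
which mpicxx

# Show the underlying compiler and library paths the wrapper uses.
# Intel MPI and MPICH accept -show; Open MPI uses -showme.
mpicxx -show 2>/dev/null || mpicxx -showme
```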

Sorry if I ask basic questions, previously the “normal” installation has always worked fine for me.

Thank you,

Valerio

It would be important at this step to reduce the number of variables.
So please try to reproduce this behavior with one of the LAMMPS example inputs after uncommenting the first dump command, e.g. the “melt” one.
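Concretely, the suggested test could look like this (paths assume a standard LAMMPS source tree; adjust as needed):

```shell
# Go to the bundled "melt" example in the LAMMPS source tree
cd lammps/examples/melt

# Uncomment the dump command in the input script (it is commented
# out by default in in.melt)
sed -i 's/^#dump/dump/' in.melt

# Run on a few cores; this input finishes in seconds, so missing
# thermo output or dump files indicates a real problem, not buffering
mpirun -np 4 ./lmp_mpi -in in.melt
```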

[…]

  • are you running on a networked file system?

Yes, although I don’t know the details (sorry, if you tell me what could be relevant I can look for the related info).

networked file systems, especially parallel ones, do buffering for optimal performance, so output may not be available on the cluster login node even if it has been written.
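If file-system buffering is suspected, LAMMPS can be told to flush its output explicitly; as a sketch, lines like these would go in the input script (the dump ID "1" is a placeholder):

```
# Flush thermo output to the screen/log at every thermo step
thermo_modify flush yes

# Flush the dump file buffer every time a snapshot is written
# ("1" must match the ID used in the dump command)
dump_modify 1 flush yes
```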

  • does your old LAMMPS version and the current LAMMPS executable use the exact same MPI library (vendor and version)?
  • the MPI version number is not useful without knowing the vendor. what is the output of “./lmp -h” ?

Seems like this could actually be the issue. These are the outputs:

-2020 version

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Oct 2020
[…]
OS: Linux 4.15.0-163-generic on x86_64
Compiler: GNU C++ 7.5.0 with OpenMP not enabled
C++ standard: C++14
MPI v3.1: Open MPI v3.0.0, package: Open MPI [email protected] Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017

-2021 version

Large-scale Atomic/Molecular Massively Parallel Simulator - 29 Sep 2021 - Update 2
[…]
OS: Linux “Ubuntu 18.04.6 LTS” 4.15.0-163-generic on x86_64
Compiler: GNU C++ 7.5.0 with OpenMP not enabled
C++ standard: C++11
MPI v3.1: Intel(R) MPI Library 2021.4 for Linux* OS

It looks like the 2021 and 2020 executables are using different MPI libraries?

not just different versions but different vendors. Intel MPI is based on MPICH, which by default does more extensive buffering of console output than Open MPI.

In both cases I did a “basic” compilation (“make mpi” after installing some packages).

So maybe I can fix the issue by specifying the path to the MPI library?

Sorry if I ask basic questions, previously the “normal” installation has always worked fine for me.

There are multiple possible reasons for not getting output:

  • the difference in MPI vendor (but then the output should appear after the job has completed)
  • you are not loading the correct MPI library support module inside your batch script (and then you will be running N copies of a serial run, since you are using the wrong launcher for the job)
  • there is a difference in behavior for your specific application that was marginal but passed some critical point: it worked with the old version but fails and gets stuck with the new executable. This is particularly common when binary restart files generated by the old executable are read by the new executable. Those files are often compatible, but not always, and can then lead to corrupted data.

as suggested, this is best tested with a known-to-work and fast-to-run input.
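The second possibility above (a launcher/library mismatch) can be checked quickly from inside the batch script (a sketch; the executable name is an assumption):

```shell
# Which MPI shared library is the executable actually linked against?
ldd ./lmp_mpi | grep -i mpi

# Which launcher will be used, and from which MPI distribution?
# The two should come from the same vendor and version.
which mpirun
mpirun --version
```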

Thank you very much. I will check what you suggested and get back to you.

For the moment, what I can say is that, from the tests I've done, with the 2021 version the simulation does not seem to terminate at all, even for scripts that should finish in a matter of seconds, so buffering does not appear to be the issue.

Best regards,

Valerio

Dear Axel,

It looks like the problem was indeed the MPI library.

SLURM was set up to work with Open MPI, but a recent installation of Intel oneAPI had changed the default environment.
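For anyone hitting the same thing: which MPI stack the current environment points at can be checked before launching (the commands are standard, but the output is site-specific):

```shell
# MPI-related environment variables (Intel oneAPI setup scripts set
# I_MPI_* variables; Open MPI modules typically set OMPI_*/MPI_HOME)
env | grep -iE 'mpi|oneapi'

# MPI plugin types SLURM's srun supports on this cluster
srun --mpi=list
```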

Unfortunately, the cluster administrator hadn't informed me about this, so I had no idea it had happened.

So this was not a LAMMPS problem in the end. Thank you very much for your help anyway, it was very useful!

Best regards,

Valerio