weird MPI errors on Intel 2018 + Open MPI

Not sure if it is related, but we had some similar issues (especially with Intel compilers 2017 onwards, not sure why) on a cluster. They were not reproducible, meaning they appeared at different time steps during the simulation.

In the end it turned out that the gzip compression of the dump files during the LAMMPS run was causing the issue. Turning it off helped.

Regards
Wolfgang

This can be due to complications from mixing the malloc from glibc with the malloc in tbbmalloc, together with optimizations in the Intel compiler and in Open MPI for memory that is supposed to be "pinned" (Open MPI uses its own malloc wrapper) when using InfiniBand networking for MPI. Having a call to system() to launch the gzip process can, in some cases, mess up those "pinned" memory areas (by "unpinning" them). This is a tricky and complex situation, since there are competing memory-management optimizations on different sides, and the heuristics those optimizations rely on can become invalid. This is why the COMPRESS package was conceived: it avoids the pipeline to gzip and uses an additional library instead.
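For illustration, here is a sketch of the two approaches in a LAMMPS input script (the dump IDs, output interval, and field list are made up for the example): a plain dump style writing to a file name ending in .gz pipes the output through an external gzip process, while the COMPRESS package's */gz dump styles compress inside the LAMMPS process and spawn no external program.

```
# external gzip: LAMMPS pipes the dump through a spawned gzip process
# (the problematic path described above)
dump 1 all custom 1000 dump.*.gz id type x y z

# in-process compression via the COMPRESS package (requires LAMMPS
# to be built with the COMPRESS package enabled); no gzip process
dump 2 all custom/gz 1000 dump.*.gz id type x y z
```

With the second variant there is no system()/fork of an external process during the run, which avoids the interaction with pinned memory described above.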

Axel