Unexpected behavior of LAMMPS

Hello,

I have installed the latest version of LAMMPS on an HPC cluster with Slurm. The test job works fine. However, subsequent jobs have shown unexpected behavior. When Slurm allocates several jobs to the same compute node, the first job to finish (normally or not) causes the others to crash. I know that a job that terminates abnormally could crash the others through the MPI termination process. However, even jobs running a serial program show this behavior. What is happening? How can I fix this? Could anyone help me?

Thanks,

Alexander.

Hi @amartins,

This is more of an MPI or Slurm installation/usage issue than a LAMMPS one. Unless you installed the cluster yourself, the people most likely to help are the IT desk managing the cluster. You can ask them how the way you set up your jobs conflicts with the MPI installation or Slurm configuration.

What is the error message?

Impossible to say without enough detail to reproduce exactly what you are seeing. Without that kind of information, one can only guess. There could be issues due to the MPI library creating the same temporary files for multiple jobs, or your jobs creating, deleting, or corrupting files because they use the exact same working directory at runtime, or something completely different.
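
One way to rule out the shared-working-directory guess is to give each job its own directory. The following is only a minimal Python sketch, assuming the job runs under Slurm, which sets SLURM_JOB_ID in the job environment. The helper name make_job_workdir, the "job_" prefix, and the base path are hypothetical choices, not something LAMMPS or Slurm require.

```python
import os
import tempfile
from pathlib import Path


def make_job_workdir(base, env=None):
    """Create and return a working directory unique to this job.

    Uses SLURM_JOB_ID when present; otherwise falls back to the
    process ID so that the sketch also works outside of Slurm.
    """
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID", f"pid{os.getpid()}")
    workdir = Path(base) / f"job_{job_id}"
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


if __name__ == "__main__":
    # Hypothetical base location; a real cluster would use its scratch file system.
    base = tempfile.gettempdir()
    workdir = make_job_workdir(base)
    os.chdir(workdir)  # run LAMMPS from here so its output files cannot collide
    print(f"working directory: {workdir}")
```

If the crashes persist even when every job runs in its own directory, colliding output files are probably not the cause.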

Hi, Germain.

This behavior happens only with LAMMPS jobs. I installed this HPC cluster myself and have searched slurm.conf for any irregular option. Also, I have compiled with gcc/OpenMPI/MKL.

Hi,

The jobs are exiting with no error message.

That is impossible with LAMMPS. You may be looking in the wrong place or the error message may be stuck in a buffer and thus you may need to add the -nb switch to the LAMMPS command line to turn off buffering completely for output to the screen.
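
To check where a message might be getting lost, one could launch LAMMPS with unbuffered output and record its exit status. A rough Python sketch follows. The executable name lmp, the input file in.lj, and the helper names are assumptions for illustration; -in, -log, and -nb are LAMMPS command-line switches.

```python
import subprocess
import sys


def build_lammps_cmd(input_file, log_file, exe="lmp"):
    """Assemble a LAMMPS command line with screen buffering turned off."""
    return [exe, "-in", input_file, "-log", log_file, "-nb"]


def run_and_report(cmd):
    """Run a command, echo its combined output, and return the exit code.

    On POSIX systems a negative return code -N means the process was
    killed by signal N, which no program-level error message can report.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    sys.stdout.write(result.stdout)
    if result.returncode < 0:
        print(f"terminated by signal {-result.returncode}")
    else:
        print(f"exit code {result.returncode}")
    return result.returncode


if __name__ == "__main__":
    # Assumed names: adjust the executable and input file to your installation.
    cmd = build_lammps_cmd("in.lj", "log.lammps")
    print(" ".join(cmd))
```

A negative return value would suggest that the process was killed from outside rather than exiting on its own, which would be consistent with no message appearing in the LAMMPS output.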

Please note that correlation is not proof of causation.

We need to know the exact details of how to reproduce the behavior you are seeing in order to debug this and provide either a bugfix or point out the incorrect configuration or similar.

[Update] It would be particularly useful if you could reproduce the crashes with input decks from the LAMMPS distribution, e.g. the inputs in the “bench” folder.
