LAMMPS Jobs Failing on Intel Nodes Due to Memory or Signal 9 Errors

Dear Support Team,

I’m running multiple concurrent LAMMPS simulations on the Intel nodes, and a few of the jobs get terminated with Signal 9 (KILLED) or Slurm out-of-memory (OOMKilled) errors. I submit the jobs via Slurm with the following script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH -e err.%j
#SBATCH -o out.%j
#SBATCH -t 00:20:00
#SBATCH -n 16
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

for mdc in {1..16}; do
    mpirun -np 1 lmp -in md_velres_ms.in -var mdc $mdc -var i_ls 1 -var iter 1 > log_mdc_${mdc}.lammps &
done

wait

Each job uses one MPI task and typically stays within 2 GB of memory per core. Interestingly, the same script runs fine on the master node and on the AMD compute nodes of another cluster; the issue only arises when I submit a Slurm job to the Intel worker nodes. Could this be due to strict per-task memory limits enforced on the Intel nodes, NUMA-related overhead, or memory spikes from the MPI library?
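
In case it helps with diagnosis, this is the kind of accounting data I can collect after a failed run (a sketch; <jobid> is a placeholder, and which fields are populated depends on the site’s Slurm accounting and cgroup configuration):

# per-step memory high-water marks and exit states
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,ReqMem,Elapsed

# whether the cluster enforces memory limits via cgroups
scontrol show config | grep -i -E 'TaskPlugin|JobAcctGather|Mem'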

I’m using LAMMPS version 29 Aug 2024, compiled with:

  • Intel compiler 2022.1
  • Intel MPI 2022.1
  • CMake 3.26.1

As a further test, I also tried launching each LAMMPS job with 32 MPI ranks spread across multiple Intel nodes (512 cores in total) and still encountered the same Signal 9 and out-of-memory errors in some runs.

Please let me know if additional information or log files would be helpful.

Best regards,
Saeed

This looks like an issue you will have to debug locally, with the help of your local HPC system staff.
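
One thing worth ruling out with them is whether each of the 16 background tasks is actually tracked as its own job step; launching through srun instead of mpirun makes the per-task memory accounting explicit. A minimal sketch of that variant of your loop (the --exact flag assumes Slurm 21.08 or newer):

for mdc in {1..16}; do
    srun -n 1 --exact --mem-per-cpu=2048M lmp -in md_velres_ms.in -var mdc $mdc -var i_ls 1 -var iter 1 > log_mdc_${mdc}.lammps &
done
wait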

We don’t see a significant performance benefit from the Intel compilers over GNU or Clang, with the exception of the styles in the INTEL package, so you may try compiling without the Intel compilers and Intel MPI.
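
A minimal sketch of such a build with the standard LAMMPS CMake procedure (the module names are assumptions; use whatever your cluster provides):

# module names are placeholders for your site's toolchain
module load gcc openmpi cmake
cd lammps                      # top of the LAMMPS source tree
mkdir build-gcc && cd build-gcc
cmake -D CMAKE_CXX_COMPILER=g++ -D BUILD_MPI=yes ../cmake
cmake --build . --parallel 16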

We do see, however, that some versions of the Intel compilers produce broken executables.
That is about all that can be said from remote. Please note that this forum is for discussing LAMMPS; what you are asking about concerns your cluster and the low-level software on it, not LAMMPS itself.

I compiled LAMMPS with gcc/13.1.0 and openmpi/4.1.6, and it works perfectly now. Thanks for the suggestion!

Best regards,
Saeed