Dear Support Team,
I’m running multiple concurrent LAMMPS simulations on the Intel nodes, and a few of the jobs get terminated with Signal 9 (KILLED) or SLURM Out-Of-Memory (OOMKilled) errors. I submit the jobs via SLURM with the following script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH -e err.%j
#SBATCH -o out.%j
#SBATCH -t 00:20:00
#SBATCH -n 16
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive
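# launch 16 independent single-rank LAMMPS runs in the background, then wait for all of them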
for mdc in {1..16}; do
mpirun -np 1 lmp -in md_velres_ms.in -var mdc $mdc -var i_ls 1 -var iter 1 > log_mdc_$mdc.lammps &
done
wait
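As a diagnostic, would it change anything if each run were launched through srun rather than a bare mpirun, so that SLURM accounts for every task as its own job step? Roughly along these lines (the --exact flag is my guess and may need adjusting for your SLURM version):

for mdc in {1..16}; do
srun --exact -n 1 --mem-per-cpu=2048 lmp -in md_velres_ms.in -var mdc $mdc -var i_ls 1 -var iter 1 > log_mdc_$mdc.lammps &
done
wait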
Each job uses one MPI task and typically stays within 2 GB of memory per core. Interestingly, the same script runs fine on the master node and on the AMD compute nodes of another cluster; the issue only arises when I submit a SLURM job to the Intel worker nodes. Could this be due to strictly enforced per-task memory limits on the Intel nodes, or to NUMA-related overhead or MPI memory spikes?
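If it is useful for diagnosis, I can pull the recorded memory high-water marks from SLURM accounting with something like the following (the job ID here is only a placeholder):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,MaxVMSize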
I’m using LAMMPS version 29 Aug 2024, compiled with:
- Intel compiler 2022.1
- Intel MPI 2022.1
- CMake 3.26.1
As a further test, I also tried launching each LAMMPS run with 32 MPI ranks spread across multiple Intel nodes (512 cores in total) and still encountered the same Signal 9 and out-of-memory errors in some runs.
Please let me know if additional information or log files would be helpful.
Best regards,
Saeed