LAMMPS usage on different clusters

Dear LAMMPS community,

I hope you are well. I would like to ask about how LAMMPS uses computational resources (CPUs) from cluster to cluster. I was running a set of simulations with the LAMMPS package on cluster 1, where the simulations were fast and efficient. Then I moved to cluster 2 and am running the same set of simulations, but LAMMPS is not able to use the full CPU resources allocated to it. For example, of the 8 cores allocated to the job, the code uses only 20% (or less) of each CPU while still consuming my CPU-hour allocation. I was wondering whether this issue has been observed with LAMMPS before and whether there is a way to address it properly. For example, if a similar case has been reported, does it depend on the cluster configuration and architecture?

Best,

Iman

Yes. Specifically the network.

Thank you for your reply. To what extent could that change the performance of the simulation? For example, can the network of the new cluster increase the simulation run time by a factor of 10? Also, I should mention that I have always used a single node for my jobs, so I assume there is no communication between nodes over the network.

To see how other parameters affect the job runtime, I changed OMP_NUM_THREADS=4 to other values from 1 to 16, in addition to changing the number of CPUs on the node. I was wondering if there are any other parameters related to LAMMPS performance that I should try in this regard?

Best,
Iman

That is difficult to say. It depends on the kind of simulation, how large the problem is, and how frequently communication is required. You would have to provide some tangible data, e.g. by running the examples from the “bench” folder blown up to a representative size.
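For example, a comparison run on each cluster could look something like the following sketch (the in.lj benchmark input accepts -var x/y/z replication factors, so you can grow it to a representative size; this assumes an MPI-enabled lmp executable on your PATH and that you start from the bench folder of the LAMMPS source tree):

# run the LJ benchmark replicated 4x in each direction (32000 -> ~2 million atoms)
cd bench
mpirun -np 8 lmp -in in.lj -var x 4 -var y 4 -var z 4

Comparing the “Loop time” and “Performance” lines from the resulting logs then gives directly comparable numbers for the two clusters.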

For a single node, the network performance is irrelevant. It is just the most common problem when people use clusters and get bad performance, and I suggested it because no tangible information about the two clusters was given.

There are several other possible causes:

  • there is no exclusive node access and there are users using more resources than they asked for and thus reducing the availability of CPU cores
  • one of the clusters has hyperthreading enabled and you are using the hyperthreads, which provide only a minimal speedup versus real CPU cores (see the quick checks sketched after this list)
  • there is CPU affinity set in one case and not the other and you don’t set OMP_NUM_THREADS=1
  • the cluster with the bad performance is badly managed and the node is not properly cleaned when a job has reached its walltime limit
  • the cluster with the bad performance is badly managed and allows users to log into the compute nodes and start calculations even if they have no job on that node
  • you are not using OpenMP correctly
  • your LAMMPS executable has not been compiled correctly for the available MPI library
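Some of these points can be checked quickly from a shell on the compute node inside your job; here is a rough sketch (assuming lscpu, numactl, and taskset are available on the node):

lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket"   # more than 1 thread per core means hyperthreading is enabled
numactl --show                                                          # which CPUs/NUMA nodes this shell is allowed to run on
taskset -cp $$                                                          # CPU affinity mask of the current shell
top -b -n 1 | head -n 20                                                # any stray processes left over from other users or jobs?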

To be able to tell what is happening, we need to know more details: how your LAMMPS version(s) were compiled (e.g. the output of lmp -h) and which command lines you are using.
For your reference, here are some numbers for using 1, 2, and 4 CPU cores on an Intel NUC machine with a 4-core CPU and the “rhodo” benchmark input.

1 MPI 1 OpenMP: mpirun -np 1 ../build/lmp -in in.rhodo

Loop time of 26.6078 on 1 procs for 100 steps with 32000 atoms
Performance: 0.649 ns/day, 36.955 hours/ns, 3.758 timesteps/s, 120.266 katom-step/s
99.7% CPU use with 1 MPI tasks x 1 OpenMP threads

2 MPI 1 OpenMP: mpirun -np 2 ../build/lmp -in in.rhodo

Loop time of 13.4511 on 2 procs for 100 steps with 32000 atoms
Performance: 1.285 ns/day, 18.682 hours/ns, 7.434 timesteps/s, 237.899 katom-step/s
99.8% CPU use with 2 MPI tasks x 1 OpenMP threads

4 MPI 1 OpenMP: mpirun -np 4 ../build/lmp -in in.rhodo

Loop time of 7.59636 on 4 procs for 100 steps with 32000 atoms
Performance: 2.275 ns/day, 10.550 hours/ns, 13.164 timesteps/s, 421.255 katom-step/s
99.6% CPU use with 4 MPI tasks x 1 OpenMP threads

1 MPI 1 OpenMP: OMP_NUM_THREADS=1 mpirun -np 1 ../build/lmp -in in.rhodo -sf omp

Loop time of 23.744 on 1 procs for 100 steps with 32000 atoms
Performance: 0.728 ns/day, 32.978 hours/ns, 4.212 timesteps/s, 134.771 katom-step/s
99.8% CPU use with 1 MPI tasks x 1 OpenMP threads

1 MPI 2 OpenMP: OMP_NUM_THREADS=2 mpirun -np 1 ../build/lmp -in in.rhodo -sf omp

Loop time of 12.684 on 2 procs for 100 steps with 32000 atoms
Performance: 1.362 ns/day, 17.617 hours/ns, 7.884 timesteps/s, 252.287 katom-step/s
199.7% CPU use with 1 MPI tasks x 2 OpenMP threads

1 MPI 4 OpenMP: OMP_NUM_THREADS=4 mpirun -np 1 ../build/lmp -in in.rhodo -sf omp

Loop time of 7.01865 on 4 procs for 100 steps with 32000 atoms
Performance: 2.462 ns/day, 9.748 hours/ns, 14.248 timesteps/s, 455.928 katom-step/s
398.9% CPU use with 1 MPI tasks x 4 OpenMP threads

2 MPI 2 OpenMP: OMP_NUM_THREADS=2 mpirun -np 2 ../build/lmp -in in.rhodo -sf omp

Loop time of 6.56561 on 4 procs for 100 steps with 32000 atoms
Performance: 2.632 ns/day, 9.119 hours/ns, 15.231 timesteps/s, 487.388 katom-step/s
199.5% CPU use with 2 MPI tasks x 2 OpenMP threads

with hyper-threading:

2 MPI 4 OpenMP: OMP_NUM_THREADS=4 mpirun -np 2 ../build/lmp -in in.rhodo -sf omp

Loop time of 5.67482 on 8 procs for 100 steps with 32000 atoms
Performance: 3.045 ns/day, 7.882 hours/ns, 17.622 timesteps/s, 563.895 katom-step/s
396.9% CPU use with 2 MPI tasks x 4 OpenMP threads

4 MPI 2 OpenMP: OMP_NUM_THREADS=2 mpirun -np 4 ../build/lmp -in in.rhodo -sf omp

Loop time of 5.39629 on 8 procs for 100 steps with 32000 atoms
Performance: 3.202 ns/day, 7.495 hours/ns, 18.531 timesteps/s, 593.000 katom-step/s
198.4% CPU use with 4 MPI tasks x 2 OpenMP threads

Dear Axel, Thank you very much for your detailed explanation. Here is the output of lmp -h:

OS: Linux “Rocky Linux 8.9 (Green Obsidian)” 5.15.133.1.fi x86_64

Compiler: GNU C++ 8.5.0 20210514 (Red Hat 8.5.0-18) with OpenMP 4.5
C++ standard: C++11
MPI v1.0: LAMMPS MPI STUBS for LAMMPS version 2 Aug 2023

Accelerator configuration:

OPENMP package API: OpenMP
OPENMP package precision: double
OpenMP standard: OpenMP 4.5

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit

Available compression formats:

Extension: .gz Command: gzip
Extension: .bz2 Command: bzip2
Extension: .zst Command: zstd
Extension: .xz Command: xz
Extension: .lzma Command: xz
Extension: .lz4 Command: lz4

Installed packages:

AMOEBA ASPHERE BOCS BODY BPM BROWNIAN CG-DNA CG-SPICA CLASS2 COLLOID COLVARS
CORESHELL DIELECTRIC DIFFRACTION DIPOLE DPD-BASIC DPD-MESO DPD-REACT
DPD-SMOOTH DRUDE EFF EXTRA-COMPUTE EXTRA-DUMP EXTRA-FIX EXTRA-MOLECULE
EXTRA-PAIR FEP GRANULAR INTERLAYER KSPACE MANYBODY MC MEAM MESONT MISC ML-IAP
ML-POD ML-SNAP MOFFF MOLECULE OPENMP OPT ORIENT PERI PHONON PLUGIN POEMS QEQ
REACTION REAXFF REPLICA RIGID SHOCK SPH SPIN SRD TALLY UEF YAFF


And here is my job script:
#!/usr/bin/env bash

#SBATCH -N1 --exclusive

#SBATCH --partition=ccq

#SBATCH --constraint=skylake

#SBATCH -t 0-01:00:00

module -q purge
module load python3
module load openmpi
source /mnt/home/iahmadabadi/ccq-software-build/CavMD/cavity-md-ipi/i-pi-master-py3/env.sh

ulimit -s unlimited
export OMP_NUM_THREADS=2

i-pi input_traj.xml.bak &
sleep 20
mpirun -np 8 lmp -in in-equ.lmp

The system configuration, as far as I am aware, is:

  • 300 40-core Intel Skylake nodes with 768GB of RAM interconnected with OmniPath

The system consists of around 700 atoms in a typical NVE simulation at room temperature. Please let me know if there is any further information I should provide.

Best,
Iman

Here is the crucial piece of information for your case: this LAMMPS executable is not compiled with MPI support (note the “LAMMPS MPI STUBS” line in your lmp -h output). If you compiled from source, you may not have loaded the openmpi module before configuring/compiling.

With this you will be running the same simulation 8 times, each with just 1 MPI process. Check the log file; it should confirm that.
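A quick way to verify this is to look for the MPI library line in the lmp -h output and the processor grid line in the log file (a sketch; adjust the log file name to whatever your run writes):

lmp -h | grep "MPI v"                  # "LAMMPS MPI STUBS" means the binary has no real MPI support
grep "MPI processor grid" log.lammps   # a serial run reports a "1 by 1 by 1 MPI processor grid"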

Also, setting OMP_NUM_THREADS=2 as in your job script is a waste of time if you are not also using styles from the OPENMP package.

So to use OpenMP multi-threading (assuming that your input uses pair and other styles that are supported by the OPENMP package) you would have to change your command line to:

mpirun -np 8 lmp -in in-equ.lmp -sf omp
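As an alternative to (or in addition to) exporting OMP_NUM_THREADS, the number of threads per MPI task can also be set directly on the command line via the -pk switch; a sketch (this assumes the OPENMP package is installed, which your lmp -h output confirms):

mpirun -np 8 lmp -in in-equ.lmp -sf omp -pk omp 2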

One complication for your run is that you are using i-PI, and that may turn your simulation into a multi-partition run. I have no experience with what the requirements for i-PI are. But without i-PI you should not use “mpirun”. If using mpirun with a serial LAMMPS executable is correct for i-PI, then you should at least do:

export OMP_NUM_THREADS=5
i-pi input_traj.xml.bak &
sleep 20
mpirun -np 8 lmp -in in-equ.lmp -sf omp

A 700 atom system is tiny and thus will not have much potential to be parallelized unless you are using a very “expensive” potential like ReaxFF, AIREBO, or one of the machine learning potentials.
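As a rough point of reference: 700 atoms spread over 8 cores is fewer than 100 atoms per core, whereas for simple pairwise potentials the parallel efficiency of LAMMPS typically starts to drop noticeably once you go below a few hundred atoms per core, so 1 or 2 cores may well be all this system can use effectively.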

Dear Axel, Thank you very much. My goal is to use the smallest number of CPUs required for this system, so that I consume the minimum amount of CPU resources. Yes, you are right about the MPI; I was just doing some tests with and without it for different runs. The intermolecular interaction is just a typical TIP4P model and nothing special in terms of the expensive force fields you mentioned. So what would be your suggestions for this type of run?

Best,
Iman

I already showed you how you can try out different options to get the best performance. I assume you know enough calculus to compute the most effective resource usage from that.
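For example, from the rhodo numbers above: with 2 MPI ranks the parallel efficiency is about 26.61 / (2 × 13.45) ≈ 0.99, while with 4 ranks it is about 26.61 / (4 × 7.60) ≈ 0.88, so on that machine 2 ranks made the most efficient use of CPU hours while 4 ranks still gave the shortest wall time.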

Beyond that you can study the corresponding section of the LAMMPS manual: “7. Accelerate performance”.
There is no simple “do this, not that” answer; benchmarking is required, and the best options are different for each machine, model, and environment.
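If it helps, a simple benchmarking loop over thread counts with your current serial (OPENMP-enabled) executable could look like this sketch (it assumes an input that runs standalone, i.e. without the i-PI server; the log file names are just examples):

for t in 1 2 4 8; do
    OMP_NUM_THREADS=$t lmp -in in-equ.lmp -sf omp -log log.omp.$t
    grep "Loop time" log.omp.$t
done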

Yes, of course I have been benchmarking. I was just assuming that, since you asked me about the force field type, there might be a new piece of information that would be useful. Thank you again for your consideration.