CPU usage with LAMMPS

I have just started using LAMMPS with MPI, but I have been unable to get the CPU use anywhere near 100% or higher. Any ideas on what could cause this would be greatly appreciated. The command-line output can be seen below:

Performance: 8.025 ns/day, 2.991 hours/ns, 18.577 timesteps/s, 74.308 katom-step/s
56.6% CPU use with 12 MPI tasks x 1 OpenMP threads

There is not enough information here to make any kind of meaningful assessment.

**Would you mind sharing these details?**

  • Your operating system (Windows/Linux)
  • Computer specifications (core count and logical processors)
  • The exact LAMMPS command you used (starting with 'mpi')

I am on Windows, with a 12900HK (14 cores and 20 logical processors), although I used 8 cores for the process. My command was "mpiexec -np 8 lmp -in <infile.lammps"

What LAMMPS version, what pair style, etc.?

The pair style is:
#pair_style kim LennardJones612_UniversalShifted__MO_959249795837_003
The LAMMPS version is from April 2, 2025, and is the LAMMPS-64bit-latest-MSMPI.exe download.

This is just a standard LJ potential, but it uses the OpenKIM implementation.

Please try running the in.lj benchmark input from the bench folder for comparison, but run it with:

mpiexec -n 1 lmp -in in.lj.lmp -v x 2 -v y 2 -v z 2

and then vary the number of processes in the sequence 1 2 4 8 16.
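For example, the whole sequence could be run with commands like these (a sketch only; I am assuming the benchmark file keeps the in.lj.lmp name used above, which may differ in your installation):

mpiexec -n 1 lmp -in in.lj.lmp -v x 2 -v y 2 -v z 2
mpiexec -n 2 lmp -in in.lj.lmp -v x 2 -v y 2 -v z 2
mpiexec -n 4 lmp -in in.lj.lmp -v x 2 -v y 2 -v z 2
mpiexec -n 8 lmp -in in.lj.lmp -v x 2 -v y 2 -v z 2
mpiexec -n 16 lmp -in in.lj.lmp -v x 2 -v y 2 -v z 2

Then compare the "Loop time" and "% CPU use" lines printed at the end of each run.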
On my desktop running Linux with an AMD Ryzen 7 7840HS CPU (8 physical cores with hyperthreading) I get:

Loop time of 7.01462 on 1 procs for 100 steps with 256000 atoms

Performance: 6158.570 tau/day, 14.256 timesteps/s, 3.650 Matom-step/s
99.6% CPU use with 1 MPI tasks x 1 OpenMP threads
--
Loop time of 3.64519 on 2 procs for 100 steps with 256000 atoms

Performance: 11851.224 tau/day, 27.433 timesteps/s, 7.023 Matom-step/s
99.5% CPU use with 2 MPI tasks x 1 OpenMP threads
--
Loop time of 1.98606 on 4 procs for 100 steps with 256000 atoms

Performance: 21751.593 tau/day, 50.351 timesteps/s, 12.890 Matom-step/s
99.4% CPU use with 4 MPI tasks x 1 OpenMP threads
--
Loop time of 1.18628 on 8 procs for 100 steps with 256000 atoms

Performance: 36416.382 tau/day, 84.297 timesteps/s, 21.580 Matom-step/s
99.3% CPU use with 8 MPI tasks x 1 OpenMP threads
--
Loop time of 0.939385 on 16 procs for 100 steps with 256000 atoms

Performance: 45987.518 tau/day, 106.453 timesteps/s, 27.252 Matom-step/s
98.7% CPU use with 16 MPI tasks x 1 OpenMP threads

To get more than 100% CPU usage, you need to use multi-threading; in this case by setting the environment variable OMP_NUM_THREADS to 2 and adding the -sf omp flag (the lj/cut style in the lj benchmark is supported by the OPENMP package). With 1 to 8 MPI ranks this now looks like this:

Loop time of 3.30538 on 2 procs for 100 steps with 256000 atoms

Performance: 13069.608 tau/day, 30.254 timesteps/s, 7.745 Matom-step/s
199.0% CPU use with 1 MPI tasks x 2 OpenMP threads
--
Loop time of 1.79998 on 4 procs for 100 steps with 256000 atoms

Performance: 24000.205 tau/day, 55.556 timesteps/s, 14.222 Matom-step/s
198.8% CPU use with 2 MPI tasks x 2 OpenMP threads
--
Loop time of 1.10091 on 8 procs for 100 steps with 256000 atoms

Performance: 39240.313 tau/day, 90.834 timesteps/s, 23.254 Matom-step/s
198.3% CPU use with 4 MPI tasks x 2 OpenMP threads
--
Loop time of 1.04419 on 16 procs for 100 steps with 256000 atoms

Performance: 41371.734 tau/day, 95.768 timesteps/s, 24.517 Matom-step/s
186.5% CPU use with 8 MPI tasks x 2 OpenMP threads
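Commands for such threaded runs would look roughly like this (a sketch using the same benchmark input as above; on Linux the variable can be set on the command line, while on Windows with MS-MPI you would use "set" first). Alternatively, the thread count can be passed with the -pk omp 2 command-line switch instead of the environment variable:

OMP_NUM_THREADS=2 mpiexec -n 4 lmp -sf omp -in in.lj.lmp -v x 2 -v y 2 -v z 2

or, on Windows:

set OMP_NUM_THREADS=2
mpiexec -n 4 lmp -sf omp -in in.lj.lmp -v x 2 -v y 2 -v z 2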

Thus, since you are not using an OPENMP-supported pair style, you cannot reach more than 100% (within the accuracy of the CPU usage accounting). Roughly speaking, your 56.6% with 12 MPI tasks means that, averaged over the ranks, each one was doing useful work only a bit more than half of the wall time. If you get less than 100%, it may mean that you have a load imbalance (the printed information is from MPI rank 0 only) or that some other (system) processes are using the CPU (always possible on Windows).


Thank you for your help, I was able to get very similar results. However, with my own input file I am still struggling, getting only 30-60% CPU utilization with 2 OpenMP threads and 8 MPI tasks. I have copied it below.

# LAMMPS simulation for an incident argon atom on a liquid argon target
#1) Initialization
units metal
atom_style atomic
dimension 3
boundary p p p

#2) System Definition
lattice fcc 5.75 # FCC lattice at 5.75 (5.397) angstroms
region box block 0 50 0 50 0 50 # Create a region called "box" with block dimensions of 50 lattice units
create_box 1 box # type of atom
create_atoms 1 box

pair_style lj/cut 10.0 # LJ cutoff is 10 A
pair_coeff * * 0.0103 3.4 # sigma = 3.4 angstrom, epsilon = 0.0103 eV

#3) Simulation Settings
mass 1 39.948 # Liquid argon mass g/mol

velocity all create 120.0 12345 mom yes rot yes dist gaussian

neigh_modify one 10000 # Allow each atom (one) to have a maximum 10000 neighbours

#Setup file to dump the data to, format: name groupOfAtomsToBeDumped style stepsBetweenDumps filename attributes
dump equilibration all custom 1000 equilibration_phase.lammpstrj id type x y z vx vy vz
thermo_style custom step temp etotal press density vol
thermo 100 # Print thermo data every 100 steps

minimize 1.0e-4 1.0e-6 100 1000

fix nvt all nvt temp 85.0 85.0 0.1
timestep 0.001
run 10000
unfix nvt

#Use an NPT ensemble to equilibrate the liquid argon
#format: name groupToFix fixStyle temp Tstart Tstop Tdamp (plus iso Pstart Pstop Pdamp for pressure)
fix npt all npt temp 85.0 85.0 0.1 iso 1.0 1.0 1.0
timestep 0.001 # Timestep in ps
run 100000 # run the simulation for 100,000 timesteps to thermalize the liquid
unfix npt

fix nvt all nvt temp 85.0 85.0 0.1
timestep 0.001
run 1000
unfix nvt

run 100

You want to use 2*8=16 cores, but your processor only has 14. Hyper-threading does not really help in MD calculations and may even slow them down. Additionally, some of your processor's cores are "efficient" cores, i.e. slow ones. Mixing slow and fast cores, with HT on top of it, can lead to low CPU usage, because the fast cores have to wait for the "efficient" and HT ones.

Try using at most 6 cores (e.g. 6 MPI ranks without OpenMP, or 3 MPI ranks with 2 OpenMP threads each) and see if the CPU utilization improves.
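For example (a sketch; infile.lammps stands for your actual input file name):

mpiexec -np 6 lmp -in infile.lammps

or, with threads (the lj/cut pair style in your input does have an OPENMP variant):

set OMP_NUM_THREADS=2
mpiexec -np 3 lmp -sf omp -in infile.lammps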