Suboptimal LAMMPS performance on Threadripper CPU, why?

Hi,
I recently moved my LAMMPS computing hardware from an Intel platform to an AMD one. Taking graphene stretching as an example, I tested the efficiency of the two platforms. The main parameters are as follows:
Number of atoms: 10,000 to 320,000
Potential function: AIREBO
Parallel mode: MPI, with cores ranging from 8 to 32
LAMMPS version: LAMMPS-64bit-latest-MSMPI 2025-2-04

Moreover, when testing Intel, the LAMMPS code and result files were stored on a mechanical hard drive, while when testing AMD they are stored on a solid-state drive and hyperthreading (SMT) is disabled. Nonetheless, I found that AMD's computing efficiency did not improve significantly (<10%). What could be the possible reason, and how can I optimize?

Intel platform
CPU: two Xeon E5-2699 v4 (22 cores each, 2.2-3.6 GHz, 4 memory channels per socket)
Memory: DDR4-2400, 5 x 32 GB = 160 GB
Hard drive: ATA SSD 120G

AMD platform
CPU: Threadripper PRO 5975WX (32 cores, 3.6-4.5 GHz, 8 memory channels)
Memory: DDR4-3200, 8 x 32 GB = 256 GB
Hard drive: WD Blue SN5000 NVMe SSD 2T

What is the LAMMPS performance in atom-steps/second for each of the two machines? That is the metric that really matters.
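That metric is just (number of atoms) x (timesteps) / (wall-clock loop time); a minimal Python sketch, using the numbers from the sample log output quoted later in this thread:

```python
# atom-step/s = (number of atoms) * (timesteps) / (wall-clock loop time);
# the inputs below come from the sample log in this thread:
# 2004 atoms, 300 steps, 0.942801 s loop time
def atom_steps_per_second(natoms, nsteps, loop_time_s):
    return natoms * nsteps / loop_time_s

rate = atom_steps_per_second(2004, 300, 0.942801)
print(f"{rate / 1e3:.3f} katom-step/s")  # 637.674 katom-step/s, as in the log
```

Comparing this single number between the two machines, at the same atom count and MPI rank count, tells you directly which platform is faster.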

I don't see why hard drive speed would have any effect with only 320,000 atoms on 160-256 GB of RAM, unless you're writing a dump to disk every timestep.
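For reference, the output interval is the N argument of the dump command; a hypothetical line like the following (the dump ID, group, filename, and columns are placeholders) writes to disk only every 1000 steps instead of every step:

```
dump 1 all custom 1000 dump.graphene id type x y z
```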

Both the Intel and AMD CPUs support AVX2 but not AVX-512, so there is no difference there in SIMD vectorization.

Port pair_airebo.cpp to KOKKOS by replacing the for loops with parallel_for and KOKKOS_LAMBDA; look at src/KOKKOS for lots of examples. MPI domain decomposition is outer-loop scaling with much higher overhead, whereas inner-loop scaling is extremely efficient in KOKKOS within the L1/L2 caches.

Look at the statistics printed after a run; the percentages will tell you where the bottleneck(s) are, e.g.:


Loop time of 0.942801 on 4 procs for 300 steps with 2004 atoms

Performance: 54.985 ns/day, 0.436 hours/ns, 318.201 timesteps/s, 637.674 katom-step/s
195.2% CPU use with 2 MPI tasks x 2 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.61419    | 0.62872    | 0.64325    |   1.8 | 66.69
Bond    | 0.0028608  | 0.0028899  | 0.002919   |   0.1 |  0.31
Kspace  | 0.12652    | 0.14048    | 0.15444    |   3.7 | 14.90
Neigh   | 0.10242    | 0.10242    | 0.10242    |   0.0 | 10.86
Comm    | 0.026753   | 0.027593   | 0.028434   |   0.5 |  2.93
Output  | 0.00018341 | 0.00030942 | 0.00043542 |   0.0 |  0.03
Modify  | 0.039117   | 0.039348   | 0.039579   |   0.1 |  4.17
Other   |            | 0.001041   |            |       |  0.11

Nlocal:           1002 ave        1006 max         998 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:         8670.5 ave        8691 max        8650 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:         354010 ave      357257 max      350763 min
Histogram: 1 0 0 0 0 0 0 0 0 1

Total # of neighbors = 708020
Ave neighs/atom = 353.30339
Ave special neighs/atom = 2.3403194
Neighbor list builds = 26
Dangerous builds = 0

If you're scaling to too many MPI ranks for too few atoms, then the Comm % will blow up.
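To make that concrete, here is a minimal sketch (an assumed helper, not part of LAMMPS) that pulls the %total column out of the timing breakdown above and flags a Comm-heavy run; the 10% threshold is an arbitrary illustration, not an official recommendation:

```python
import re

# Parse the "MPI task timing breakdown" table from a LAMMPS log and
# return {section name: %total}. Relies on %total being the last column.
def section_percentages(log_text):
    pct = {}
    for line in log_text.splitlines():
        m = re.match(r"(Pair|Bond|Kspace|Neigh|Comm|Output|Modify|Other)"
                     r"\s*\|.*\|\s*([\d.]+)\s*$", line)
        if m:
            pct[m.group(1)] = float(m.group(2))
    return pct

# Two rows taken from the sample log in this thread
sample = """\
Pair    | 0.61419    | 0.62872    | 0.64325    |   1.8 | 66.69
Comm    | 0.026753   | 0.027593   | 0.028434   |   0.5 |  2.93
"""
pct = section_percentages(sample)
print(pct)                 # {'Pair': 66.69, 'Comm': 2.93}
print(pct["Comm"] > 10.0)  # False: this particular run is not Comm-bound
```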