Hi,
I recently upgraded the hardware I run LAMMPS on from an Intel to an AMD platform. Using graphene stretching as a test case, I then compared the efficiency of the two platforms. The main parameters are as follows:
Number of atoms: 10,000 to 320,000
Potential function: AIREBO
Parallel mode: MPI, with cores ranging from 8 to 32
LAMMPS version: LAMMPS-64bit-latest-MSMPI 2025-2-04
Moreover, in the Intel tests the LAMMPS code and result files were stored on a mechanical hard drive, whereas in the AMD tests they were stored on a solid-state drive and hyperthreading (SMT) was disabled. Nevertheless, I found that the AMD platform's computing efficiency did not improve significantly (<10%). What could be the reason, and how can I optimize?
Intel platform
CPU: two Xeon E5-2699 v4 (22 cores each, 2.2-3.6 GHz, up to 4 memory channels each)
Memory: DDR4-2400, 32 GB x 5 = 160 GB
Hard drive: ATA SSD, 120 GB
AMD platform
CPU: Threadripper 5975WX (32 cores, 3.6-4.5 GHz, up to 8 memory channels)
Memory: DDR4-3200, 32 GB x 8 = 256 GB
Hard drive: WD Blue SN5000 NVMe SSD, 2 TB
I don't see why hard drive speed would have any effect with only 320,000 atoms running on 160-256 GB of RAM, unless you are writing a dump to disk every timestep.
Both the Intel and AMD CPUs support AVX2 but not AVX-512, so there is no difference there in SIMD vectorization.
You could port pair_airebo.cpp to KOKKOS by replacing the for loops with parallel_for and KOKKOS_LAMBDA; look at src/KOKKOS for lots of examples. MPI domain decomposition is outer-loop scaling with much higher overhead, whereas inner-loop scaling is extremely efficient in KOKKOS when the work stays within L1/L2 cache. A minimal sketch of that loop transformation follows.
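For illustration only, here is a minimal sketch of replacing a serial per-atom loop with Kokkos::parallel_for and KOKKOS_LAMBDA. This is not actual pair_airebo.cpp code; the view name fx, the kernel label, and the atom count are made up.

#include <Kokkos_Core.hpp>

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nlocal = 10000;                  // hypothetical number of local atoms
    Kokkos::View<double*> fx("fx", nlocal);    // placeholder per-atom force component

    // Serial version: for (int i = 0; i < nlocal; ++i) fx(i) = ...;
    // KOKKOS version: the loop body becomes a lambda executed in parallel.
    Kokkos::parallel_for("zero_forces", nlocal, KOKKOS_LAMBDA(const int i) {
      fx(i) = 0.0;                             // stand-in for the real per-atom kernel
    });
    Kokkos::fence();                           // wait for the kernel to finish
  }
  Kokkos::finalize();
  return 0;
}

The real pair style is of course far more involved (neighbor lists, ghost atoms, force accumulation), but the pattern of "loop body becomes a lambda" is the same throughout src/KOKKOS.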
Look at the stats after a run; the percentages will tell you where the bottleneck(s) are, e.g.:
Loop time of 0.942801 on 4 procs for 300 steps with 2004 atoms
Performance: 54.985 ns/day, 0.436 hours/ns, 318.201 timesteps/s, 637.674 katom-step/s
195.2% CPU use with 2 MPI tasks x 2 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.61419 | 0.62872 | 0.64325 | 1.8 | 66.69
Bond | 0.0028608 | 0.0028899 | 0.002919 | 0.1 | 0.31
Kspace | 0.12652 | 0.14048 | 0.15444 | 3.7 | 14.90
Neigh | 0.10242 | 0.10242 | 0.10242 | 0.0 | 10.86
Comm | 0.026753 | 0.027593 | 0.028434 | 0.5 | 2.93
Output | 0.00018341 | 0.00030942 | 0.00043542 | 0.0 | 0.03
Modify | 0.039117 | 0.039348 | 0.039579 | 0.1 | 4.17
Other | | 0.001041 | | | 0.11
Nlocal: 1002 ave 1006 max 998 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 8670.5 ave 8691 max 8650 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 354010 ave 357257 max 350763 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 708020
Ave neighs/atom = 353.30339
Ave special neighs/atom = 2.3403194
Neighbor list builds = 26
Dangerous builds = 0
If you are scaling to too many MPI ranks for too few atoms, then the Comm % will blow up.
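As a rough back-of-the-envelope estimate (my own, assuming roughly cubic MPI subdomains, a communication cutoff r_c, N total atoms, P MPI ranks, and number density rho), the ratio of ghost atoms to local atoms per rank is about

$$\frac{N_\text{ghost}}{N_\text{local}} \approx \frac{(L + 2 r_c)^3 - L^3}{L^3}, \qquad L = \left(\frac{N}{\rho\,P}\right)^{1/3}$$

At fixed N, increasing P shrinks the subdomain edge L, so the ghost/communication fraction, and with it the Comm %, grows. For example, the 10,000-atom case on 32 cores leaves only about 312 atoms per rank, which is very little work per rank for a long-cutoff potential like AIREBO.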