Poor performance of LAMMPS 7Aug2019 with CPU acceleration (all packages tested)

Hi all,

I am using the 7 Aug 2019 version of LAMMPS on an HPC cluster with dual Intel Xeon E5-2640 v4 CPUs. I compiled it with each of the CPU accelerator packages separately. When I test them on my system of 39,841 atoms with 10 CPU cores, I get terrible performance with every CPU accelerator package (input script attached).

When I look at the CPU utilization stats, I see that at any given point during the simulation some CPUs are utilized at ~100% and others at ~0%; but I think that is understandable, since it depends on the calculations being performed. This happens with every CPU accelerator package I use.

CPU configuration: dual Intel Xeon E5-2640 v4 (Broadwell, the architecture name used for the KOKKOS build)

compiler used: gcc version 4.8.5 (Red Hat 4.8.5-39)

Below are the compilation command, run command, and performance output from the log files for each CPU acceleration package:
(multi-threading was not used anywhere)

OPT package: cmake -D LAMMPS_MACHINE=mpi -D PKG_OPT=yes -D PKG_MANYBODY=on -D PKG_MOLECULE=on -D PKG_KSPACE=on ../cmake
mpirun -np 10 lmp_mpi -sf opt -in md.inp
Performance: 57.667 hours/ns

md.inp (1.08 KB)
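
(For reference, the analogous pure-MPI run commands for the remaining packages use the standard LAMMPS suffix switches; this is a sketch, not necessarily the exact invocations used here:

USER-OMP package:   mpirun -np 10 lmp_mpi -sf omp -in md.inp
USER-INTEL package: mpirun -np 10 lmp_mpi -sf intel -pk intel 0 -in md.inp
KOKKOS package:     mpirun -np 10 lmp_mpi -k on t 1 -sf kk -in md.inp
)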

I disagree with your assessment.

You are getting up to a 3x performance boost by CPU optimization alone. That is exceptionally good. How much more of a performance boost did you expect???
This is not like you are offloading calculations to additional components like GPUs (and even then you might run into issues, since your system is rather small for that use case). The only way to get a significantly larger speedup from code optimization alone would be if the default implementation were poor.

Please note that your system is rather small for a classical MD code to parallelize across many MPI ranks, and that you may have other issues due to its geometry and particle distribution; we cannot tell, since you only provide the input script and not a representative data file or the performance summary from the log files. Please also note that CPU utilization is not a measure of performance. In fact, CPU utilization will be higher with non-optimized code.

Please also note that a machine with the configuration you describe has 20 CPU cores in total, so when using 10 MPI ranks for the parallel runs and no multi-threading, you should have 10 idle CPU cores.
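
(A minimal sketch of how those idle cores could be put to work, assuming the USER-OMP package was built with OpenMP support; 2 threads per rank is just an illustrative value:

export OMP_NUM_THREADS=2
mpirun -np 10 lmp_mpi -sf omp -pk omp 2 -in md.inp
)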

Axel.

Is your model evenly distributed in space? If your simulation box contains pockets with large amounts of empty space, you might see some MPI tasks get no usage.
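
(If that is the case, one thing to try is LAMMPS' load balancing; a sketch for the input script, with placeholder thresholds and intervals:

# one-time rebalancing of the domain decomposition
balance 1.1 shift xyz 10 1.1

# or, if the particle distribution changes during the run, periodic rebalancing
fix lb all balance 1000 1.1 shift xyz 10 1.1
)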