N^(4/3) runtime scaling?

Hi everybody! I’ve noticed that for runs with a fixed number of processors on a single shared-memory machine, LAMMPS’ runtime for N-particle simulations with short-range pair potentials (e.g. Lennard-Jones) scales as N^x, where 1.2 < x < 1.4. The optimal linear scaling would be x = 1. Is x > 1 because of memory-access issues? The N I’m referring to (10^3-10^6.5) aren’t large enough to require the use of virtual memory, but I guess larger N still require the CPUs to access more “distant” parts of physical memory, which slows things down? IIRC this is a common problem when dealing with very large arrays.

Is that what’s going on here? If so, do you know of a reference that discusses this issue and how best to deal with it in MD simulations? I haven’t been able to find anything about it in the LAMMPS documentation…
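
For concreteness, the exponent can be extracted from a least-squares fit of log(runtime) against log(N); a minimal sketch (the timings in it are made-up placeholders, not measured data):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Least-squares fit of log(t) = log(a) + x*log(N) to estimate the scaling exponent x.
// The (N, t) pairs below are made-up placeholders; substitute measured wall-clock times.
int main() {
    std::vector<double> N = {1e3, 1e4, 1e5, 1e6};
    std::vector<double> t = {0.02, 0.35, 6.1, 110.0};   // seconds per run (placeholder values)

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(N.size());
    for (std::size_t i = 0; i < N.size(); ++i) {
        const double lx = std::log(N[i]);
        const double ly = std::log(t[i]);
        sx += lx; sy += ly; sxx += lx * lx; sxy += lx * ly;
    }
    const double x = (n * sxy - sx * sy) / (n * sxx - sx * sx);   // slope = exponent
    std::printf("runtime ~ N^%.2f\n", x);
    return 0;
}
```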

Thanks,
Rob

To determine what is causing unexpected performance differences, you usually need some level of profiling, e.g. using the CPU’s internal performance counters. Those could, for example, tell you about changes in cache efficiency, branch-prediction efficiency, CPU pipeline stalls, and so on.
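
On Linux, the simplest way to look at those counters is `perf stat -e cache-references,cache-misses <your lammps command>`. If you want them from inside your own code, the perf_event_open() syscall can be used directly; a minimal, Linux-only sketch (no real error handling, stand-in workload instead of an actual force loop) could look like this:

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Open one hardware counter for the calling thread (returns -1 if not permitted,
// e.g. because of /proc/sys/kernel/perf_event_paranoid).
static int open_counter(unsigned type, unsigned long long config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
    int fd_ref  = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_REFERENCES);
    int fd_miss = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);

    // Stand-in workload: stream over a large array, touching one double per cache line.
    // Replace this with the force/neighbor loop you actually want to characterize.
    std::vector<double> x(1 << 24, 1.0);
    double sum = 0.0;

    ioctl(fd_ref,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);
    for (std::size_t i = 0; i < x.size(); i += 8) sum += x[i];
    ioctl(fd_ref,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    long long refs = 0, misses = 0;
    if (read(fd_ref,  &refs,   sizeof(refs))   != (ssize_t)sizeof(refs) ||
        read(fd_miss, &misses, sizeof(misses)) != (ssize_t)sizeof(misses))
        std::fprintf(stderr, "could not read counters (check perf_event_paranoid)\n");

    std::printf("cache references %lld, misses %lld (sum=%g)\n", refs, misses, sum);
    return 0;
}
```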

There are some changes in code layout and memory access available (with increasing complexity, aggressiveness, dependence on compiler support, and impact) in the USER-OMP, OPT and USER-INTEL packages that may already indicate some of the performance limits.
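
As a generic illustration of what “code layout and memory access” changes mean (this is a sketch, not code taken from any of those packages), compare an array-of-structs layout with a struct-of-arrays layout for per-atom data:

```cpp
#include <vector>

// Array-of-structs: the fields of one atom are interleaved in memory.
struct AtomAoS { double x, y, z, q; };

// Struct-of-arrays: each field is its own contiguous array.
struct AtomsSoA { std::vector<double> x, y, z, q; };

// Summing only the x coordinates: the SoA loop streams one contiguous array and
// vectorizes trivially; the AoS loop loads 32-byte structs to use 8 bytes of each.
double sum_x_aos(const std::vector<AtomAoS> &atoms) {
    double s = 0.0;
    for (const AtomAoS &a : atoms) s += a.x;
    return s;
}

double sum_x_soa(const AtomsSoA &atoms) {
    double s = 0.0;
    for (double xi : atoms.x) s += xi;
    return s;
}
```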

Another issue you didn’t discuss is whether you are using processor affinity or not.
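
Without pinning, threads can migrate between cores and repeatedly lose their cache contents and (on multi-socket machines) their NUMA locality. The OMP_PROC_BIND/OMP_PLACES environment variables or the binding options of mpirun/srun handle this without touching code; if you want to do it inside your own OpenMP code, a minimal sketch (assuming one thread per core, with cores numbered 0..nthreads-1) would be:

```cpp
#include <omp.h>
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin each OpenMP thread to one core. Compile with: g++ -fopenmp pin.cpp
// (g++ defines _GNU_SOURCE by default, which CPU_SET and pthread_setaffinity_np need).
int main() {
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(tid, &set);                    // assumes core ids 0..nthreads-1
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        #pragma omp critical
        std::printf("thread %d pinned to core %d\n", tid, tid);
    }
    return 0;
}
```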

Making the most effective use of the hardware for MD simulation performance is tricky and depends on many aspects of how the code is designed and written. In general, the more readable, maintainable and extensible a package is, the less optimized it is. Comparing the code in the three packages listed above to their “standard” versions gives a good indication of this. Bio-MD codes, for example, typically have specially optimized subroutines for water-water interactions, since that is where the majority of their computational effort goes.
Or you can have code generators, or use extensive preprocessing macros, to produce optimized inner loops of the force kernels that include vectorization and optimized memory accesses. It is also possible to tweak the neighbor list generation for special cases, e.g. for multi-threading, to process groups of particles at once in vector operations, or to order particles along a space-filling curve to optimize caching.
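
As a sketch of the kind of inner loop this is about (generic code, not LAMMPS’ actual kernel; it assumes flat coordinate/force arrays and a per-atom neighbor list, and accumulates forces only on atom i so the loop stays free of scatter conflicts and can vectorize):

```cpp
#include <cstddef>
#include <vector>

// Lennard-Jones force on atom i from its neighbor list (no Newton's-3rd-law update
// of the neighbors). x and f are flat arrays packed as [3*j, 3*j+1, 3*j+2].
void lj_forces_one_atom(std::size_t i,
                        const std::vector<int> &neigh,
                        const std::vector<double> &x,
                        std::vector<double> &f,
                        double cutsq, double eps, double sigma6)
{
    const double xi = x[3*i], yi = x[3*i + 1], zi = x[3*i + 2];
    double fxi = 0.0, fyi = 0.0, fzi = 0.0;

    #pragma omp simd reduction(+ : fxi, fyi, fzi)
    for (std::size_t k = 0; k < neigh.size(); ++k) {
        const int j = neigh[k];
        const double dx = xi - x[3*j], dy = yi - x[3*j + 1], dz = zi - x[3*j + 2];
        const double rsq = dx*dx + dy*dy + dz*dz;
        if (rsq < cutsq) {
            const double r2inv = 1.0 / rsq;
            const double r6inv = r2inv * r2inv * r2inv;
            // F/r = 24*eps*(2*sigma^12/r^12 - sigma^6/r^6) / r^2
            const double fpair =
                24.0 * eps * r6inv * (2.0 * sigma6 * sigma6 * r6inv - sigma6) * r2inv;
            fxi += dx * fpair;
            fyi += dy * fpair;
            fzi += dz * fpair;
        }
    }
    f[3*i] += fxi; f[3*i + 1] += fyi; f[3*i + 2] += fzi;
}
```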

Also, within LAMMPS some tweaks are possible, e.g. the choice of neighbor list skin and other neighbor list settings, how to balance threading against MPI, or whether or not to use Newton’s third law for pairwise interactions across MPI subdomains.
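
For example (illustrative values only; the right choices depend on your potential, density, and hardware):

```
neighbor     2.0 bin                      # skin distance: larger skin = fewer rebuilds, longer lists
neigh_modify every 1 delay 0 check yes    # how often neighbor lists may be rebuilt
newton       on                           # Newton's 3rd law across subdomains; try off as well
package      omp 4                        # threads per MPI rank (USER-OMP)
suffix       omp                          # use the /omp variants of styles where available
```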

Axel.

Greetings,

Your simulation input, LAMMPS commands, and the model geometry as you increase N may be worth examining. What kind of input file and model are you using? Also, how did you compile LAMMPS, if you did so, or which binary did you use?

Adrian Diaz

Hi Axel. Happy New Year, and thanks for the very detailed reply! I’ll definitely look into processor affinity, optimizing loops & memory access, etc.

Best,
Rob

Hi Adrian. Thanks for the reply! The N^x runtime scaling seems to be pretty generic. For example, I also see it (with about the same x) in an OpenMP MD code I’ve written. I think it must have to do with one or more of the issues Axel mentioned…

Best,
Rob