I have a C++ program that calls LAMMPS via the dynamically linked library. For various reasons this program repeatedly does “unfix …”, “fix …”, “run N” where N is small. It’s quite a bit slower than a pure LAMMPS MD run, and I realize there are lots of possible sources of overhead in this approach of repeatedly unfixing/fixing and running lots of short trajectories.
In investigating those, a relatively large fraction of the time appears to be spent in setup of the pair style, in this case EAM, e.g. PairEAM::init_style (mainly interpolate, some array2spline and file2array). Because it’s in a dynamic library I can’t seem to get gprof information on the call graph, so I can’t tell exactly where these calls are being triggered from. Does anyone have an idea what in this unfix/fix/run process (if anywhere) might trigger calls to reinitialize the pair style?
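For concreteness, here is a stripped-down sketch (not my actual code) of the kind of driver loop I mean, using the C library interface from library.h of a recent LAMMPS; the setup file name, fix choice, and run length are placeholders.

#include "library.h"    // LAMMPS C library interface

int main()
{
  void *lmp = lammps_open_no_mpi(0, nullptr, nullptr);
  lammps_file(lmp, "in.setup");                  // placeholder: units, pair_style eam, read_data, ...

  for (int iter = 0; iter < 1000; ++iter) {
    lammps_command(lmp, "fix mymove all nve");   // in reality one of several fixes
    lammps_command(lmp, "run 8");                // short trajectory, small N
    lammps_command(lmp, "unfix mymove");
  }

  lammps_close(lmp);
  return 0;
}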
You don’t need gprof. A little bit of grep can get you just as far, and then some.
Pair::init_style() is called from Pair::init(), which is called from Force::init(), which is called from LAMMPS::init(). The latter is called from a bunch of places, but the one most likely hit in your case is the call made as part of the run command.
It can be avoided by using run N pre no, but if you create/delete fixes, that is not going to work, since you must initialize those before you can use them.
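From the library interface side that would look something like the sketch below (placeholder run lengths, continuing the kind of driver loop sketched above); note again that this only helps if the same fixes stay defined across all of the runs.

void run_many_short(void *lmp)     // lmp as obtained from lammps_open_no_mpi()
{
  lammps_command(lmp, "fix mymove all nve");
  lammps_command(lmp, "run 8");                    // first run: full setup (pre yes is the default)
  for (int iter = 0; iter < 999; ++iter)
    lammps_command(lmp, "run 8 pre no post no");   // continuation runs: skip re-init and the timing summary
  lammps_command(lmp, "unfix mymove");
}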
You mention creating/deleting fixes as one of the conditions under which pre no won’t work. Does that mean that I can run with pre no, moving atoms between calls to run 0, and the neighbor list will be rebuilt according to the usual rules, as long as I don’t change any neighbor list parameters? Or is pre yes required to even check whether a neighbor list needs to be rebuilt?
I guess what I want to know is what changes are allowed between calls to run N pre no post no, if any. I.e., what exactly counts as “settings” in:
“If a run is just a continuation of a previous run (i.e. no settings are changed), …”
I don’t want to change any LAMMPS parameters, just the atomic configuration (positions, cell, and types, ideally). Is that likely to work?
[edited] From what I can tell in run.cpp, all the initialization happens in the same place. I don’t see any way to separate recreating the PairEAM splines (which happens in the force init, called from the LAMMPS init) from the neighbor list check.
That depends on what you change. For example, changing atom types creates a problem under certain conditions, e.g. the tail correction for pressure depends on the concentration of atom types. Basically, you can make changes similar to what happens during regular time integration. For example, positions can be updated, but not so far as to require a border communication or a full reneighboring, i.e. not beyond half of the neighbor skin distance or outside the ghost atom cutoff.
Issuing commands almost always requires a full init. Those commands usually expect that any changes they make will be made consistent across the whole domain (i.e. that the ghost atoms will be updated accordingly) during an init.
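To illustrate the kind of small position update that should stay within those limits, here is a hedged sketch (assuming a recent LAMMPS and its C library interface; the helper name and displacement are made up). Note that lammps_gather_atoms()/lammps_scatter_atoms() require an atom map (e.g. atom_modify map yes) and consecutive atom IDs.

#include "library.h"
#include <vector>

// Hypothetical helper: shift all atoms slightly along x between short runs.
void nudge_and_continue(void *lmp, double dx)
{
  int natoms = (int) lammps_get_natoms(lmp);
  std::vector<double> x(3 * natoms);

  lammps_gather_atoms(lmp, "x", 1, 3, x.data());   // 1 = double data, 3 values per atom
  for (int i = 0; i < natoms; ++i) x[3*i] += dx;   // keep dx well below half the neighbor skin
  lammps_scatter_atoms(lmp, "x", 1, 3, x.data());  // owned atoms only; ghost copies are not refreshed here

  // Continue without a full (re)init. With pre no, the forces from the end of
  // the previous run are reused for the first half step.
  lammps_command(lmp, "run 8 pre no post no");
}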
If it is just the EAM splines that are costing you so much, I would consider adding some code to the pair style to cache them between invocations, so that its init_style() function only regenerates them when they need to be regenerated (e.g. after a pair_coeff command), perhaps via a pair_modify option or something similar.
I just made a test, and it looks to me like you are mistaken about where the time is being lost.
It is pretty straightforward to implement this, and I find that the performance impact is not measurable.
The main performance difference between pre yes and pre no is that with pre no you do not recompute the forces (which are used in the first half step of the velocity Verlet algorithm, before the position update). If you have a construct like this:
variable n loop 1000
label again
run 1 post no pre yes
next n
jump SELF again
The time with pre yes is essentially double what it takes with pre no, since the number of force computations is doubled.
Here is the patch for you to check out.
diff --git a/src/MANYBODY/pair_eam.cpp b/src/MANYBODY/pair_eam.cpp
index a3d4257cc2..08818b906c 100644
--- a/src/MANYBODY/pair_eam.cpp
+++ b/src/MANYBODY/pair_eam.cpp
@@ -66,6 +66,7 @@ PairEAM::PairEAM(LAMMPS *lmp) : Pair(lmp)
frho_spline = nullptr;
rhor_spline = nullptr;
z2r_spline = nullptr;
+ compute_splines = true;
// set comm size needed by this Pair
@@ -370,6 +371,7 @@ void PairEAM::settings(int narg, char **/*arg*/)
void PairEAM::coeff(int narg, char **arg)
{
if (!allocated) allocate();
+ compute_splines = true;
if (narg != 3) error->all(FLERR,"Incorrect args for pair coefficients");
@@ -420,9 +422,11 @@ void PairEAM::coeff(int narg, char **arg)
void PairEAM::init_style()
{
// convert read-in file(s) to arrays and spline them
-
- file2array();
- array2spline();
+ if (compute_splines) {
+ file2array();
+ array2spline();
+ compute_splines = false;
+ }
neighbor->add_request(this);
embedstep = -1;
diff --git a/src/MANYBODY/pair_eam.h b/src/MANYBODY/pair_eam.h
index 67bbde570d..bb67101e45 100644
--- a/src/MANYBODY/pair_eam.h
+++ b/src/MANYBODY/pair_eam.h
@@ -67,6 +67,7 @@ class PairEAM : public Pair {
double cutforcesq;
double **scale;
bigint embedstep; // timestep, the embedding term was computed
+ bool compute_splines;
// per-atom arrays
From what I can tell, PairEAM::interpolate is called by array2spline, which is called by init_style, so it’s not part of compute() that I’m somehow failing to attribute correctly. The call graph indicates that int_mallinfo is called somewhere below Output::setup.
It could be that my fixes are setting something unnecessary that’s causing extra overhead. I definitely don’t think I understand everything they’re doing, since they are modified versions of other fixes, adjusted by empirical observation.
Sorry, but this looks bogus. The only way I can think of for the splining in pair style EAM to take that much time would be a system with a very large number of atom types, comparatively few atoms, and a large EAM table. The computational cost of this step should scale as O(N**2) with the number of atom types.
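As a rough back-of-envelope (with an assumed tabulation length, not a number from your run): an EAM setfl/funcfl file typically tabulates each function on a few thousand to ten thousand points, so with 2 atom types the splining has to fit on the order of ntypes^2 x 10^4, i.e. a few times 10^4, cubic-spline intervals at a handful of floating-point operations each. That should finish in a fraction of a millisecond per init, so it can only show up as a noticeable percentage when everything else in the run is comparably tiny.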
I think the best way to proceed is to emulate what you are doing with a simple input file, using only features available in regular LAMMPS. I would start by removing as much as possible from your special program, and also checking how much impact the fix/unfix calls have, until you have just the core of what is causing the performance regression, and then try to replicate that with a regular input.
I’m not sure what you mean by “bogus”. It either means that it sucks (as in “send them to the iron maiden! excellent! execute them! bogus”), in which case I agree, or it means it’s fake, in which case I’m not sure what to say except that it’s what perf reports.
I’ll see if I can reproduce it with an input file, and whether that depends on what fixes I use.
I don’t know what was going on with int_mallinfo in my original run, and I guess I need to figure it out, because it accounts for even more time than the interpolate.
In any case, even with repeated fix nve the PairEAM::interpolate percentage is similar to my original run - not overwhelming, but measurable. This is with 2 atom types, 32 atoms, and 1000 loop iterations with 3 x “run 8” commands (to approximately reproduce what my code really does, although 32 atoms is probably too few). It’s not noticeably different with my own fixes instead of nve.
Right now my code randomly picks from 3 possible fixes, which is why I keep having to unfix, run 8, fix. But I can merge those into a single fix that does all 3 types of steps, and then do something more like run 40, which will be a lot better even if I don’t manage to do anything else to reduce the run init cost.
As I experiment with various compiler flags to figure out the profiling confusion: is there a way to add flags to the defaults that cmake will use? When I set the CXXFLAGS environment variable or use -D CMAKE_CXX_FLAGS=..., the flags I list replace the defaults, and also wipe out the C and Fortran compiler flags that are printed by cmake.
[edited] I can get around this by noting the default flags from a plain cmake run and then manually concatenating the profiling flags.
I am confused. You don’t need to instrument your binary when you are using perf for profiling. That is its major appeal. Adding instrumentation can have negative side effects by prohibiting inlining and adding significant overhead to small/short functions.