Quickly extracting forces

Hello,

I am extracting forces from many atomic configurations and I have coupled my LAMMPS with C++ to reduce overhead and use the forces as variables in my C++ code. This is a lot faster than reading and writing dump files, but there is still some overhead associated with getting the forces. This is because I call "run 0 pre yes post no" up to 10^8 times for different configurations, and then extract the forces for those configurations. I was wondering if there is a faster way to accomplish this task.

Here is a simplified version of my C++ code to demonstrate my point:

#include <iostream>
#include <array>
#include <math.h>
#include <algorithm>
#include "mpi.h"

// these are LAMMPS include files
#include "lammps.h"
#include "input.h"
#include "atom.h"
#include "library.h"

using namespace std;

// LAMMPS NAMESPACE AND POINTERS
using namespace LAMMPS_NS;
char *args1[] = {
    (char *) "lmp",
    (char *) "-screen",
    (char*) "none",
     0};
LAMMPS *lmp;

const int N=64; // number of atoms
double f[3*N]; // per-atom force vector

int main(int argc, char **argv)
{
  MPI_Init(&argc,&argv);

  LAMMPS *lmp;

  lmp = new LAMMPS(3,args1,MPI_COMM_WORLD);

  // Prepare the run
  lmp->input->one("log none");
  lmp->input->one("neighbor 1.0 multi");
  lmp->input->one("boundary p p p");
  lmp->input->one("units metal");
  lmp->input->one("atom_style atomic");
  lmp->input->one("atom_modify map array");
  lmp->input->one("lattice diamond 5.431");
  lmp->input->one("region box block 0 2 0 2 0 2 units lattice");
  lmp->input->one("create_box 1 box");
  lmp->input->one("create_atoms 1 box");
  lmp->input->one("mass 1 28.0855");
  lmp->input->one("pair_style tersoff");
  lmp->input->one("fix 1 all nve");
  lmp->input->one("pair_coeff * * Si.tersoff Si");
  int loops = pow(10,4); // ---------------------------------------- Increase this number to observe overhead ---------------------------------------------
  for (int i=1; i<=loops; i++)
  {
    // Displace the atoms
    lmp->input->one("displace_atoms all random 0.01 0.01 0.01 5");
    // Evaluate the potential
    lmp->input->one("run 0 pre yes post no");
    // Get the forces
    lammps_gather_atoms(lmp,"f",1,3,f);
  }
  MPI_Finalize();
}

And my potential file looks like this:

# DATE: 2007-10-25 CONTRIBUTOR: Aidan Thompson, [email protected] CITATION: Tersoff, Phys Rev B, 37, 6991 (1988)

# Tersoff parameters for various elements and mixtures
# multiple entries can be added to this file, LAMMPS reads the ones it needs
# these entries are in LAMMPS "metal" units:
# A,B = eV; lambda1,lambda2,lambda3 = 1/Angstroms; R,D = Angstroms
# other quantities are unitless

# This is the Si parameterization from a particular Tersoff paper:
# J. Tersoff, PRB, 37, 6991 (1988)
# See the SiCGe.tersoff file for different Si variants.

# format of a single entry (one or more lines):
# element 1, element 2, element 3,
# m, gamma, lambda3, c, d, costheta0, n, beta, lambda2, B, R, D, lambda1, A

Si Si Si 3.0000 1.0000 1.3258 4.8381 2.0417 0.0000 22.9560
    0.3368 1.3258 95.3738 3.0000 0.2000 3.2394 3264.7000

The purpose of my C++ code is to perturb the atoms and get the forces for the perturbed configurations. Notice the "loop" variable in my code. The overhead I've experienced with this value are as follows:

10^3 loops: 1.1 seconds
10^4 loops: 11.2 seconds
10^5 loops: 145 seconds
10^6 loops: 11 minutes
10^7 loops: 1 hour
10^8 loops: 10 hours

If I change the run command to "run 0 pre no post no" this overhead is drastically reduced (obviously) but then the newly gathered forces are not representative of the perturbed configuration, since "pre no" acts as if nothing was changed.

I was wondering if there is a quicker way to get the forces from snapshots of many configurations. As shown, 10^8 calls of "run 0 pre yes post no" takes ~10 hours. I was hoping to do 10^10 runs for my purposes, and this would take a while.

Is there a quicker way to get the forces from many configurations, or have I done all I can do?

You don’t say whether you are running LAMMPS in parallel.

This call in your loop is slow:
lammps_gather_atoms(lmp,“f”,1,3,f);

b/c it is creating a new array (N,3) to hold the

forces and copying data into it to return to the caller.

In parallel it also involves communication.

If you instead use

f = (double **) lammps_extract_atom(lmp,“f”);

you will just get a pointer to the force array,

the same one LAMMPS uses internally.

You are then pointing directly at the memory

inside LAMMPS and can do whatever you want,

read the values, overwrite them, etc.

Note that in parallel, each processor will return

a pointer to the list of atoms on that processor.

There is nearly no overhead in that function call.

Steve

Also, I can’t remember if you’re the same person I recommended

this to.

You should time LAMMPS running 10^9 steps (or whatever) running

w/out calling it from C++. Just a single

run 100000…

command in your input script.

That’s the baseline timing you can’t do better than.

Until you know that, you don’t know how “inefficient”

any other method is that you’re trying.

Steve

i don't think that this is the problem here. at least not when running
with 1 processor.

what is called "overhead" is essentially the time it requires to
compute the forces and that cannot be avoided.

with "run 0 pre no post no" the forces will be computed on the very
first call since that pair->compute() during integrate->setup(), but
then never again since for all subsequent run commands only
integrate->setup_minimal() is called which does not call
pair->compute(). this is confirmed by the forces not being updated
after the displacements.

with "run 0 pre yes post no" pair->compute() is called every time,
since it calls integrate->setup() every time. there is some additional
cost since it also does some more additional work.

this additional (real) overhead can be reduced by using "run 1 pre no
post no", but since the desired propagation is that of a random
displacement, the call that instantiated fix nve has to be disabled.

this can be *very easily* confirmed by doing some simple profiling
(using the linux kernel level profiler for simplicity).

run 0 pre yes post no gives:

# Samples: 45K of event 'cycles'
# Event count (approx.): 35949059439

Also, I can't remember if you're the same person I recommended
this to.

You should time LAMMPS running 10^9 steps (or whatever) running
w/out calling it from C++. Just a single
run 100000...
command in your input script.

That's the baseline timing you can't do better than.
Until you know that, you don't know how "inefficient"
any other method is that you're trying.

...and a quick profiling as i have just pointed out would have
demonstrated the same and given additional insight.

i have to say that it is quite amusing to see something called
overhead (w/o proof) what essentially constitutes the core of the
science of a model. :wink:

axel.

I’m not running in parallel.

Thanks for the suggestion, getting the pointer to the force array did improve the speed slightly. But even if I don’t extract the forces at all, the overhead is about the same as my original method.

But you’re right, my method takes the same amount of time as calling “run 10000…”, so I guess this is as good as I can do.

Thanks Axel that was very informative. I never thought to run for 1 timestep without using a fix...

My code now takes about the same amount of time for N loops of "run 0" as it does to run for N timesteps, so I suppose it can't be faster without running in parallel now