Creating a serial LAMMPS object from the C++ API

I would like to use LAMMPS as a library from C++, and my own program can optionally use MPI, but I would like LAMMPS to always run as a serial process, so I compiled the shared library with mpi_stubs. My understanding is that this will create fake no-op local copies of all the MPI symbols.

The question is what to do about the communicator, which appears to be required by the C++ constructor of LAMMPS. There’s no obvious C++ analogue of library.cpp’s lammps_open_no_mpi(). I can think of several possibilities:
1. There is such a routine, and I missed it.
2. It doesn’t matter, because the communicator argument will be ignored.
3. There’s a way for my program to get access to the fake MPI_COMM_WORLD symbol internal to LAMMPS.
4. I have to pass MPI_COMM_SELF, but in that case what if my program doesn’t use MPI at all?

Which of these, if any, is the right way for a C++ program to create a LAMMPS object from a serially compiled LAMMPS library?

Why don’t you just use the C library interface?
Is there a particular reason you need to create the LAMMPS object in C++?

To explain my questions a bit further: the library.h header is the only core LAMMPS header that does not require the mpi.h header. That is, unless you set the define that makes the lammps_open() function visible, which then also requires including mpi.h, and not just any mpi.h but exactly the same one that was used to compile LAMMPS.

If you implicitly include mpi.h, you must make certain that the LAMMPS headers see the same mpi.h as the rest of your code. The only way around that is to arrange that none of the files that include LAMMPS headers also make MPI calls: those files then have to be compiled against the mpi.h from the STUBS library, without the MPI compiler wrappers, while the rest of your files are compiled with the wrappers and the other mpi.h. At that point, it would be simpler to compile LAMMPS with the same MPI library as your application, create a custom communicator containing just one MPI rank with MPI_Comm_split(), and use that communicator when creating the LAMMPS class instances directly from C++.
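For illustration, a minimal sketch of that last approach, assuming LAMMPS and the application are built against the same MPI and a reasonably recent LAMMPS where the constructor is LAMMPS(int, char **, MPI_Comm); the "-log none" flag and the argument handling follow the coupling examples in the LAMMPS documentation:

#include <mpi.h>
#include "lammps.h"     // from the LAMMPS src directory

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // split off a communicator that contains only this rank,
  // so every LAMMPS instance created with it runs serially
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm single_rank;
  MPI_Comm_split(MPI_COMM_WORLD, me, 0, &single_rank);

  const char *lmpargv[] = {"liblammps", "-log", "none"};
  int lmpargc = sizeof(lmpargv) / sizeof(const char *);
  LAMMPS_NS::LAMMPS *lmp =
      new LAMMPS_NS::LAMMPS(lmpargc, (char **) lmpargv, single_rank);

  // ... drive the serial instance, e.g. via lmp->input->one("...") ...

  delete lmp;
  MPI_Comm_free(&single_rank);
  MPI_Finalize();
  return 0;
}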

Thus using the C library interface would be far simpler. It is also less risky, since any fatal error may lead to an MPI abort. You definitely want to compile your LAMMPS library with exceptions enabled and put wrappers that catch exceptions around all function calls (or use the C library interface, which does that automatically, and just query the exception state).
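For reference, a minimal sketch of that C-library route, assuming a LAMMPS build with exceptions enabled and a recent enough library.h where these calls take const char * arguments; lammps_open_no_mpi(), lammps_command(), lammps_has_error(), lammps_get_last_error_message() and lammps_close() are the documented interface, the rest is illustrative:

#include <cstdio>
#include "library.h"    // LAMMPS C library interface; does not need mpi.h

int main()
{
  void *lmp = nullptr;
  lammps_open_no_mpi(0, nullptr, &lmp);   // serial instance, no MPI required

  lammps_command(lmp, "units lj");
  // ... further setup commands, or lammps_file(lmp, "in.script") ...

  // with exceptions enabled, fatal errors can be queried instead of aborting
  if (lammps_has_error(lmp)) {
    char msg[2048];
    lammps_get_last_error_message(lmp, msg, sizeof(msg));
    std::fprintf(stderr, "LAMMPS error: %s\n", msg);
  }

  lammps_close(lmp);
  return 0;
}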

The missing context is that I want to use LAMMPS (some of the time) for single point evaluation, and doing lammps_command("run 0") seems to incur a substantial overhead. I was hoping that from the C++ API there might be a way to trigger a pe compute without the rest of the run process. Then I realized that I didn’t even know how to initialize the C++ object in my use case.

I’m not complaining about the C++ API being low-level and not supporting this, just wanted to know if I was missing something.

This overhead is required. The compute pe command does not do any computation; it just copies data that is collected during a full force evaluation, and that is exactly what happens during a “run 0” command. If you don’t do a full setup, you will get invalid data, e.g. from the previous step. You can save some effort (and output) by using “run 0 post no”, and if you don’t make any changes to your system you can save some more overhead by using “run 0 pre no post no”, but bypassing the init and setup phases of the run command completely doesn’t get you anything useful.

Obviously some of the overhead is needed, but it’s taking 8 times longer than a single step of a run N (~2.2 s for 100 steps vs. 0.28 s “Loop time”, 0.37 s elapsed time), so clearly it’s doing lots of stuff it doesn’t need to. Calling lammps_command("run 1") 100 times through the C API is not as slow as a LAMMPS input-file loop that does 100 x run 1: the latter has a summed “Loop time” of 3.5 s and an actual elapsed time of 23 s.

I realize this is a weird use case, so I’m expecting you to say that it’s too far out of scope for you to address, but I’d also be happy to move this discussion to a new thread if you prefer.

When I run the LJ benchmark example with “run 100” in serial on my laptop here, it takes about 5 seconds. When I replace the “run 100” with:

variable a loop 100
label jump
run 1 pre no post no
next a
jump SELF jump

It also takes about 5 seconds. However, if you make changes to the system (like changing atom types, charges, positions, etc.) you must do a full setup, e.g. to rebuild the neighbor lists and do various initialization calculations. Thus you must use “pre yes” (which is the default), and that increases (for the given example) the run time to 17 seconds. The difference between “run 100” and “run 1” is then due to not being able to make those kinds of “disruptive” changes that are allowed in between runs. Please note that using “run 1” will trigger two force evaluations: one during the setup and a second during the timestep processing. With “run 0” one of those is avoided, so the total time of the loop version is already reduced to 10 seconds on my machine.
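If it helps to make that comparison concrete from a driver program, here is a small timing sketch; the three command strings are the variants discussed above, while the scaffolding (the function name, the 100 repetitions, the use of lammps_command() from library.h) is only an assumption for illustration:

#include <chrono>
#include <cstdio>
#include "library.h"

// time 100 repetitions of a given run command on an already set up instance
static double time_loop(void *lmp, const char *runcmd)
{
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 100; ++i) lammps_command(lmp, runcmd);
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
  return dt.count();
}

// usage, after reading the LJ benchmark input into *lmp*:
//   std::printf("%g s\n", time_loop(lmp, "run 1 pre no post no"));
//   std::printf("%g s\n", time_loop(lmp, "run 1"));   // full setup every call
//   std::printf("%g s\n", time_loop(lmp, "run 0"));   // one force evaluation fewer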

The setup procedure is not optimized because there is not much point in doing so for an MD code: it is run rarely, so it does not contribute much to the total time, and it is important that it is reliable and can handle all usual and unusual circumstances.

Bottom line: if you modify your system between the two run invocations, there is no way to avoid the overhead unless you write an MD code specifically for that purpose, which would then look more like a Monte Carlo code than an MD code. Also, your estimate from comparing a full run to “run 1” overestimates the overhead of using “run 0”.

Interesting. You’re right about “run 100” vs. 100 x “run 1”: I had a big mistake in my loop. With that fixed, I can reproduce your results (0.5 s for my 10^3 SC-lattice LJ system, with or without the loop), and if I don’t use “pre no post no”, I get about 2.3 s. I didn’t realize how much of a difference “pre no post no” makes.

If I could reproduce the performance of 100 x “run 1 pre no post no”, I’d be happy. However, if I run an initial lammps_command("run 0"), then repeatedly move atoms (no changes to types or cell for now) with lammps_scatter_atoms and try to re-evaluate with lammps_command("run 0 pre no post no"), the extracted pe compute never changes, and the time is even faster (0.07 s). The LAMMPS input-file loop must be doing something that just repeatedly calling lammps_command doesn’t do. Even more oddly, if I do lammps_command("run 1") I always get 0 for the pe compute.
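To show roughly what my driver does (a simplified sketch, not my actual code; lammps_scatter_atoms() and lammps_extract_compute() are the library.h calls I mean, while the function name, iteration count, and position array are placeholders):

#include "library.h"

void reevaluate(void *lmp, double *new_x)   // new_x: 3*natoms doubles, ordered by atom ID
{
  lammps_command(lmp, "run 0");             // initial full setup + force evaluation

  for (int iter = 0; iter < 100; ++iter) {
    // overwrite all positions ("x": type 1 = double, 3 values per atom)
    lammps_scatter_atoms(lmp, "x", 1, 3, new_x);

    // re-evaluate without setup; this is the step that seems not to pick up the moves
    lammps_command(lmp, "run 0 pre no post no");

    // built-in potential-energy compute, as a global scalar
    double *pe = (double *) lammps_extract_compute(lmp, "thermo_pe", 0, 0);
    // ... *pe never changes between iterations ...
  }
}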

P.S. If you feel this is moving too far from what you want to deal with, please say so before getting annoyed with trying to address my questions (if it’s not too late). I know this isn’t what LAMMPS was designed for.

I forgot to mention one more thing. If you run in a loop and move atoms in between, you must update the neighbor list. Typically, that is not done that often during a run, so in my example I wasn’t making a fair comparison: the “run 100” command would also have to be run with “neigh_modify delay 0 every 1 check no”. The LJ benchmark runs the neighbor list build only every 20th step, and that saves quite a chunk of time.

If you move atoms you must not use “pre no”, and you must have a neighbor list rebuild, so that is an additional cost you must incur. Like I already mentioned, unless you do significant programming of the internal LAMMPS code (and run the high risk that you skip over initializing something that is expected to be initialized and updated, and thus get (subtly?) bogus data), I don’t see a simple way to save time. What you are proposing would require writing a kind of “single_point” command (instead of “run”) that would allow you to selectively choose which subsystems to re-initialize or not. But that requires a lot of programming. Then you could just as well write your own “C++ driver” that is API compatible with the LAMMPS pair styles (and/or whatever else you would need) and provide the coordinate and neighbor list management yourself, in the format that those pair styles expect.

This is too opaque. You need to provide a complete and simple enough example to demonstrate where you are not getting the data you expect.

This is quite suitable for the LAMMPS development category. I think this discussion will be useful for others. The fact that it is pushing things to a more extreme level is part of the game: if people hadn’t pushed the boundaries in the past, LAMMPS would not be where it is now. While it is sometimes difficult to discuss internals with somebody who is not familiar with them, such discussions are also helpful for re-evaluating my own understanding. Over time one sometimes becomes complacent and forgets details or becomes used to glossing over them. Also, this would not be the first discussion that ends without a satisfactory resolution at the time but inspires changes or improvements at a later point. Being aware of where people reach the boundaries of the capabilities of the code can be useful when rebuilding some internal control flows or refactoring parts of the code. Sometimes this just needs a little spark from somewhere else to ignite a good idea and a significant improvement. Or not.

OK - let me post some more specific inputs.

FWIW, I am using a very similar neigh_modify command (both for pure LAMMPS and when calling from C++), except that I had check yes. When I switch to check no, the speeds of the LAMMPS loop and the C++ loop are very similar. That suggests to me that check yes isn’t working when my C++ code moves atoms using lammps_scatter_atoms(..."x"...), although it seems to work OK in a pure LAMMPS run.

Let me make some simplified code and then I’ll post it.

Quick question, as I try to formulate my simple example: should "run 0 pre no" work if atoms move, as long as they don’t move enough to need a new neighbor list, i.e. max displacement < skin/2?

No.

OK. In that case, should I expect the speed of "run 0 pre yes" to be different if atoms move less than skin/2 and I compare "neigh_modify delay 0 every 1 check yes" to check no?

Yes. You just need to compare the time spent in the Neigh section of the post-run summary (i.e. with “post yes”).

Answering my own question, check is ignored with a repeated run 0 because Verlet::setup() is called repeatedly, and it always calls neighbor->build(1). Within a single run N command Verlet::run() checks neighbor->decide() == 1 to see if the build is needed. Before I investigate more, how terrible an idea do you think the following patch might be? Is it obvious to you that it’ll break anything? Do you have any particular tests to suggest? Obviously I can move some atoms around randomly with a sequence of run 0 pre yes, and see if it changes anything.

> git diff
diff --git a/src/verlet.cpp b/src/verlet.cpp
index b9b0b39..a178de2 100644
--- a/src/verlet.cpp
+++ b/src/verlet.cpp
@@ -121,9 +121,11 @@ void Verlet::setup(int flag)
   if (triclinic) domain->lamda2x(atom->nlocal+atom->nghost);
   domain->image_check();
   domain->box_too_small_check();
-  modify->setup_pre_neighbor();
-  neighbor->build(1);
-  modify->setup_post_neighbor();
+  if (neighbor->decide() == 1) {
+      modify->setup_pre_neighbor();
+      neighbor->build(1);
+      modify->setup_post_neighbor();
+  }
   neighbor->ncalls = 0;
 
   // compute all forces

You can never know whether the data used for neighbor->decide() is valid or not.
It most certainly won’t be at the very first step. Like I already said, if you want to tweak which parts are run or not, you will basically have to build your own command. That command could create a fix store instance to hold persistent data, be called repeatedly, and let you select from the outside which parts of the flow of control of the “init()” and “setup()” steps are run. And it will be very “dangerous”, because you can never know whether some feature in LAMMPS expects to be run after a full “init()/setup()” and thus will result in undefined behavior when you skip parts of it. The only way to know is if you are entirely in control of all data, i.e. when your code does “everything” except the actual call to Force::compute().
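To make the shape of that suggestion concrete, here is a very rough header skeleton for such a command style; the name “single_point” and everything in it are made up for illustration, it assumes a recent LAMMPS where command styles derive from Command (command.h) and are registered with the CommandStyle macro, and it deliberately omits all of the actual init()/setup() selection logic and fix store handling:

// single_point.h -- hypothetical command style, sketch only
#ifdef COMMAND_CLASS
CommandStyle(single_point,SinglePoint);
#else

#ifndef LMP_SINGLE_POINT_H
#define LMP_SINGLE_POINT_H

#include "command.h"

namespace LAMMPS_NS {

class SinglePoint : public Command {
 public:
  SinglePoint(class LAMMPS *lmp) : Command(lmp) {}

  // would parse flags selecting which parts of init()/setup() to redo,
  // keep persistent state between invocations (e.g. via a stored fix),
  // and then trigger only the force evaluation needed for a single point
  void command(int narg, char **arg) override;
};

}    // namespace LAMMPS_NS
#endif
#endif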

Indeed, it doesn’t work on the first call. My code does basically know whether anything other than positions/cell changed since the last call, so it might be able to do this in a way that’s safe enough for my own purposes. But I agree that it doesn’t seem likely that LAMMPS would be able to handle this internally.