Access eatom on device

Hello,

I would like to compute pe/atom, but only on the device, since I’m using Kokkos. As I’m using the Tersoff pair style, pe/atom is essentially just eatom (plus handling of the ghost atoms). Is there a clean way to access the k_eatom view of PairTersoffKokkos? At the moment I have added a getter function in the header file, but I would prefer not to touch the original code.

Thanks in advance.

Regards,

You have to clarify what exactly you mean by this. If you are using compute pe/atom, you have to transfer the data to the host, since there is no other way to access it for any proper processing in LAMMPS.

Hello, thanks for taking interest.

I’m using LAMMPS as a library and adding code around it.

I want to process and filter the potential energy of some atoms on the GPU. At the moment I’m using lammps_extract_compute to get the double * and transfer it to the GPU as an unmanaged Kokkos::View.

As my data processing lives entirely on the GPU, I wanted to check if it’s possible to get pe/atom on the GPU only. At the moment, the energy eatom is computed on the GPU, then synchronized to the CPU, where lammps_extract_compute is called, and then the data is copied back to the GPU.
If I could remove the extra steps, that’d be great for performance.
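For reference, the round trip described above could look roughly like this. This is an illustrative sketch, not tested code: the compute id "peatom", the view types, and the function name are assumptions; only lammps_extract_compute and the LMP_STYLE_ATOM/LMP_TYPE_VECTOR constants come from the official C library interface.

```cpp
#include "library.h"        // LAMMPS C library interface
#include <Kokkos_Core.hpp>

// Assumes the input script defined, e.g.:  compute peatom all pe/atom
void fetch_pe_atom(void *lmp, int nlocal,
                   Kokkos::View<double*> d_pe /* device view */) {
  // 1) eatom is accumulated on the GPU, synchronized to the host,
  //    and exposed as a plain host pointer by the library interface
  double *pe_host = (double *) lammps_extract_compute(
      lmp, "peatom", LMP_STYLE_ATOM, LMP_TYPE_VECTOR);

  // 2) wrap the host pointer in an unmanaged host view ...
  Kokkos::View<const double*, Kokkos::HostSpace,
               Kokkos::MemoryTraits<Kokkos::Unmanaged>>
      h_pe(pe_host, nlocal);

  // 3) ... and copy it right back to the device
  Kokkos::deep_copy(d_pe, h_pe);
}
```

Steps 1 and 3 are the host round trip the question is about eliminating.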

Another way would be to implement a compute_pe_atom_kokkos, but I don’t fully understand the source code yet.

Thanks

I have two general comments on this:

  1. The library interface knows nothing about KOKKOS, and we want to keep it that way. It is meant to be high-level and abstract; what you are trying to do is the opposite. If performance and the cost of data transfers matter that much, you should rather modify the KOKKOS package source code directly. You mentioned using the Tersoff pair style, so you could either modify it to include your processing, or create a derived class, which would separate the original pair style class from your additional processing.
  2. Are you sure that this is not a case of “premature optimization”? I.e., that you are trying to optimize something that has very little impact on total performance. Do you have any numbers providing evidence that this kind of hackish programming is needed? You are the first person I recall asking something like this in connection with KOKKOS, but we regularly get similar questions where people have unjustified and irrelevant performance concerns and propose rather complex steps that are not really needed. I am not saying that this is the case here, but if I were in your place, I would only want to dig deeper after having convinced myself that the effort is justified.
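To illustrate the derived-class idea in point 1, a sketch might look like the following. This is untested and purely illustrative: the base class template parameter, the compute() signature, and the d_eatom member are assumptions to be checked against pair_tersoff_kokkos.h in your LAMMPS version.

```cpp
// Sketch only: verify all names against pair_tersoff_kokkos.h
#include "pair_tersoff_kokkos.h"

template<class DeviceType>
class PairTersoffFiltered : public LAMMPS_NS::PairTersoffKokkos<DeviceType> {
 public:
  using Base = LAMMPS_NS::PairTersoffKokkos<DeviceType>;
  using Base::Base;

  void compute(int eflag, int vflag) override {
    Base::compute(eflag, vflag);   // regular Tersoff force/energy pass

    // post-process the per-atom energies without leaving the device;
    // d_eatom is assumed to be the device view filled when eflag_atom is set
    auto d_eatom = this->d_eatom;
    Kokkos::parallel_for("filter_pe_atom", this->atom->nlocal,
      KOKKOS_LAMBDA(const int i) {
        // ... your filtering/processing of d_eatom(i) ...
      });
  }
};
```

This keeps the custom processing out of the original pair style class, at the cost of maintaining the derived class against upstream changes.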

That would be a rather pointless undertaking because a) the way a compute provides access to data is by processing it and making it available on the host, and b) the pe computes don’t actually do any computing (despite the name), but rather only make data available that has been accumulated during the force computation. This is why (internally) any code trying to get data from a compute needs to call the addstep/clearstep functions of the Modify class to indicate at which upcoming step this data has to be accumulated. If you want to understand the flow of control, you should rather study the regular CPU code. Looking at the KOKKOS code is like looking at an “encoded version” of the same logic, due to the requirements of Kokkos programming.

Hello,

Sorry, maybe I was not clear enough in my explanation. I don’t want to add code inside LAMMPS itself, nor its library interface. I’m using it as a library and creating code around it.

I did a quick and dirty benchmark comparing lammps_extract_compute plus sending the data back to the GPU against accessing eatom directly: it’s 500 µs vs. 15 µs. As I don’t want the data on the host, the improvement is nice.

Finally, regarding the compute, my request could be summarized as having a compute make its data available/accessible on the GPU. But I understand that it may be difficult to generalize that idea.

Thanks for the tip regarding the addstep/clearstep functions, I’ll try to use them to reduce the number of unneeded compute accumulations.

Regards

As I already stated, the library interface is meant to be a high-level interface and thus the kind of direct access to internal data you are asking about is outside of its scope.

LAMMPS is open source and you can modify it any which way you like, but it is up to you to make sure you don’t break anything. There is currently no chance that any such change will make it into the official LAMMPS distribution.

The question is not how much time you can save on the data access itself, but how it compares to the total evaluation of the force kernel. While a 33x speedup sounds nice, it would be rather meaningless if the total force evaluation takes on the order of 10 milliseconds or more: the overhead would then be less than 5%, and the possible improvement of the same order.

I suspect you are misunderstanding the purpose of these, and without seeing some example of the code you are using, it is difficult to assess the situation in general.

Ah, sorry, I did not mean library.cpp but LAMMPS as a library itself (i.e., liblammps.a). I understand, my question is indeed out of scope.

I agree, it is premature optimization, as at the moment the slowest part is a run 0 (around 100 ms). My goal was to run the simulation and as much of the processing as possible on the GPU, without synchronizing too much between host and device.

I’ll study the code and its flow of control more before going further.

Thanks for your help.

Now here is something with potential for significant optimization:

  1. You should be using run 0 post no to disable the post-run summary. This not only avoids some meaningless calculations, but also a lot of output. This should always work.
  2. You should check whether the changes you make between calls to run 0 allow reducing the “setup()” step to its minimum with run 0 pre no post no. This avoids a redundant force computation and can make the invocation almost twice as fast.
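Since the simulation is driven through the library interface, both variants above are just strings passed to lammps_command() (here lmp stands for the handle obtained from lammps_open() or lammps_open_no_mpi()):

```cpp
// full setup, but no post-run statistics and output
lammps_command(lmp, "run 0 post no");

// additionally skip the redundant force computation during setup();
// only valid if nothing was changed that invalidates the previous forces
lammps_command(lmp, "run 0 pre no post no");
```

The second form should only be used when the state between runs is unchanged in the ways the run documentation requires.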

Please see the run command documentation in the LAMMPS manual for more details and explanations.

Thanks! I was already doing run 0 post no to reduce the amount of output. I did not check pre no, however, so I’ll look into it.