CORESHELL potentials implementation on GPUs

Dear all,

I’ve been using the 30 July 2016 version of LAMMPS with Buckingham potentials and the CORESHELL package. Lately some GPUs have become available and I would like to use them for my LAMMPS simulations. As far as I can see, although the Buckingham potentials are implemented for GPUs, the CORESHELL package is not; am I right?

If that is the case, is GPU acceleration for the CORESHELL package coming soon? How difficult is it to write the code for GPU acceleration?

Kind regards,

Toni Macià

Dear all,

I’ve been using the 30 July 2016 version of LAMMPS with Buckingham potentials and the CORESHELL package. Lately some GPUs have become available and I would like to use them for my LAMMPS simulations. As far as I can see, although the Buckingham potentials are implemented for GPUs, the CORESHELL package is not; am I right?

If that is the case, is GPU acceleration for the CORESHELL package coming soon?

please see these two pull requests, which were merged into the development branch recently and will be included in the next patch release.

https://github.com/lammps/lammps/pull/926

https://github.com/lammps/lammps/pull/958

How difficult is it to write the code for GPU acceleration?

since the difference between the buckingham and born potentials is small, it should be straightforward to add that support based on the recently contributed files. i am copying Trung, the contributor of the two pull requests; perhaps he can give additional advice or may even be willing to contribute a buck/cs/gpu potential and its coulomb variants as well.

regards,
axel.

Hi Toni,

below are step-by-step instructions for adding new pair styles to the GPU package (for cases where pair table/gpu does not apply). I encourage you to try to implement the buck/coul/long/cs/gpu version based on what is done with born/coul/long/cs/gpu. I agree with Axel that the changes should be straightforward.

If you have any issues with the compilation or runtime errors, please post them here and I can take a look.

Cheers,
-Trung

There are two places where you need to make additions: the GPU library in lib/gpu (i.e. libgpu.a) and the /gpu styles in src/GPU. Let’s say you have already implemented a pair style class named PairFoo. Now you want to add a new class PairFooGPU for GPU acceleration.

  1. Addition to the GPU package, i.e. lib/gpu

You will need to add/implement four source files:

lal_foo.h: the header for the class Foo
lal_foo.cpp: the implementation of the class Foo
lal_foo.cu: the GPU kernel(s) for the force computation, which mirrors what you have in PairFoo
lal_foo_ext.cpp: an instance of the Foo class and the functions exported into the GPU library, libgpu.a; these functions are invoked to initialize the Foo instance, run the force computation, and clean up the allocated memory

A good place to start is the corresponding files for the Gauss class in lib/gpu. You will see how the per-type arrays are declared and allocated (lal_gauss.h and lal_gauss.cpp), how the kernels are implemented (lal_gauss.cu) and how the exported functions are defined (lal_gauss_ext.cpp).
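For orientation only, the interface exported by lal_foo_ext.cpp usually boils down to a single static instance of the Foo class plus a handful of free wrapper functions. The sketch below is purely illustrative; the names (foo_gpu_init and so on) and the argument lists are placeholders, and the real signatures should be copied from lal_gauss_ext.cpp:

#include <cstdio>   // for FILE, passed to the init wrapper

// a single instance of the templated Foo class is shared by the wrappers,
// e.g. something like: static Foo<PRECISION,ACC_PRECISION> FOOMF;

// wrappers called from the PairFooGPU class in src/GPU:
int foo_gpu_init(const int ntypes, double **cutsq, double **host_coeff1,
                 double **host_coeff2, double *special_lj, const int inum,
                 const int nall, const int max_nbors, const int maxspecial,
                 const double cell_size, int &gpu_mode, FILE *screen);
void foo_gpu_clear();      // free device memory held by the Foo instance
void foo_gpu_compute(const int ago, const int inum_full, const int nall,
                     double **host_x, int *host_type, int *ilist, int *numj,
                     int **firstneigh, const bool eflag, const bool vflag,
                     const bool eatom, const bool vatom, int &host_start,
                     const double cpu_time, bool &success);
double foo_gpu_bytes();    // device memory usage, reported by the pair style

The corresponding gauss functions in lal_gauss_ext.cpp show the exact parameters these wrappers need.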

Finally, you need to modify the Nvidia.makefile you are using to build libgpu.a (assuming you are using the CUDA toolkit) so that it includes the newly added files. Again, you can look for the lines in Nvidia.makefile that contain lal_gauss* to see how those files are built. In case you want to compile with OpenCL, Opencl.makefile is where to look and Makefile.linux_opencl is the relevant Makefile.

Now, you can rebuild the GPU package via: make -f Makefile.your_machine

  2. Addition to src/GPU

Once you have successfully built the GPU package with the new Foo class from the previous step (check that lal_foo.o and lal_foo_ext.o are included in libgpu.a), it’s time to create an entry in src/GPU that calls the external functions defined in lal_foo_ext.cpp. You will need to create the PairFooGPU class (pair_foo_gpu.h and pair_foo_gpu.cpp). Again, pair_gauss_gpu.h and pair_gauss_gpu.cpp are good examples to start with.
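As an illustration, a pair_foo_gpu.h modeled on the gauss example would look roughly like the sketch below. The class, style, and wrapper-function names are placeholders; the actual set of members should be taken from pair_gauss_gpu.h:

#ifdef PAIR_CLASS
PairStyle(foo/gpu,PairFooGPU)
#else

#ifndef LMP_PAIR_FOO_GPU_H
#define LMP_PAIR_FOO_GPU_H

#include "pair_foo.h"          // the existing CPU pair style PairFoo

namespace LAMMPS_NS {

class PairFooGPU : public PairFoo {
 public:
  PairFooGPU(class LAMMPS *);
  ~PairFooGPU();
  void compute(int, int);      // hands the force computation to the GPU wrapper
  void init_style();           // passes the per-type coefficients to the init wrapper
  double memory_usage();       // adds the GPU library's byte count to the CPU-side usage

 private:
  int gpu_mode;
  double cpu_time;
};

}
#endif
#endif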

Finally, you need to modify Install.sh in src/GPU so that the newly added pair style gets installed when you run “make yes-gpu” or “make package-update” from src/. The lines to be added should look similar to what is done for pair_gauss_gpu.cpp and pair_gauss_gpu.h.

Once you are done implementing the PairFooGPU class, you can copy the source files into src/ and rebuild LAMMPS with the updated GPU package:

make yes-gpu
make your_machine

  3. Modification to src/GPU/Install.sh

You can make changes to the Install.sh script so that the newly added GPU pair style can be installed/updated/uninstalled via make package-update or make no-gpu. Again, take a look at how this is done for the existing styles.

Dear all,

Sorry for the slow reply; the fact is that I am not allowed to compile anything on the GPUs myself and have to go through our technicians, which means it takes some time until each compilation test is done.

I think I’ve managed to do the first part of the implementation (the addition in lib/gpu), but we are running into problems when compiling LAMMPS. I’m actually a bit confused, because the first two error lines are:

…/pair_buck_coul_long_cs_gpu.cpp(40): error: name must be a namespace name
using namespace MathConst;
^

…/pair_buck_coul_long_cs_gpu.cpp(84): error: name followed by “::” must be a class or namespace name
PairBuckCoulLongCSGPU::PairBuckCoulLongCSGPU(LAMMPS *lmp) :

With regard to the first one, I decided to take that line out of the file, as it is not necessary (basically it was there for the Born potentials, but it is unnecessary for the Buckingham), yet I cannot get rid of this message during the compilation.

The second error, I think, was due to a mismatch in the #include line. As far as my knowledge goes, I no longer understand this error.

Since both files have been modified but I am still getting this error, I thought that maybe we had to do a make clean-all before further tests, but the same error message still appears.

What am I doing wrong?

kind regards,

Toni

On Sat, 23 June 2018 at 17:23, Trung Nguyen (<ndactrung@…24…>) wrote:

Dear all,

Sorry for the slow reply; the fact is that I am not allowed to compile anything on the GPUs myself and have to go through our technicians, which means it takes some time until each compilation test is done.

that is nonsense. you can compile code with the CUDA toolkit without having a GPU; i do it all the time. with recent versions of the toolkit, you don’t even have to install the CUDA driver, as it comes with a set of stub libraries in the “lib64/stubs” folder. the CUDA toolkit itself does not require any special privileges.

axel.

I think I’ve managed to do the first part of the implementation (the addition in lib/gpu), but we are running into problems when compiling LAMMPS. I’m actually a bit confused, because the first two error lines are:

…/pair_buck_coul_long_cs_gpu.cpp(40): error: name must be a namespace name
using namespace MathConst;
^

…/pair_buck_coul_long_cs_gpu.cpp(84): error: name followed by “::” must be a class or namespace name
PairBuckCoulLongCSGPU::PairBuckCoulLongCSGPU(LAMMPS *lmp) :

With regard to the first one, I decided to take that line out of the file, as it is not necessary (basically it was there for the Born potentials, but it is unnecessary for the Buckingham), yet I cannot get rid of this message during the compilation.

The second error, I think, was due to a mismatch in the #include line. As far as my knowledge goes, I no longer understand this error.

Since both files have been modified but I am still getting this error, I thought that maybe we had to do a make clean-all before further tests, but the same error message still appears.

What am I doing wrong?

there is likely an error in your header file, where you have not renamed the constructor as needed.
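For illustration: the header needs to declare a class whose name matches the constructor definition in the .cpp exactly, otherwise the “::” error above appears. A minimal sketch, assuming the GPU style derives from the CORESHELL style PairBuckCoulLongCS (adjust the base class to your design):

#include "pair_buck_coul_long_cs.h"   // assumed base class header

namespace LAMMPS_NS {

// this name must match PairBuckCoulLongCSGPU::PairBuckCoulLongCSGPU(...)
// in pair_buck_coul_long_cs_gpu.cpp
class PairBuckCoulLongCSGPU : public PairBuckCoulLongCS {
 public:
  PairBuckCoulLongCSGPU(class LAMMPS *);
};

}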

please note that you have to do a “make package-update” after changing files in the package folder, or manually copy those files to the src folder. the compilation will not pick up the modified files from packages automatically.

axel.

It was indeed that I had not used make package-update.

I was finally able to compile and test it. It is working, but I have noticed, for both the Born potential and the Buckingham one, that there are slight differences as the simulation time increases. For instance, I took the in.coreshell input from the examples/coreshell folder and ran it for a longer time on the CPU and on the GPU; at the last step the results are:

Step TotEng PotEng KinEng Temp Press E_pair E_vdwl E_coul E_long E_bond Fnorm Fmax Volume

0 -635.44099 -675.09865 39.657659 1427 -20613.612 -675.09865 1.6320365 1018.8211 -1695.5518 0 3.4291936e-14 4.4968539e-15 13990.5

15000 -619.07278 -659.48076 40.407973 1453.9985 1774.2922 -662.20249 47.79077 984.73117 -1694.7244 2.7217367 11.643468 1.9733235 13990.5

While for the GPUs:
Step TotEng PotEng KinEng Temp Press E_pair E_vdwl E_coul E_long E_bond Fnorm Fmax Volume

0 -635.44058 -675.09823 39.657659 1427 -20613.572 -675.09823 1.6320365 1018.8215 -1695.5518 0 7.3794785e-14 1.0587018e-14 13990.5

15000 -617.53774 -660.63 43.092262 1550.5872 1368.5937 -663.33114 46.616183 984.9146 -1694.8619 2.7011361 11.273954 2.585788 13990.5

The first step gives almost exactly the same values. I have the same issue with my implementation of the Buckingham potential. What could be the cause of this? Error propagation from single versus double precision, maybe?

Kind regards,

Toni

On Mon, 2 July 2018 at 15:18, Axel Kohlmeyer (<akohlmey@…33…24…>) wrote:

It was indeed that I had not used make package-update.

I was finally able to compile and test it. It is working, but I have noticed, for both the Born potential and the Buckingham one, that there are slight differences as the simulation time increases. For instance, I took the in.coreshell input from the examples/coreshell folder and ran it for a longer time on the CPU and on the GPU; at the last step the results are:

Step TotEng PotEng KinEng Temp Press E_pair E_vdwl E_coul E_long E_bond Fnorm Fmax Volume

0 -635.44099 -675.09865 39.657659 1427 -20613.612 -675.09865 1.6320365 1018.8211 -1695.5518 0 3.4291936e-14 4.4968539e-15 13990.5

15000 -619.07278 -659.48076 40.407973 1453.9985 1774.2922 -662.20249 47.79077 984.73117 -1694.7244 2.7217367 11.643468 1.9733235 13990.5

While for the GPUs:
Step TotEng PotEng KinEng Temp Press E_pair E_vdwl E_coul E_long E_bond Fnorm Fmax Volume

0 -635.44058 -675.09823 39.657659 1427 -20613.572 -675.09823 1.6320365 1018.8215 -1695.5518 0 7.3794785e-14 1.0587018e-14 13990.5

15000 -617.53774 -660.63 43.092262 1550.5872 1368.5937 -663.33114 46.616183 984.9146 -1694.8619 2.7011361 11.273954 2.585788 13990.5

The first step gives almost exactly the same values. I have the same issue with my implementation of the Buckingham potential. What could be the cause of this? Error propagation from single versus double precision, maybe?

there are two major reasons for discrepancies in the first step: 1) using mixed or single precision will result in (slightly; single more, mixed less so) different energies and forces compared to double precision calculations. 2) floating point math is not associative, so the result depends (slightly) on the order of operations; this primarily affects the summing of energies and forces. with all computations in double precision, the difference in energies would be much smaller than what you are seeing. the impact of switching from CPU to GPU should be comparable to using a different number of MPI ranks in an all-CPU calculation. how large the deviation for single/mixed precision is versus all-double depends very much on the initial configuration: for high-energy configurations, the deviation is typically larger than for nicely equilibrated configurations.
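As a toy illustration of the non-associativity point (a standalone example, not LAMMPS code), summing the same three numbers in two different orders already changes the single-precision result:

#include <cstdio>

int main() {
  // same three values, two summation orders
  float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
  printf("%g\n", (a + b) + c);  // prints 1: a and b cancel exactly, then c is added
  printf("%g\n", a + (b + c));  // prints 0: c is rounded away when added to b, then a cancels b
  return 0;
}

In a reduction over thousands of pair interactions, this kind of rounding difference is what makes CPU and GPU energy and force sums differ in the last digits.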

from (minor) differences in the initial step, it is quite normal to see trajectories diverge. MD integrates a system of coupled differential equations, which is a chaotic system and subject to the “butterfly effect”, i.e. even the tiniest of differences (e.g. from rounding/truncation in floating point operations) will result in an exponential divergence of trajectories. typical for a normal atomic system is that results (computed all in double precision, that is) show only a minor divergence (less than the output precision) for the first 1000 MD steps or so. often this holds for longer, but an eventual divergence is unavoidable.

axel.

Hi Toni,

glad that you made it work.

You can recompile the GPU library with all double precision computations/data storage (see CUDA_PRECISION in Makefile.linux.double) and rebuild LAMMPS with the new libgpu.a.

To check whether your code gives results consistent with the CPU runs, you can compare the forces (and pressure) and energies between CPU and GPU runs for various initial configurations on the same number of MPI ranks. For debugging purposes, you can also turn off the optimization flags used to build the GPU library, and for born/coul/long/cs specify “pair_modify table 0” to enforce similar code paths for the GPU and CPU styles.

Along with Axel’s comments on the divergence of trajectories with minor differences (in forces) in the initial configuration, I would compare statistics of the measured quantities (energies, pressures, etc.) from CPU and GPU runs rather than their time evolution, for example to see whether the GPU runs are consistent with the thermodynamic ensemble in use.

Cheers,
-Trung