I am a Master’s student in Computer Engineering and I am modifying pair_lj_cut.cpp to test how it would work using the GPU via openMP.
So I need some of you to assure me what sizes are the arrays that are used in the compute method.
I have the following:
firstneigh surely does not have those sizes or at least I think so, it is the one that is giving me problems loading the data on the GPU. And of the others, I would also like you to correct me on those that are not correct (I think array f doesn’t have that size, and the others I don’t know).
Thank you very much in advance.
OpenMP target mode should already be supported in LAMMPS through its KOKKOS package. You have to check on the Kokkos library homepage how well that backend is supported.
With KOKKOS (as well as the GPU package) also the neighbor list is created on the GPU which has significant performance benefits, since not only the neighbor list build can be parallelized, but also because the neighbor lists are among the largest memory consumers in classical MD and thus they incur significant overhead to be downloaded to the GPU. You can easily see the difference with the GPU package, since that has a package option to build the neighbor list on the host (not all advanced neighbor list features are GPU accelerated).
Also, the standard, half-neighbor list version of a pairwise potential, has a race condition when updating forces, energies, and virial in parallel, so you may want to use a version with a full neighbor list to avoid the race with the force update. The race due to calling ev_tally() within each inner loop iteration still needs to be addressed. A full neighbor list lj/cut version is not part of the LAMMPS distribution, but I recently posted a such a version that I created for testing purposes in this forum: Non-symmetric pair style - is it possible? - #6 by akohlmey
Here are my corrections:
the firstneigh list, is not a regular matrix but a list of pointers of varying size into a list of memory pages created by the MyPage template class
Most per-atom data is allocated to Atom::nmax dimensions (which is the maximum of nall = Atom::nlocal + Atom::nghost across all MPI processes), since often also information from ghost atoms needs to be accessed. Also, please note that
inum <= Atom::nlocal.
depending on whether the value of Force::newton_pair is 1 or 0
I am very grateful for the answer and sorry for the delay in answering you. I’m going down a bit because it’s my final Master’s thesis in Engineering and there are problems everywhere.
In the end, if it cannot be implemented or does not achieve good performance, it will not be a failure as long as all the problems found are sufficiently documented.
I have started using LAMMPS since this job so I have very limited knowledge about the simulator.
The goal is to offload with openMP but using absolutely no pure CUDA code.
Actually, your indication about the size of the arrays has been fundamental to be able to intuit the problem of loading data. I had a problem with the irregular array that you told me about firstneigh. Luckily I was able to get around this problem using unified memory on the system (I’m using Power9 + NVidia V100 nodes), otherwise I don’t think it’s possible without having to use CUDA.
I find it very interesting if there is an alternative to the race conditions that can occur with the forces update. I have seen your code that you tell me about but it seems to me that I have too little knowledge to see what would be necessary to modify in LAMMPS to carry it out.
All help will be well received.
Thank you very much for the help and greetings!
P.S: I have used the translator for the writing, sorry to anyone who sees errors in the text, but surely the translator has done a better job than if I had had to write it myself.
Honestly, I don’t know what additional assistance I can provide beyond what I have already written. What you are planning to do is non-trivial and particularly difficult considering your limited exposure to LAMMPS.
This is what is done in LAMMPS with the KOKKOS package. The Kokkos library is an abstraction of several accelerator backends including OpenMP target mode.
Good morning again,
now you will see how you are going to help me a lot. Also, I’m going to see how wrong I was.
The maximum time consumption in lammps actually occurs when creating the neighbor list, right? Thus, the compute() method doesn’t really consume any resources compared to creating the list.
No. For typical systems, the neighbor list creation takes 20% or less of the total time. Just look at the timing summary that LAMMPS prints at the end of a run.
Please have a look at 4.3. Screen and logfile output — LAMMPS documentation