[lammps-users] A question about LAMMPS GPU library

Dear LAMMPS users,

Hello! Sorry to bother you. I am planning to do some optimization work on the LAMMPS GPU library, so I read the publication by Michael W. Brown: Implementing Molecular Dynamics on Hybrid High Performance Computers - Short-Range Forces. He mentions that, because time integration represents a small fraction of the workload, they focused their initial work on porting the neighbor and force routines for acceleration. But all data must be transferred from the host to the device memory and vice versa on each timestep, which spends much time on memcpy. Why didn't he move the time integration part (the Verlet algorithm) to the GPU as well? He mentions the advantages in the paper, but could anyone explain them to me in more detail? What are the difficulties of such a modification?

Thanks and best,

Yutong Wu

> Dear LAMMPS users,
>
> Hello! Sorry to bother you. I am planning to do some optimization work on the LAMMPS GPU library, so I read the publication by Michael W. Brown: Implementing

Before working on that code, you had better first coordinate with the developers of the GPU package, as there is currently some significant refactoring ongoing.

> Molecular Dynamics on Hybrid High Performance Computers - Short-Range Forces. He mentions that, because time integration represents a small fraction of the workload, they focused their initial work on porting the neighbor and force routines for acceleration. But all data must be transferred from the host to the device memory and vice versa on each timestep, which spends much time on memcpy. Why didn't he move the time integration part (the Verlet algorithm) to the GPU as well? He mentions the advantages in the paper, but could anyone explain them to me in more detail? What are the difficulties of such a modification?

There are benefits and disadvantages to trying to keep the data on the GPU. With the current approach, for example, the accelerated styles remain fully compatible with the rest of LAMMPS, and no porting to the GPU is needed for a large part of the LAMMPS code. Whether keeping data on the GPU is faster also depends strongly on the problem at hand and the hardware you are running on.

Please note that not all of the data needs to be moved to the GPU: only parameters, positions, and (in some cases) velocities are sent to the GPU, and the forces, energies, and stresses are retrieved. Please also note that for a parallel computation, data needs to be exchanged via MPI between MPI ranks at every step, and thus has to be transferred anyway. Furthermore, current hardware has a different balance between the compute power of the host and the GPUs than the typical hardware available when the GPU package was conceived. With the current division of work, the GPU package can take maximum advantage of the host CPU cores.
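To make the division of work concrete, here is a schematic sketch (in Python, not actual LAMMPS or GPU code) of a velocity-Verlet timestep split the way the GPU package splits it: integration on the host, forces "on the device", with only positions going down and only forces coming back each step. The names `to_device`, `from_device`, and `device_forces` are hypothetical stand-ins for host/device copies and the force kernel.

```python
import numpy as np

def to_device(a):
    # stand-in for a host -> device copy (e.g. cudaMemcpy)
    return a.copy()

def from_device(a):
    # stand-in for a device -> host copy
    return a.copy()

def device_forces(x):
    # stand-in for the GPU neighbor/force kernel;
    # here just a simple harmonic force for illustration
    return -x

def run(x, v, m=1.0, dt=0.01, nsteps=100):
    # initial forces from the "device"
    f = from_device(device_forces(to_device(x)))
    for _ in range(nsteps):
        # --- host: first half of velocity Verlet ---
        v += 0.5 * dt * f / m
        x += dt * v
        # --- device: recompute forces from the new positions;
        #     only positions are sent down, only forces come back ---
        f = from_device(device_forces(to_device(x)))
        # --- host: second half of velocity Verlet ---
        v += 0.5 * dt * f / m
    return x, v
```

For the harmonic toy force, this loop conserves total energy to high accuracy, which illustrates that the integration itself is cheap; the per-step cost of interest is the pair of copies wrapped around the force call.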

Finally, if you are interested in the performance of leaving the data on the GPU, you can try out the KOKKOS package, whose GPU support is built on exactly that premise. KOKKOS also has optimizations for direct GPU-to-GPU data transfer on supported hardware/software when running in parallel. In our benchmarks we find cases where either package is faster. In contrast, the GPU package supports the CUDA Multi-Process Service (MPS), if configured accordingly, which lowers the cost of context switching when oversubscribing the GPU (something that is very effective with the GPU package). There are more details about how to get good performance with either package in the LAMMPS manual section about the accelerator packages.
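For a quick side-by-side comparison, the two packages are selected at run time with different command-line switches. A sketch of typical invocations (assuming a LAMMPS binary built with both packages and an input script `in.lj`; adjust GPU counts to your machine) might look like:

```shell
# GPU package: forces/neighboring on the device, integration on the host
lmp -sf gpu -pk gpu 1 -in in.lj

# KOKKOS package: data stays resident on the GPU between timesteps
lmp -k on g 1 -sf kk -in in.lj
```

Running both on your own problem and hardware is the most reliable way to see which approach wins for your case.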

Axel.