CUDA package and bonds

Steve or maybe Christian,

Coming back to your comment: would it be possible to add OpenMP parallelization to the bonded interactions computed on the CPU in the USER-CUDA package? That would still be one MPI process with N threads in the parallel loop.
The advantage is that the number of CPU cores per node is normally higher than the number of GPUs, so using only one process on the CPU is suboptimal. I know the GPU package does something similar to what I am suggesting, but this hack with a parallel loop doesn't require any decomposition, neighbor lists, etc. I have just found that the bonded time dominates when I run single-MPI USER-CUDA jobs, since all of the P3M is moved to the GPU with cuFFT.

> Steve or maybe Christian,
>
> Coming back to your comment: would it be possible to add OpenMP
> parallelization to the bonded interactions computed on the CPU in the
> USER-CUDA package? That would still be one MPI process with N threads
> in the parallel loop.

there is no point in adding multi-threading to the USER-CUDA package;
however, you can use the styles from USER-OMP together with USER-CUDA.
there are a number of ways to distribute the work between GPU and CPU.
which one is fastest depends on your system and on the ratio of GPUs
to CPU cores.
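One such split, sketched as a hypothetical input fragment: pair and long-range electrostatics on the GPU via USER-CUDA suffix styles, bonded terms multi-threaded on the CPU via USER-OMP suffix styles. The style names, cutoffs, and thread count below are illustrative assumptions; check the documentation for your LAMMPS version.

```
# assumed fragment -- GPU pair/kspace, CPU-threaded bonded terms
package      omp 4                        # 4 OpenMP threads per MPI process
pair_style   lj/cut/coul/long/cuda 10.0   # USER-CUDA pair style
kspace_style pppm/cuda 1.0e-4             # USER-CUDA long-range solver
bond_style   harmonic/omp                 # USER-OMP bonded styles
angle_style  harmonic/omp
```

USER-CUDA itself would still need to be enabled on the command line (e.g. with `-c on`), so this only changes which part of the force computation each package handles.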

> The advantage is that the number of CPU cores per node is normally
> higher than the number of GPUs, so using only one process on the CPU
> is suboptimal. I know the GPU package does something similar to what
> I am suggesting, but this hack with a parallel loop doesn't require
> any decomposition, neighbor lists,

multi-threading MD loops is not as simple as you make it sound, because
of newton's third law: when a pair or bond is processed once, the force
is added to *both* atoms, so two threads can write to the same atom's
force. that is why on GPUs you use force kernels that don't use it (and
thus do twice the work), because then each thread writes only its own
atom and there are no memory access conflicts.

> etc. I have just found that the bonded time dominates when I run
> single-MPI USER-CUDA jobs, since all of the P3M is moved to the GPU
> with cuFFT.

the performance impact of the FFT is often overestimated.

axel.