What's the performance of the Kremer-Grest (KG) model with a DPD thermostat using a GPU on Windows?

Hello everyone!
First of all, thanks to Axel for his earlier reply, which pointed out the very basic problem with my GPU.

I have replaced my GPU; my desktop configuration is:
CPU: Intel Core i7-3770
GPU: GTX 1060 6GB
Motherboard: H61 chipset (Ivy Bridge)

First, I tested the in.lj script from the benchmark directory, but changed the variables x, y, z from 1 to 2, so there are 256K atoms.
If I use the GPU package, I get about a 3x speedup.
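Roughly, the benchmark runs looked like this (a sketch; the executable name lmp, the MPI launcher, and the number of ranks depend on the local build):

mpiexec -np 2 lmp -var x 2 -var y 2 -var z 2 -in in.lj
mpiexec -np 2 lmp -sf gpu -pk gpu 1 -var x 2 -var y 2 -var z 2 -in in.lj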

In my work, I need to use the Kremer-Grest model (DOI: 10.1063/1.458541), which is a coarse-grained polymer model.
There are 808,800 monomers with a chain length of 1200, and the program runs 100 steps with 2 MPI tasks / 1 GPU.

The pair style must be hybrid:
pair_style hybrid/overlay lj/cut 1.12 dpd/tstat 1.0 1.0 1.12 22046
pair_coeff * * lj/cut 1.0 1.0
pair_modify shift yes
pair_coeff * * dpd/tstat 0.5

bond_style fene
bond_coeff * 30.0 1.5 1.0 1.0
The command line is extended with "-sf gpu -pk gpu 1 neigh no".
But there is no acceleration from the GPU (with the GPU the wall time is 35 s; without it, 30 s).
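For completeness, the full launch command was along these lines (a sketch; in.kg_melt is a placeholder for the actual input file name):

mpiexec -np 2 lmp -sf gpu -pk gpu 1 neigh no -in in.kg_melt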

Additionally, the timing breakdown shows that when the GPU is used the pair time decreases, the neighbor time increases, and the communication time is unchanged (shown below).

(1) I have reduced the number of monomers in the simulation, but the problem remains.

(2) If I change the cutoff Rcut from 1.12 to 2.5, there is some acceleration (from 220 s to 150 s).
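Concretely, the test in (2) only changes the cutoffs on the hybrid pair_style line (a sketch, assuming both sub-style cutoffs were raised to 2.5):

pair_style hybrid/overlay lj/cut 2.5 dpd/tstat 1.0 1.0 2.5 22046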

So, what is the main cause of this problem? If I run my program on a workstation with a better GPU, such as a K40, will I get better performance? In other words, is there some limitation in my desktop,
or is there some problem in the DPD algorithm?

Additionally, in the manual the dpd/gpu algorithm is linked to two papers, but the second one, "(Phillips) C. L. Phillips, J. A. Anderson, S. C. Glotzer, Comput Phys Comm, 230, 7191-7201 (2011)",
was actually published in the Journal of Computational Physics (10.1016/j.jcp.2011.05.021), not in CPC.
(It is difficult for me to read this work.)

Sincerely,
Yongjin, Ruan

Here is the breakdown:
No GPU
99.8% CPU use with 2 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

In your hybrid model, pair_style dpd/tstat and bond_style fene are not included in the GPU package, so their calculations were all done on the CPU. The only thing that ran on the GPU was lj/cut. So the pair time was reduced, the bond time stayed roughly the same, and the neighbor time increased, since you now need to construct neighbor lists on both the CPU and the GPU.
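In other words, with the suffix applied, the hybrid line effectively becomes something like the sketch below (only sub-styles that have a GPU version get the /gpu suffix; the rest fall back to their CPU versions):

pair_style hybrid/overlay lj/cut/gpu 1.12 dpd/tstat 1.0 1.0 1.12 22046   # lj/cut on GPU, dpd/tstat on CPU
bond_style fene   # bonded styles are not accelerated by the GPU package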

Changing to a faster GPU should not help much. The key is to reduce the time spent in dpd/tstat and the fene bonds, as well as the neighbor time.

Ray