How to run LAMMPS with a GPU on Windows?

I am very thankful to Axel for replying to my email. I should read the manual
more carefully!

But I still have a problem.

1) Here are the details of my machine:

Windows 7 64-bit; LAMMPS: 32-bit 2016-08-20; MPICH2 1.4.1p1-win-ia32.

The GPU is an Nvidia GeForce GT 620, which is driven by Cuda_8.0.27_windows

here is the problem: you have a "wimpy" GPU!
if you look up the specs, you'll see that it has "only" 96 cuda cores,
a low clock, a slow memory interface and so on.

GPUs used for GPU acceleration typically have around 3000 cuda cores,
higher clock rates, wider and faster clocked memory interfaces.

essentially, what you see is what you have, i.e. a GPU that is slower
than your CPU.

2) I use the LJ benchmark with a modified script: the variables x, y, z are
changed from 1 to 2, so there are 256000 atoms. Additionally, run is changed
from 100 to 1000.
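
As a quick check on the atom count, here is a minimal sketch of the arithmetic. It assumes the stock in.lj layout (an fcc lattice with 4 atoms per unit cell in a box of 20*x by 20*y by 20*z cells); the constant and function names are only illustrative.

```python
# Assumption: the standard LJ benchmark input builds an fcc lattice
# in a box of (20*x) x (20*y) x (20*z) unit cells.
ATOMS_PER_FCC_CELL = 4      # fcc has 4 atoms per unit cell
BASE_CELLS_PER_SIDE = 20    # assumed from the stock in.lj


def lj_atom_count(x, y, z):
    """Return the number of atoms for the given x, y, z scale variables."""
    return (ATOMS_PER_FCC_CELL
            * (BASE_CELLS_PER_SIDE * x)
            * (BASE_CELLS_PER_SIDE * y)
            * (BASE_CELLS_PER_SIDE * z))


default_atoms = lj_atom_count(1, 1, 1)  # unmodified benchmark
scaled_atoms = lj_atom_count(2, 2, 2)   # x, y, z changed from 1 to 2

print("default:", default_atoms)
print("scaled: ", scaled_atoms)
```

Doubling all three variables multiplies the atom count by 8, which matches the 256000 atoms reported in the log.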

I list the timing information:

1] If I use "mpiexec -localonly 2 lmp_mpi -in in.lj":
*********
Loop time of 108.55 on 2 procs for 1000 steps with 256000 atoms

Performance: 3979.751 tau/day, 9.212 timesteps/s
99.9% CPU use with 2 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 86.016 | 86.401 | 86.785 | 4.1 | 79.60
Neigh | 10.463 | 10.483 | 10.502 | 0.6 | 9.66
Comm | 3.3985 | 3.83 | 4.2614 | 22.0 | 3.53
Output | 0.0015256 | 0.0015423 | 0.0015591 | 0.0 | 0.00
Modify | 6.2888 | 6.3092 | 6.3296 | 0.8 | 5.81
Other | | 1.525 | | | 1.41
......
Total # of neighbors = 9595043
Ave neighs/atom = 37.4806
Neighbor list builds = 50
Dangerous builds not checked
Total wall time: 0:01:48
******************
2] If I use "mpiexec -localonly 2 lmp_mpi -in in.lj -sf gpu -pk gpu 1":
******************
Loop time of 113.439 on 2 procs for 1000 steps with 256000 atoms

Performance: 3808.230 tau/day, 8.815 timesteps/s
42.9% CPU use with 2 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 99.584 | 99.723 | 99.863 | 1.4 | 87.91
Neigh | 6.6408e-005| 6.8823e-005| 7.1238e-005| 0.0 | 0.00
Comm | 5.9677 | 6.1252 | 6.2828 | 6.4 | 5.40
Output | 0.0018621 | 0.028918 | 0.055975 | 15.9 | 0.03
Modify | 5.9899 | 5.9965 | 6.0032 | 0.3 | 5.29
Other | | 1.564 | | | 1.38
.....
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds not checked

Please see the log.cite file for references relevant to this simulation

Total wall time: 0:01:54
******************
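
To put the two logs side by side, here is a small sketch that recomputes the throughput from the loop times and compares Pair plus Neigh between the runs. The numbers are copied from the two timing breakdowns; the 0.005 tau timestep is an assumption (the LJ-units default), and the variable names are only illustrative. Summing Pair and Neigh assumes that the GPU run accounts its neighbor list builds under Pair, which the near-zero Neigh row suggests.

```python
STEPS = 1000
TIMESTEP_TAU = 0.005      # assumption: default timestep in LJ units
SECONDS_PER_DAY = 86400

# Loop time and average Pair/Neigh times (seconds) from the two logs.
runs = {
    "cpu": {"loop": 108.55, "pair": 86.401, "neigh": 10.483},
    "gpu": {"loop": 113.439, "pair": 99.723, "neigh": 6.8823e-05},
}


def throughput(loop_seconds, steps=STEPS):
    """Return (timesteps per second, tau per day) for one run."""
    steps_per_s = steps / loop_seconds
    tau_per_day = steps_per_s * TIMESTEP_TAU * SECONDS_PER_DAY
    return steps_per_s, tau_per_day


summary = {}
for name, t in runs.items():
    steps_per_s, tau_per_day = throughput(t["loop"])
    summary[name] = {
        "steps_per_s": steps_per_s,
        "tau_per_day": tau_per_day,
        # Force computation plus neighbor list work, wherever it is counted.
        "pair_plus_neigh": t["pair"] + t["neigh"],
    }
    print(f"{name}: {steps_per_s:.3f} timesteps/s, "
          f"{tau_per_day:.1f} tau/day, "
          f"Pair+Neigh = {summary[name]['pair_plus_neigh']:.3f} s")

# Ratio of GPU-run loop time to CPU-only loop time (> 1 means slower).
slowdown = runs["gpu"]["loop"] / runs["cpu"]["loop"]
print(f"GPU run takes {slowdown:.3f}x the CPU-only loop time")
```

Under these assumptions the combined Pair plus Neigh time is only slightly higher in the GPU run, so the gap between the two runs is smaller than the Pair rows alone suggest.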
Pair time is longer!

yes, because your GPU is so slow. keep in mind that it also has to
build the neighbor list.

Also, the log says "42.9% CPU use with 2 MPI tasks x 1 OpenMP threads". There
is a sentence in the manual: "it should be close to 100% times the number of
OpenMP threads (or 1). Lower numbers correspond to delays due to file I/O or
insufficient thread utilization."

this applies to CPU-only runs. when you add GPUs to the mix, things
get complicated. the 50% utilization is an indication of your CPU
having to wait for the GPU.

axel.