GPU run does not give the correct Energy

Hi Mike,

You are right. I got it to work by doing the following steps:

1. Updated to the latest NVIDIA driver
2. Updated CUDA to 3.2
3. Deleted the LAMMPS directory that I had untarred from the tarball
4. Untarred the LAMMPS tarball again
5. Built the STUBS library (the dummy MPI library)
6. Built lib/gpu against the STUBS MPI
7. Built LAMMPS with STUBS, FFTW, and GPU

Here are the timings for in.melt (loop time). I used force/neigh.

CPU, 31Mar11 version: 1.63 s
GPU, 31Mar11 version: 0.27 s
GPU, 5Sep10 version (which no longer runs after the steps above): 0.45 s

Setting the fix gpu mode to force gives the same timings as the 5Sep10 version.

I also ran my data and input files that did not run in the 5Sep10
version of LAMMPS because of a cell list error. Here is the URL of
that thread:
I am happy to report that this system ran in the 31Mar11 version, and I
got a speedup of 5.6 (113 s CPU / 20 s GPU) by setting the fix gpu mode
to force/neigh.
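
For reference, the speedups implied by the loop times reported above can be worked out with a short script. This is only a sketch of the arithmetic; the function and variable names are illustrative and not part of LAMMPS.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


# Loop times quoted in this message, in seconds.
melt_cpu = 1.63
melt_gpu_31mar11 = 0.27
melt_gpu_5sep10 = 0.45
own_system_cpu = 113.0
own_system_gpu = 20.0

print(f"in.melt, 31Mar11 GPU: {speedup(melt_cpu, melt_gpu_31mar11):.2f}x")
print(f"in.melt, 5Sep10 GPU:  {speedup(melt_cpu, melt_gpu_5sep10):.2f}x")
print(f"own system, 31Mar11:  {speedup(own_system_cpu, own_system_gpu):.2f}x")
```

So in.melt is roughly 6x faster on the GPU with the 31Mar11 version, compared with about 3.6x for the 5Sep10 version.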

Next I will test my input files for charged systems that use pppm,
with lj/cut/coul/long/gpu. What speedups or benchmarks did you get
for these kinds of systems? And how is pppm implemented? Are the
calculations done on the GPU?

Thanks again.


Glad it worked. CUDA 3.2 uses size_t types for some of the driver API calls versus int in earlier versions. There is supposed to be a check for this in the code, but I never tested mixing 3.2 drivers with 3.0 headers.

Regarding the speedups, this really depends on the hardware. For the version you have, pppm runs on the CPU asynchronously during the GPU force calculation, so the speedup depends on the number of processes sharing the GPU among other things.
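
As a rough illustration of why the overlap matters, here is a toy model in Python. It is not LAMMPS code: it assumes an idealized case where the CPU pppm work and the GPU force work overlap perfectly, and the timings are made-up placeholders, not measurements.

```python
def step_time_serial(t_pppm_cpu, t_force_gpu):
    """Time per step if pppm and the force calculation run one after the other."""
    return t_pppm_cpu + t_force_gpu


def step_time_overlapped(t_pppm_cpu, t_force_gpu):
    """Time per step if the two run concurrently (idealized, perfect overlap).

    The step finishes when the slower of the two finishes.
    """
    return max(t_pppm_cpu, t_force_gpu)


# Hypothetical per-step timings in milliseconds (placeholders only).
t_pppm_cpu = 4.0
t_force_gpu = 3.0

print("serial:    ", step_time_serial(t_pppm_cpu, t_force_gpu), "ms")
print("overlapped:", step_time_overlapped(t_pppm_cpu, t_force_gpu), "ms")
```

In this simplified picture, once the GPU force time drops below the CPU pppm time, the pppm part sets the step time, so a faster GPU alone gives no further gain.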

I will send out a link to update the GPU library with a GPU-accelerated pppm for testing, hopefully Monday. It is done, but I am improving the error reporting to prevent users from using the wrong arch flag, etc., before I send it out.

A speedup number for the rhodopsin benchmark with mixed precision force calculation and double precision pppm calculation is ~3x versus 6 processes on a hex-core Opteron.

- Mike