LAMMPS performance on CUDA - single vs. double precision

Dear LAMMPS users,

Below are the loop times we measured with the LAMMPS single and double precision benchmarks. The input file is in.lj.cuda, taken from the bench/GPU folder of the LAMMPS distribution.

#Atoms   Steps   single_looptime(s)   double_looptime(s)
256      100     0.10108              0.102101
2048     100     0.111682             0.110275
16384    100     0.139425             0.144047
131072   100     0.393052             0.387888

256      1000    0.899185             0.906075
2048     1000    1.02966              1.03671
16384    1000    1.3074               1.31008
131072   1000    3.7608               3.73454

256      10000   9.04168              8.99344
2048     10000   10.2678              10.2985
16384    10000   12.9105              13.1372
131072   10000   38.6626              39.0364

We tried to compare these results with the results published on the LAMMPS website:

http://lammps.sandia.gov/bench/gpu.desktop.lj.single.jpg
http://lammps.sandia.gov/bench/gpu.desktop.lj.double.jpg

The graphs show a difference between the single and double precision results. For example, the peak performance is approximately 40 million atom-timesteps per second for single precision and approximately 25 million for double precision. In our results, however, the two are almost the same. How is this possible?
We are sure that both the single and double precision builds were compiled correctly.
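
To put our loop times in the same units as the graphs, the conversion can be sketched as below. This is only a rough sketch: it assumes the plotted metric is simply atoms times steps divided by loop time, and the function name atom_steps_per_second is ours, not something from LAMMPS. The loop times are the 131072-atom rows from the table above.

```python
# Loop times in seconds, copied from the 131072-atom rows of the table above.
# Key: (atoms, steps) -> (single_looptime, double_looptime)
loop_times = {
    (131072, 100): (0.393052, 0.387888),
    (131072, 1000): (3.7608, 3.73454),
    (131072, 10000): (38.6626, 39.0364),
}


def atom_steps_per_second(atoms, steps, loop_time):
    """Throughput in millions of atom-timesteps per second.

    Assumption: the benchmark metric is atoms * steps / loop_time.
    """
    return atoms * steps / loop_time / 1e6


for (atoms, steps), (single, double) in loop_times.items():
    s = atom_steps_per_second(atoms, steps, single)
    d = atom_steps_per_second(atoms, steps, double)
    # A ratio close to 1.0 means single and double ran at the same speed.
    print(f"{atoms} atoms, {steps} steps: "
          f"single {s:.2f}, double {d:.2f}, single/double {s / d:.3f}")
```

Under that assumption, both builds come out at roughly 33 to 35 million atom-timesteps per second for the largest system, with a single/double ratio within about 1% of 1.0.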

It is also hard to read exact values off the graphs above. A comparison would be easier if the results were available in tabular form. Are they available anywhere?

Some details about our setup:

LAMMPS version: 30-Aug-2012
GPU Card : nVidia Tesla M2090
CUDA version : 4.0

Thanks

sge_sub.sh (304 Bytes)

in.lj.cuda (455 Bytes)

lammps_cuda_single_double (436 Bytes)

Mike can possibly comment. It doesn’t look you
ran the same case twice to me.

Steve

On Titan, with a Tesla K20X, I do see that single and double are more similar for some simulations, because the cast time for data transfer becomes significant.

For your setup, however, I do not think this is expected. Can you send the screen output from both builds for just a single run, e.g. 131072 atoms, 100 steps? The -screen command-line option can be used to send the screen output to a file. Thanks. - Mike
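
Once the screen output is in a file, the timing summary can be pulled out of it with something like the sketch below. It assumes the summary line has the form "Loop time of T on P procs for S steps with N atoms"; check that against your own output. The helper name parse_loop_line is hypothetical, and the sample line is built from the 131072-atom, 100-step single precision entry in the table, with the processor count made up for illustration.

```python
import re

# Assumed shape of the LAMMPS timing summary line; verify against real output.
LOOP_RE = re.compile(
    r"Loop time of\s+(?P<time>[0-9.eE+-]+)\s+on\s+(?P<procs>\d+)\s+procs"
    r".*?\s+for\s+(?P<steps>\d+)\s+steps\s+with\s+(?P<atoms>\d+)\s+atoms"
)


def parse_loop_line(text):
    """Return loop time, procs, steps and atoms from screen output, or None."""
    match = LOOP_RE.search(text)
    if match is None:
        return None
    return {
        "time": float(match.group("time")),
        "procs": int(match.group("procs")),
        "steps": int(match.group("steps")),
        "atoms": int(match.group("atoms")),
    }


# Illustrative line only; the processor count here is invented.
sample = "Loop time of 0.393052 on 1 procs for 100 steps with 131072 atoms"
print(parse_loop_line(sample))
```

Running this on the single and double screen files side by side makes it easy to confirm that the two runs really differ in more than the reported time.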

Sorry - that was a typo. I meant to say,
it looks like you ran the same case twice.
I.e. you did not run single vs double precision.

Steve

lammps_single_32_100 (1.11 KB)

lammps_double_32_100 (1.11 KB)

32_100st_single_double.tar.gz (1.96 KB)

Sorry, I thought you were using the GPU package.

Maybe Christian is a better contact for help with the USER-CUDA package. - Mike

ok - maybe Christian can take a look, since this is USER-CUDA.

Steve