[lammps-users] lj/cut/gpu benchmark

Hi Steve and Axel,

We built a computer today with a dual-core 2.9 GHz AMD CPU and a
GeForce GTX 260 gpu card. I installed Fedora Core 10, CUDA and then
LAMMPS. I ran the lammps-27Mar10/examples/melt example and modified the
pair_style in the input file to "lj/cut/gpu one/node 0 2.5" (sketch below). I ran up to
5000 MD steps and compared the loop time of a single cpu run with the gpu
run. The gpu run is 3.6 times faster than the single cpu run.
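
For reference, the change amounts to roughly the following in the melt input
script (a sketch; the pair_coeff line and the original run length are assumed
from the stock example):

pair_style      lj/cut/gpu one/node 0 2.5    # was: pair_style lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5              # unchanged
run             5000                         # extended from the stock example's shorter run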

Is this the typical LJ speedup that I should get when I use lj/cut/gpu?
Furthermore, can I run 2 lammps simulations with 1 gpu card? I tried this
by running two gpu runs, and the loop time of both gpu runs is almost equal
to the loop time of a single cpu run. I guess one lammps simulation will
use up one card. Is there a way to allocate a number of gpu multiprocessors
in one card per simulation?

Thanks,

Jan-Michael Carrillo

jan-michael,

as it so happens, i just put some LAMMPS GPU benchmarks online today.
http://code.google.com/p/gpulammps/wiki/SingleGPUBenchmarkResults

this is using the development branch of the GPU code (hosted externally)
but for all intents and purposes that should be identical to the
official distribution.
i am just about to commit the scripts and inputs that i used to the
svn repository.

> Hi Steve and Axel,
>
> We built a computer today with a dual-core 2.9 GHz AMD CPU and a
> GeForce GTX 260 gpu card. I installed Fedora Core 10, CUDA and then
> LAMMPS. I ran the lammps-27Mar10/examples/melt example and modified the
> pair_style in the input file to "lj/cut/gpu one/node 0 2.5". I ran up to
> 5000 MD steps and compared the loop time of a single cpu run with the gpu
> run. The gpu run is 3.6 times faster than the single cpu run.
>
> Is this the typical LJ speedup that I should get when I use lj/cut/gpu?
> Furthermore, can I run 2 lammps simulations with 1 gpu card? I tried this

i think technically you can do it for as long as the GPU is not set to
GPU-exclusive mode, but i doubt that there will be much gain. on
the contrary. perhaps i should do a couple of tests on that as well...
LAMMPS does not have the level of sophistication in its GPU code
that codes like NAMD have. you definitely cannot use the same GPU
from two MPI tasks.
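
if you want to check whether the card is in exclusive mode, a query along
these lines should show it (a sketch; the exact label in the nvidia-smi
output may differ between driver versions):

nvidia-smi -q | grep -i "compute mode"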

> by running two gpu runs, and the loop time of both gpu runs is almost equal
> to the loop time of a single cpu run. I guess one lammps simulation will
> use up one card. Is there a way to allocate a number of gpu multiprocessors
> in one card per simulation?

this is not how GPUs work. you can only "feed" the whole card. the cards
don't run an operating system, so there is no resource management
except the one that is programmed into the host code. the nvidia driver
will nevertheless serialize all requests. however, if you run a fairly small
system (and 5000 LJ particles qualify), you may get into a situation where
the on-GPU part of one executable is perfectly interleaved with the on-CPU
part of the other and vice versa. but even with NAMD, oversubscription of
GPUs doesn't make much sense. better to get dual-GPU cards (like
the GTX 295). now that the first fermi cards have been introduced, the prices
for this generation of hardware should start to drop.

cheers,
   axel.

> Hi Steve and Axel,

jan-michael,

some more comments.

> We built a computer today with a dual-core 2.9 GHz AMD CPU and a
> GeForce GTX 260 gpu card. I installed Fedora Core 10, CUDA and then
> LAMMPS. I ran the lammps-27Mar10/examples/melt example and modified the
> pair_style in the input file to "lj/cut/gpu one/node 0 2.5". I ran up to
> 5000 MD steps and compared the loop time of a single cpu run with the gpu
> run. The gpu run is 3.6 times faster than the single cpu run.

the speedup depends a lot on how fast you get the code to run
on the CPU. for 10000 steps with the CPU version on a xeon X5520 (2.27GHz)
i need 42.3 seconds; using the lj/cut/opt pair style it is even only 33.4
seconds. on a single GPU of a tesla S1070 it is 13.3 seconds instead
(using the Loop time number). that results in a speedup of roughly 3.2x
(or 2.5x); for larger systems it is better.

if i run instead on a GeForce GTX 285 i get a time of 11.8 seconds,
resulting in a 3.6x (2.8x) speedup.
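
spelled out, those speedups are just the ratios of the Loop times:

42.3 s / 13.3 s ≈ 3.2   (lj/cut     vs. one GPU of the tesla S1070)
33.4 s / 13.3 s ≈ 2.5   (lj/cut/opt vs. one GPU of the tesla S1070)
42.3 s / 11.8 s ≈ 3.6   (lj/cut     vs. the GTX 285)
33.4 s / 11.8 s ≈ 2.8   (lj/cut/opt vs. the GTX 285)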

> Is this the typical LJ speedup that I should get when I use lj/cut/gpu?

you should get a little bit better speedup when going
to much larger systems. with the GTX 285 i get up to
6x speedup for 130000 atoms.
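
for reference, 130000 atoms is roughly what you get by scaling up the melt
box while keeping the same fcc lattice, e.g. (a sketch, not necessarily the
exact benchmark input):

region          box block 0 32 0 32 0 32    # 4 atoms/cell * 32^3 cells = 131072 atoms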

> Furthermore, can I run 2 lammps simulations with 1 gpu card? I tried this
> by running two gpu runs, and the loop time of both gpu runs is almost equal
> to the loop time of a single cpu run. I guess one lammps simulation will

that is very strange. i cannot reproduce this.
did you start the jobs at *exactly* the same time?
if i oversubscribe a GPU, my jobs slow down a lot.
only if i submit to two different GPUs do i get about
the same performance, for as long as i don't overload
the PCIe bus.

cheers,
   axel.

Hi axel,

I had two terminal windows open and tried my best to start them at the
same time.

Here are the loop times:
1 lammps gpu run - 9.3 s
1 lammps cpu run - 33.39 s
2 lammps gpu runs running at the same time - 42 s, 42 s
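
(For reference: 33.39 s / 9.3 s ≈ 3.6, which matches the speedup quoted
above, and each of the oversubscribed runs at 42 s is more than four times
slower than the single gpu run.)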

The first time I did this I didn't pay much attention to making them run
at exactly the same time, and I got 28 s for both runs when I delayed
starting the second simulation by approx. 5 seconds.

Your observation is right that the gpu runs are considerably slower when
they are oversubscribed to a gpu card.

Thanks again,

Jan-Michael

> Hi axel,
>
> I had two terminal windows open and tried my best to start them at the
> same time.

ouch. why don't you just do this:

./lmp_gpu -in lj.in >& lj-1.out & ./lmp_gpu -in lj.in >& lj-2.out &

and voila, no magic, no racing between two windows needed.
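
if you then want to pull the timings out of both runs afterwards, something
along these lines works in bash (a sketch; the -log switch is only there so
the two jobs don't both write to log.lammps):

./lmp_gpu -in lj.in -log lj-1.log >& lj-1.out &
./lmp_gpu -in lj.in -log lj-2.log >& lj-2.out &
wait                                 # block until both background jobs finish
grep "Loop time" lj-1.out lj-2.out   # LAMMPS prints the total loop time on this line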

> Here are the loop times:
> 1 lammps gpu run - 9.3 s
> 1 lammps cpu run - 33.39 s
> 2 lammps gpu runs running at the same time - 42 s, 42 s

this is with 5000 MD steps, right?

> [...]

> Your observation is right that the gpu runs are considerably slower when
> they are oversubscribed to a gpu card.

ok. for a moment i thought i might have messed something
up in how i ran my tests. ;-)

cheers,
   axel.