[lammps-users] lammps and namd on gpu for small systems

Dear users,
would LAMMPS perform better than NAMD on a single GPU for small
systems (<100,000 atoms; GPU: 9800 GTX, CPU: Opteron 2218 @ 2.6 GHz)? so
far with NAMD i got only a minor improvement from using the GPU, so i
thought maybe LAMMPS could perform better.

If LAMMPS has a GPU-ized version of the potential you
want to use (there are only a couple currently), then the
pairwise part should typically speed up by 3-5x. But that
means your overall speed-up will be less than that, depending on
how much time you spend on everything else (e.g. PPPM).
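a back-of-the-envelope sketch of why the overall speed-up is smaller than the pairwise one (the 70% / 4x figures below are made up for illustration, not measured):

```python
def overall_speedup(pair_fraction, pair_speedup):
    # Amdahl-style estimate: only the pairwise part is accelerated;
    # everything else (bonds, PPPM, I/O, ...) still runs at CPU speed.
    return 1.0 / ((1.0 - pair_fraction) + pair_fraction / pair_speedup)

# if 70% of a timestep is pairwise work and the GPU makes it 4x faster,
# the whole run only gets about 2.1x faster:
print(overall_speedup(0.7, 4.0))
```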


so, what must be so special about the simulated system when people get
a 10-30x speedup?


there are two reasons for high speedups: 1) not having to move data between
host memory and GPU memory (that is done in a code like HOOMD and the
LAMMPS-CUDA branch of GPULAMMPS), and 2) high algorithmic complexity,
i.e. lots of floating-point operations to compute the forces. this is
true for the
Gay-Berne potentials in LAMMPS. they are _very_ slow compared to lennard-jones
on the cpu, and thus make better use of the incredible compute power of GPUs
(a Tesla G200-type GPU has about 100x the floating-point capability of a CPU,
and you can get even higher speedups if you can use the special hardware
versions of sine, cosine, square root, and exponential).

NAMD-type bio-forcefields use functional forms that were intentionally chosen
to be simple and thus fast (this is why you have lennard-jones and not morse,
for example). getting great speedups with these on GPUs is hard work.

you have to be careful with benchmark numbers; they are easy to rig.
for example,
if you don't use neighbor or cell lists, you can easily get a 75x
speedup on the GPU,
but that is a pointless comparison: for a large enough system, a
code that uses such lists would already be faster on the CPU, due to its
O(N) scaling instead of O(N**2).
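as a toy illustration of that last point, here is a sketch (plain python,
2D, no periodic boundaries; all names are mine) of a brute-force O(N**2)
pair search vs. an O(N) cell list -- benchmarking a GPU running the first
against a CPU running the second is rigged from the start:

```python
import random

def neighbors_bruteforce(pos, rc):
    # O(N**2): test every pair of atoms against the cutoff rc.
    rc2 = rc * rc
    pairs = set()
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            if dx * dx + dy * dy < rc2:
                pairs.add((i, j))
    return pairs

def neighbors_celllist(pos, rc):
    # O(N): bin atoms into square cells of edge rc, then test only
    # pairs drawn from the same or adjacent cells.
    rc2 = rc * rc
    bins = {}
    for idx, (x, y) in enumerate(pos):
        bins.setdefault((int(x // rc), int(y // rc)), []).append(idx)
    pairs = set()
    for (cx, cy), members in bins.items():
        for ox in (-1, 0, 1):
            for oy in (-1, 0, 1):
                for i in members:
                    for j in bins.get((cx + ox, cy + oy), ()):
                        if i < j:
                            dx = pos[i][0] - pos[j][0]
                            dy = pos[i][1] - pos[j][1]
                            if dx * dx + dy * dy < rc2:
                                pairs.add((i, j))
    return pairs
```

both return the same pair set, but the cell list touches only a constant
number of candidate neighbors per atom, so its cost grows linearly with N.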