Pair colloid and GPU, poor performance. Help

Hello everybody,

I am using the GPU package, but with the pair potential colloid I have not had good results; attached is a .txt file that contains some results from my tests. I am using the scripts in.colloid and in.melt to compare the performance of pair colloid and pair lj. When I use the GPU, the performance with pair colloid is very poor. Maybe I am not executing the command correctly, so I ask for your help, please.

The card I am using is an Nvidia Tesla Kepler K40c. More details appear in the attachment.

I need to use pair colloid efficiently. Also, when I build the neighbor list on the card, it reports that the atoms have ave neighs/atom = 0.

hardware_and_test.txt (6.53 KB)

You appear to be running with 900 atoms? I don't think you will get good performance with pair lj/cut on a GPU with < 30000 atoms. So can you try many more colloid particles?
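
For example, one way to get many more colloid particles is to tile the original box with the replicate command; the counts below are only illustrative and are not taken from your input:

    # illustrative only: tile the original (2d) box 4x4 times,
    # giving 16x as many particles without changing the density
    replicate 4 4 1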

Steve

Hello everybody,

I am using the GPU package, but with the pair potential colloid I have not
had good results; attached is a .txt file that contains some results from
my tests. I am using the scripts in.colloid and in.melt to compare the
performance of pair colloid and pair lj. When I use the GPU, the
performance with pair colloid is very poor. Maybe I am not executing the
command correctly, so I ask for your help, please.

if you believe that GPUs are magic things that make everything
massively faster, then you have become (yet another) victim of
successful PR. GPUs are designed to process massive numbers of
independent work units concurrently (e.g. computing the color of many
pixels). only if you have a very large number of work units, and they
have no data dependencies, will you see a speedup, because the latency
of offloading the work to the GPU and the considerably slower speed of
the individual processing units (compared to CPU cores) have to be
offset by the massive parallelism. if your problem doesn't allow
creating such a large number of work units, you won't see much of a
speedup, if not a slowdown.
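
for reference, a typical command line for offloading a supported pair
style with the GPU package looks roughly like the sketch below; the
executable name, rank count, and input file are placeholders, not taken
from this thread:

    # 4 MPI ranks sharing one GPU; -sf gpu applies the gpu suffix
    # to supported styles, -pk gpu 1 configures the GPU package
    mpirun -np 4 lmp -sf gpu -pk gpu 1 -in in.colloid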

The card I am using is an Nvidia Tesla Kepler K40c. More details appear
in the attachment.

I need to use pair colloid efficiently. Also, when I build the neighbor
list on the card, it reports that the atoms have ave neighs/atom = 0.

this is expected: since you are building the neighbor lists directly on
the GPU, this statistic reports only the number of neighbors per atom
stored on the CPU.
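
if you want the per-atom neighbor statistics reported for the host, the
neighbor list build can be kept on the CPU, roughly as sketched below
(usually at some cost in GPU performance):

    # build neighbor lists on the host instead of on the device
    package gpu 1 neigh no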

axel.

Hi Steve, thanks for answering. I have tested with 206,000 particles and the result is the same. I could not get the most out of the GPU :(.

Hello Axel, thanks for answering. For my setup and the hardware
available, what can I do to assign more work units to the GPU? It
strikes me that a GPU implementation of the potential (colloid/gpu),
which is supposed to accelerate the calculations, runs slower.

Hello Axel, thanks for answering. For my setup and the hardware
available, what can I do to assign more work units to the GPU?

if your setup and your input are correct, there is little you can do.
but i don't have a crystal ball and i cannot read minds (neither those
of people nor of computers or gpus), so i cannot tell you whether the
problem is with your input or with your expectations. please post a
suitable example input for a short run (ideally starting from a data
file) that demonstrates your issue and has all irrelevant parts
removed. please also provide the corresponding output and the command
line settings that you used to run it.

It strikes me that a GPU implementation of the potential (colloid/gpu),
which is supposed to accelerate the calculations, runs slower.

you don't seem to understand what i explained in my previous e-mail.
whether you see acceleration or deceleration with GPUs depends on many
factors (hardware, machine configuration, drivers, software, GPU
utilization, exclusive/shared access, LAMMPS input, system).

the usual recommendation also applies: if things don't work as
expected, you have to study the output of your job very carefully to
see if there is any indication of a problem. run other tests to see if
they do work as expected. there is GPU benchmark data posted on the
LAMMPS website; you should compare to that first.
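
one such comparison could look roughly like the lines below; the
benchmark input, core count, and executable name are just an example of
the idea:

    # CPU-only reference run of a bundled benchmark input
    mpirun -np 4 lmp -in bench/in.lj
    # the same input with the GPU package enabled
    mpirun -np 4 lmp -sf gpu -pk gpu 1 -in bench/in.lj

comparing the reported loop times against the numbers on the LAMMPS
benchmark page shows whether the GPU setup itself behaves as expected.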

axel.

Hello,
sorry for the delay.

I am sending my test results with the GPU and the colloid potential.

MPI with 20 cores gets the best result: ID_SIM 0, with a loop time of 18.5876 s. The best time with the GPU was ID_SIM 13, with a loop time of 26.9159 s.

resume.txt: % of time and loop time.
in.colloid: input script.
run.sh: simulation runs.
log_screen.tar.gz: details of each simulation.

Cheers

resume.txt (548 Bytes)

in.colloid (587 Bytes)

run.sh (1.5 KB)

log_screen.tar.gz (4.89 KB)

Hello,
sorry for the delay.

I am sending my test results with the GPU and the colloid potential.

MPI with 20 cores gets the best result: ID_SIM 0, with a loop time of
18.5876 s. The best time with the GPU was ID_SIM 13, with a loop time of 26.9159 s.

resume.txt: % of time and loop time.
in.colloid: input script.
run.sh: simulation runs.
log_screen.tar.gz: details of each simulation.

when looking at the log output from running on the CPU, there are *extremely*
few neighbors per atom (between 0.5 and 1.1). that means there is very
little "compute work" that can be handed to the GPU, which is also
reflected in the rather small percentage of time spent in the Pair section.
so simply considering Amdahl's law, there is little to gain from GPU
acceleration, and that doesn't even factor in the overhead from
transferring data to the GPU and related costs.
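
to make that bound concrete (the 20% figure below is purely
illustrative, not taken from these logs): if a fraction p of the loop
time is spent in Pair and only that part is sped up by a factor s, then

    overall speedup = 1 / ((1 - p) + p/s)  <=  1 / (1 - p)

so with p = 0.2 the whole run can never become more than 1.25x faster,
no matter how fast colloid/gpu itself is.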

effective GPU acceleration, however, specifically requires many
neighbors per atom, as that is what allows the GPU to exploit its concurrency.

BTW: with this kind of setup, you should be able to speed up the CPU
calculation by using the standard binned neighbor list method (again,
this reduces overhead from unneeded operations: when there are
essentially no neighbors, there is no benefit in differentiating
between near and far neighbors).
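
a minimal sketch of that change, assuming the input currently uses the
multi neighbor style (the skin value below is just a common default for
lj units, not tuned for this system):

    # standard binned neighbor lists instead of the multi style
    neighbor      0.3 bin
    neigh_modify  delay 0 every 1 check yes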

axel.