Hi all,

Hopefully there is a simple answer to this. I tried to include as much information as possible in this email, just in case it’s useful.

I built three different versions of LAMMPS, the first with the default packages, the second with the default + GPU package, and the third with the default + USER-CUDA package.

When I run the melt.2.5 GPU example simulation using the first build (running on CPU only), the output value for the *total # of neighbours* is 10,039,927. When I run the simulation again using the second build (using the GPU package with the “package gpu force 0 1 1” command), the output value for the *total # of neighbours* is 19,190,086.

If I use the CUDA package (the third build), the *# of neighbour* results is the same as with the GPU package (which is 19,190,086).

The *total # of neighbours* using the GPU or CUDA package is always around 2x the value returned when using only the CPU (when the number of neighbours is low, it’s usually exactly 2x). All of the other results in the output are the same (the data table, etc) for both builds. The only difference is the *# of neighbours* values. This does not just happen for the melt simulation, but for all of the simulations I’ve tried so far.

Another thing to note is that if I run the CUDA build with CUDA turned off (“-c off” option), or the GPU build without the “-sf gpu” option, it returns the same number of neighbours (10,039,927) as the CPU build (which is expected, but I figured I’d write it anyways).

There is a simple answer. CPU pair styles typically use a half neighbor list, where the I,J pair appears once. GPU pair styles typically use a full neighbor list, where I,J is stored both with atom I and atom J. Hence the factor of 2.
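The factor of 2 can be seen with a toy sketch (plain Python, not LAMMPS code): count neighbours of the same points within the same cutoff with both list conventions.

```python
from itertools import combinations

def dist2(p, q):
    """Squared distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def half_neighbors(points, cutoff):
    """Half list: each i,j pair appears once, under the lower index only."""
    c2 = cutoff * cutoff
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist2(points[i], points[j]) < c2]

def full_neighbors(points, cutoff):
    """Full list: the same pair is stored twice, once under i and once under j."""
    c2 = cutoff * cutoff
    return [(i, j) for i in range(len(points)) for j in range(len(points))
            if i != j and dist2(points[i], points[j]) < c2]

pts = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (3.0, 3.0)]
half = half_neighbors(pts, 1.2)
full = full_neighbors(pts, 1.2)
print(len(half), len(full))  # 3 6 -- the full list is exactly twice the half list
```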

Steve

Hi all,

Hopefully there is a simple answer to this. I tried to include as much information as possible in this email, just in case it’s useful.

[...]

Another thing to note is that if I run the CUDA build with CUDA turned off (“-c off” option), or the GPU build without the “-sf gpu” option, it returns the same number of neighbours (10,039,927) as the CPU build (which is expected, but I figured I’d write it anyways).

yes, because if you compile the GPU or USER-CUDA package into LAMMPS, but are not using the corresponding styles, you will *still* run on the CPU and use the code in exactly the same way as if you had not included any of those packages. LAMMPS is structured so that you can include multiple variants of the same thing and then select at run time which of them you want to use.
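As a sketch of that run-time selection (assuming a binary named lmp_mpi built with both accelerator packages; the flags below are standard LAMMPS command-line switches):

```shell
lmp_mpi -in in.melt.2.5             # plain CPU pair styles
lmp_mpi -sf gpu -in in.melt.2.5     # same binary, gpu-suffixed styles
lmp_mpi -c off -in in.melt.2.5      # USER-CUDA build with CUDA disabled
```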

[...]

These are my thoughts so far:

1) One noticeable difference is the “Neighs” VS “FullNghs”, but from what I understand, it’s only a difference between a half neighbour list and a full neighbour list. I wouldn’t think this would make a difference for the total number of neighbours since it’s only a change in list structure.

here is where you are wrong. the half vs. full neighbor list makes all the difference. in a full neighbor list, all pairs are listed twice, once for each atom that constitutes the pair. in a half neighbor list, those pairs are distributed, so that you have in total only half the neighbors.

2) I tried remaking both the packages and LAMMPS a few times, and tried making the GPU and CUDA packages with both single and double precision (just to try it).

that has no impact at all.

3) I looked at the pair style used in the example (lj/cut), and the docs say “Styles with a *cuda*, *gpu*, *omp*, or *opt* suffix are functionally the same as the corresponding style without the suffix”, so I can’t see this being the problem.

functionally the same means they implement the same potential, but they do it differently. and one of the differences is the choice of whether you apply newton's third law or not.

axel.


Thanks Axel and Steve for your quick replies. I misunderstood what exactly the total number of neighbours represented; I figured that the total number of neighbours was the total number of distinct neighbour pairs, not the size of the neighbour list.

My only question now is why the number of neighbours in the full list is not *exactly* 2 times the number of neighbours in the half list. From the examples I have run, it is usually somewhere between 1.74x and 2x (those are the two extreme values I have found so far). If it was around 1.98x, it might be caused by rounding errors when building the neighbour list, but I wouldn’t think rounding errors would cause the difference between 2x and 1.74x. Is there a simple explanation as to why the full list is sometimes notably less than double the size of the half list, or could there be many different factors involved?

The main reason for this question is that I’m trying to figure out whether or not there is potential for the simulation running on the GPU to return different results than when it runs on the CPU (ignoring rounding errors and order of operation errors).

Thanks,

Steve

Thanks Axel and Steve for your quick replies. I misunderstood what exactly the total number of neighbours represented; I figured that the total number of neighbours was the total number of distinct neighbour pairs, not the size of the neighbour list. My only question now is why the number of neighbours in the full list is not *exactly* 2 times the number of neighbours in the half list. From the examples I have run, it is usually somewhere between 1.74x and 2x (those are the two extreme values I have found so far). If it was around 1.98x, it might be caused by rounding errors when building the neighbour list, but I wouldn’t think rounding errors would cause the difference between 2x and 1.74x. Is there a simple explanation as to why the full list is sometimes notably less than double the size of the half list, or could there be many different factors involved? The main reason for this question is that I’m trying to figure out

there are two factors involved here: 1) whether the pair style requests a half or a full neighbor list, 2) whether "newton off" or "newton on" is used. the newton keyword does not apply to what the pair style itself requests, but to how to handle the situation where a pair is split between two subdomains. even with a half neighbor list, you may have "newton off", and then you have more neighbors than with "newton on".
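The newton effect can be illustrated with a toy sketch (plain Python with made-up positions and a made-up ownership rule, not LAMMPS internals): with "newton off", a pair whose atoms are owned by different subdomains is stored on both of them, so the half-list total grows and the full-to-half ratio drops below exactly 2.

```python
# Toy sketch, not LAMMPS internals: two subdomains split at x = 1.0.
atoms = {0: (0.8, 0.0), 1: (1.2, 0.0), 2: (0.2, 0.0), 3: (0.9, 0.2)}
cutoff = 0.6

def owner(aid):
    """Made-up ownership rule: the subdomain is decided by the x coordinate."""
    return 0 if atoms[aid][0] < 1.0 else 1

def close(i, j):
    (xi, yi), (xj, yj) = atoms[i], atoms[j]
    return (xi - xj) ** 2 + (yi - yj) ** 2 < cutoff ** 2

# distinct pairs within the cutoff
pairs = [(i, j) for i in atoms for j in atoms if i < j and close(i, j)]

# "newton on": a pair crossing the subdomain boundary is stored on one
# rank only, so every pair contributes exactly one half-list entry
newton_on = len(pairs)

# "newton off": each rank stores the pair for the atom it owns, so a
# pair split between two subdomains is stored twice
newton_off = sum(2 if owner(i) != owner(j) else 1 for i, j in pairs)

print(newton_on, newton_off)  # 3 5 -- more half-list neighbors with newton off
```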

whether or not there is potential for the simulation running on the GPU to return different results than when it runs on the CPU (ignoring rounding errors and order of operation errors).

no. there should be no significant differences. the only way that i know of for neighbor lists to become a problem in this context is when you do some fancy manipulations with exclusions.

axel.

Thanks again for the quick reply, that was exactly what I was looking for. I appreciate your help!

Regards,

Steve