desktop machine acceleration OpenMP + GPU possible?

Dear lammps users,

I am somewhat new to boosting scientific calculations with powerful hardware and I need some guidance from you. I have recently built a powerful single-CPU desktop machine on which I also want to run LAMMPS (it was partially built for that), and, of course, I want to get the most out of the configuration.

I have spent quite some time googling for information, but most of the time it is not specific enough for my case. Besides your general point of view, I would also be very grateful if you could point me to the right thread that I might have missed on this mailing list.

So here is the configuration: i7-5820K (six cores, hyper-threading available) @ 3.3 GHz, GeForce GTX 980, 16 GB DDR4 memory, running Ubuntu 14.04.

I have Open MPI 1.10.1, OpenMP, and the latest CUDA and LAMMPS (7 Dec 2015).

My systems: For now I will need to run simulations of 1) dense LJ particles (not more than several thousand, but many such systems) and 2) simple fluids (water and small alkanes) between walls of LJ particles.

Later I will simulate larger molecules and ions with water.

My questions:

1) MPI or OpenMP? Before my search I thought I should go for OpenMP, as my machine only has one CPU. Then I found out it is not that straightforward and that MPI can actually do a lot even with my single processor. Is this true?

2) Furthermore, I am most interested in finding out which libraries I should use to get the best performance. I understand there is no single rule applicable to all configurations (and systems), but I would like to know some basic rules of thumb. As a matter of fact, I am planning to run tests to find the best combination of these accelerating tools.

- I found somewhere that hyper-threading is only efficient if one uses CPU acceleration, but not in combination with GPUs.

- I understood the gpu library might be better suited than user-cuda due to the small size of my systems, but this also depends on what I want to compute on the fly. Is this correct?

- Should I try KOKKOS (it seems to me it is suited for multi-threaded CPUs and GPU acceleration), or is it better to use user-omp in combination with the gpu library?

I know these are rather vague questions, but any information that clarifies any of the above points and helps me take the right direction and avoid some big failures will be highly appreciated.

Thanks a lot in advance.

Gyorgy

Dear lammps users,

I am somewhat new to boosting scientific calculations with powerful hardware and I need some guidance from you. I have recently built a powerful single-CPU desktop machine on which I also want to run LAMMPS (it was partially built for that), and, of course, I want to get the most out of the configuration.

I have spent quite some time googling for information, but most of the time it is not specific enough for my case. Besides your general point of view, I would also be very grateful if you could point me to the right thread that I might have missed on this mailing list.

So here is the configuration: i7-5820K (six cores, hyper-threading available) @ 3.3 GHz, GeForce GTX 980, 16 GB DDR4 memory, running Ubuntu 14.04.

I have Open MPI 1.10.1, OpenMP, and the latest CUDA and LAMMPS (7 Dec 2015).

My systems: For now I will need to run simulations of 1) dense LJ particles (not more than several thousand, but many such systems) and 2) simple fluids (water and small alkanes) between walls of LJ particles.

Later I will simulate larger molecules and ions with water.

My questions:

1) MPI or OpenMP? Before my search I thought I should go for OpenMP, as my machine only has one CPU. Then I found out it is not that straightforward and that MPI can actually do a lot even with my single processor. Is this true?

yes. the general rule with LAMMPS is: try to get the most out of MPI first and then check whether you can get something on top of that with OpenMP. there are some niche cases where OpenMP has significant advantages, e.g. when the domain decomposition leads to load imbalances that cannot be remedied. The OpenMP support via USER-OMP is very effective at a small number of threads, but less so at a larger number of threads.

2) Furthermore, I am most interested in finding out which libraries I should use to get the best performance. I understand there is no single rule applicable to all configurations (and systems), but I would like to know some basic rules of thumb. As a matter of fact, I am planning to run tests to find the best combination of these accelerating tools.

- I found somewhere that hyper-threading is only efficient if one uses CPU acceleration, but not in combination with GPUs.

yes. and also for CPUs, the benefit is limited or in some cases it even harms performance. only tests and benchmarks can tell. please keep in mind that for most classical potentials, the performance is not only determined by the number of cores and how many flops they provide, but also by how fast you can move data and how much data locality you can maintain. the latter is what makes MPI so efficient with LAMMPS: the domain decomposition MPI parallelization increases data locality with the number of MPI ranks used.

- I understood the gpu library might be better suited than user-cuda due to the small size of my systems, but this also depends on what I want to compute on the fly. Is this correct?

the GPU package only computes (some) pair styles and (optionally) kspace. running as much as possible on the GPU is not always the fastest, as one can run GPU and CPU concurrently with the GPU package. other than that, overall speed and parallel efficiency is - like for all parallel codes - dominated by the slowest module and the module with the worst parallel efficiency. Amdahl's law can be brutal.
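
to make the "run GPU and CPU concurrently" part concrete, a minimal sketch (the input file name and the pppm accuracy are placeholders): apply the gpu suffix from the command line, but exempt kspace in the input script so that pppm runs on the CPU cores while Pair runs on the GPU:

  # command line: use /gpu styles where available, on 1 GPU
  mpirun -np 2 lmp_mpi -sf gpu -pk gpu 1 -in in.system

  # in the input script, wrap the kspace definition so it stays on the CPU:
  suffix off
  kspace_style pppm 1.0e-4
  suffix on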

- Should I try KOKKOS (it seems to me it is suited for multi-threaded CPUs and GPU acceleration), or is it better to use user-omp in combination with the gpu library?

KOKKOS or USER-CUDA on the one hand and GPU/USER-OMP/USER-INTEL on the other follow different strategies.
at the moment, KOKKOS is not recommended for anybody that is not a developer, and USER-CUDA is essentially unmaintained with known bugs, but also well tested for most features. so in both cases, thorough and careful testing is required. actually, USER-CUDA benefits the most from USER-OMP, since it doesn't support oversubscribing GPUs (well). with GPU, you can oversubscribe the GPU and thus achieve better GPU utilization *and* MPI parallelization of the rest.
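
purely as a sketch of how the invocation differs (binary names and the input file are placeholders; they depend on which machine makefile you build with), the KOKKOS package is driven by its own command-line switches, e.g.:

  # KOKKOS on 1 GPU (requires a KOKKOS/CUDA build)
  mpirun -np 1 lmp_kokkos_cuda -k on g 1 -sf kk -in in.system

  # KOKKOS with 6 OpenMP threads on the CPU
  mpirun -np 1 lmp_kokkos_omp -k on t 6 -sf kk -in in.system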

I know these are rather vague questions, but any information that clarifies any of the above points and helps me take the right direction and avoid some big failures will be highly appreciated.

the main rule for such things is always the same: do plenty of benchmarks and pay attention to the profiling output. also, this topic comes up on this list every few months, so a search through the mailing list archives will give you more detailed info and discussions of benchmarks and optimizations.

i don't know what you mean by "small" system, but if your system is too small, you might even find that you are better off not using the GPU at all.

axel.

> at the moment, KOKKOS is not recommended for anybody that is not a developer

I wouldn't say that. Assuming you can compile with KOKKOS on your box (requires current versions of specific compilers), and KOKKOS has the pair style you want, I'd give it a try on a GPU or Phi and see how it does. Any issues or performance feedback would be welcomed by Stan and Christian (the LAMMPS developers working most with KOKKOS).

Steve

>> at the moment, KOKKOS is not recommended for anybody that is not a developer

> I wouldn't say that. Assuming you can compile with KOKKOS on your box (requires current versions of specific compilers), and KOKKOS has the pair style you want, I'd give it a try on a GPU or Phi and see how it does.

but that is pretty much my point: unless you have experience with software development and know your way around the compiler documentation and how to tweak makefiles, you *will* have a hard time figuring out what you have to do to get LAMMPS with KOKKOS compiled properly, especially if you need to figure out what is going wrong in case of error messages. it is getting better with every update to the package, but it still has a long way to go to reach the point where the base LAMMPS code is (and even that has people occasionally struggling). this is doubly true for people trying to compile LAMMPS/KOKKOS on HPC clusters, where you also have to factor in what non-standard settings, tweaks or mistakes the HPC admins have made to compilers/libraries and other supporting packages.
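
for reference, the conventional make-based build of the KOKKOS package would be roughly the following (a sketch only, assuming the stock kokkos machine makefiles shipped in src/MAKE/OPTIONS match your compiler and CUDA setup, which is exactly the part that usually needs tweaking):

  cd src
  make yes-kokkos
  make kokkos_cuda    # or: make kokkos_omp for CPU-only threading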

axel.

Thanks a lot, Steve and Axel, for your answers. I will then try using different libraries (leaving out KOKKOS, as it might not be worth the effort at the moment), paying particular attention to comparing gpu and user-cuda.

I will also make a more thorough check of previous related threads in the mailing list archive.

yes. the general rule with LAMMPS is: try to get the most out of MPI first and then check whether you can get something on top of that with OpenMP. there are some niche cases where OpenMP has significant advantages, e.g. when the domain decomposition leads to load imbalances that cannot be remedied. The OpenMP support via USER-OMP is very effective at a small number of threads, but less so at a larger number of threads.

Does this mean that when I invoke mpirun with lmp_mpi, the number of processes I pass to mpirun is the number of threads to be used (6 at most in my case)?

i don't know what you mean by "small" system, but if your system is too small, you might even find that you are better off not using the GPU at all.

By small I mean on the order of several thousand atoms.

Thanks.

Best regards,
Gyorgy

[...]

yes. the general rule with LAMMPS is: try to get the most out of MPI first and then check whether you can get something on top of that with OpenMP. there are some niche cases where OpenMP has significant advantages, e.g. when the domain decomposition leads to load imbalances that cannot be remedied. The OpenMP support via USER-OMP is very effective at a small number of threads, but less so at a larger number of threads.

Does this mean that when I invoke mpirun with lmp_mpi, the number of processes I pass to mpirun is the number of threads to be used (6 at most in my case)?

no. mpirun determines the number of MPI ranks. multi-threading in USER-OMP is implemented via OpenMP, and thus the number of threads is controlled by default via the OMP_NUM_THREADS environment variable (but can be overridden with the package command/flag).

for optimal performance, you also have to pay attention to processor and
memory affinity.

please also note that typically the largest consumers of total time are Pair, Kspace, and Neigh. out of these, Pair and Neigh run very well on the GPU, and it is often most efficient to run Kspace on the CPU concurrently with the GPU (tweak the coulomb cutoff for optimal balance). outside of that, there is rather little to gain from multi-threading: few computes/fixes consume a lot of time, and out of the ones that do, very few have been ported to USER-OMP yet (or make sense to be ported).
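
to make the "tweak the coulomb cutoff" part concrete, a minimal input-script sketch (the cutoff values and pppm accuracy are placeholders): with Pair on the GPU and pppm on the CPU, a larger coulomb cutoff shifts work from Kspace (CPU) into Pair (GPU), so the cutoff becomes the knob for balancing the two:

  pair_style   lj/cut/coul/long 10.0 12.0   # larger coulomb cutoff -> more work for Pair (on the GPU with -sf gpu)
  kspace_style pppm 1.0e-4                  # correspondingly less work for Kspace (kept on the CPU)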

i don't know what you mean by "small" system, but if your system is too small, you might even find that you are better off not using the GPU at all.

By small I mean on the order of several thousand atoms.

yes, that is at the lower limit of using the GPU well. it depends a bit on the specific set of styles, fixes and computes you are using.

just to give you a point of reference: i have a desktop with two 4-core Intel Xeon X5560 sockets @ 2.8 GHz and a single Nvidia GeForce GTX Titan (also used for graphics). if i run the peptide example (which has 2004 atoms), i get the following performance numbers (Loop time), with GPU support in full double precision.

1 CPU, MPI-only: 6.46067
2 CPU, MPI-only: 3.43457
4 CPU, MPI-only: 1.86888
8 CPU, MPI-only: 1.12704
16 CPU (hyperthreading), MPI-only: 1.16251
1 CPU, 1 MPI x 1 OpenMP: 5.52743 (--bind-to socket)
2 CPU, 1 MPI x 2 OpenMP: 2.90653 (--bind-to socket)
4 CPU, 1 MPI x 4 OpenMP: 1.7102 (--bind-to socket)
4 CPU, 2 MPI x 2 OpenMP: 1.68366 (--bind-to socket)
8 CPU, 1 MPI x 8 OpenMP: 1.13036 (--bind-to none)
8 CPU, 2 MPI x 4 OpenMP: 1.6054 (--bind-to socket)
8 CPU, 4 MPI x 2 OpenMP: 1.01233 (--bind-to socket)
16 CPU (hyperthreading), 8 MPI x 2 OpenMP: 0.990902 (--bind-to socket)
16 CPU (hyperthreading), 4 MPI x 4 OpenMP: 0.980228 (--bind-to socket)
16 CPU (hyperthreading), 2 MPI x 8 OpenMP: 1.17012 (--bind-to none)

1 CPU, 1 GPU, MPI-only: 1.2607 (-sf gpu)
1 CPU, 1 GPU, MPI-only: 1.00899 (suffix off for pppm)
2 CPU, 1 GPU, MPI-only: 1.6281 (-sf gpu)
2 CPU, 1 GPU, MPI-only: 1.25873 (suffix off for pppm)
1 CPU, 1 GPU, 1 MPI x 2 OpenMP: 1.02555 (suffix omp for pppm)
2 CPU, 1 GPU, 2 MPI x 1 OpenMP: 1.62516 (suffix omp for pppm)

4 CPU, 1 GPU, 1 MPI x 4 OpenMP: 1.00901 (suffix omp for pppm)
4 CPU, 1 GPU, 2 MPI x 2 OpenMP: 1.24395 (suffix omp for pppm)
4 CPU, 1 GPU, 4 MPI-only: 1.92484 (suffix gpu for pppm)
4 CPU, 1 GPU, 4 MPI-only: 1.9057 (suffix off for pppm)

you can see that for such a small system, you can get about a 6x speedup from adding the GPU to a single CPU core. it is slightly faster to run Kspace on the CPU concurrently with running Neigh and Pair on the GPU. however, you can achieve about the same speed just using CPU cores, and it is hard to get faster than 1 second regardless of the settings.

HTH,

     axel.