Advice on building LAMMPS for i7 machine with NVIDIA GPU?

Reading the tutorial, I discovered the various LAMMPS accelerators and am eager to give them a try. I’m particularly intrigued by Axel’s OpenMP implementation, which promises higher performance than MPI for single SMP processors like mine.

But I am a little confused about whether to pursue the GPU or the CUDA packages, and about their synergy with the OMP build. I am planning to simulate a 1-10 million atom system with MD runs such as NPT and NVE. Before I launch into the building and testing process, I would like some advice on which accelerators or build configuration would likely be fastest for my system:

Core i7-4790K @ 5GHz

32GB 3100MHz DDR3 RAM

4TB HDD, 10k RPM

NVIDIA GTX 980 @ 1500MHz, 8GHz RAM

I am also planning to use GCC 5.2 with its new profile-guided auto-vectorization tuned for this processor to further boost performance. It’s no supercomputer, but it’s actually a fair bit more capable than some of the old 12-core boxes here.

With gratitude,

Jihong

Reading the tutorial, I discovered the various LAMMPS accelerators and am
eager to give them a try. I'm particularly intrigued by Axel's OpenMP
implementation, which promises higher performance than MPI for single SMP
processors like mine.

no it doesn't. for LAMMPS, MPI parallelization is almost always
better, even inside a node. threads mostly make sense when using MPI
leads to significant communication contention for inter-node
communication. for intra-node communication, MPI parallelization often
benefits from better data locality, since MPI uses domain
decomposition while threading uses atom decomposition.
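
for example, assuming an MPI-enabled binary called lmp with the
USER-OMP package included (binary name and input file are
placeholders), a quick comparison on a quad-core machine would look
like this:

  # pure MPI: one task per physical core
  mpirun -np 4 lmp -in in.melt

  # hybrid MPI+OpenMP: 2 tasks with 2 threads each, via USER-OMP styles
  mpirun -np 2 lmp -sf omp -pk omp 2 -in in.melt

time both and keep whichever is faster for your input.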

But I am a little confused about whether to pursue the GPU or the CUDA
packages, and about their synergy with the OMP build. I am planning to

you can compile a binary that contains all three packages (GPU,
USER-CUDA, USER-OMP) and select which one to use at run time.
when using GPU acceleration, you first need to make sure you use the
GPU efficiently; everything else is rather secondary.
the GPU package allows multiple MPI tasks to share the same GPU, and
that is often more effective, since MPI parallelization is faster
than OpenMP parallelization. the best approach is to increase the
number of MPI tasks for as long as you get a speedup and then add
threads for additional speed. you can also use USER-OMP styles with a
single thread. the USER-CUDA package is currently unmaintained, and
while it is still usable, you have to double-check your results
carefully. since USER-CUDA has a different data model than GPU, you
cannot oversubscribe the GPU, and thus you will benefit more from
USER-OMP in combination with USER-CUDA. in any case, the word is:
tests and benchmarks. there is no single globally best solution for
all simulations.
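
as a sketch, building such a combined binary with the conventional
make procedure and then picking a package at run time would look
roughly like this (this assumes the support libraries in lib/gpu and
lib/cuda have already been built; machine target and input file are
placeholders):

  # enable all three acceleration packages, then build
  cd src
  make yes-gpu yes-user-cuda yes-user-omp
  make linux

  # GPU package: oversubscribe one GPU with 4 MPI tasks
  mpirun -np 4 ./lmp_linux -sf gpu -pk gpu 1 -in in.melt

  # USER-CUDA package instead: one MPI task per GPU
  mpirun -np 1 ./lmp_linux -c on -sf cuda -in in.melt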

simulate a 1-10 million atom system with MD runs such as NPT and NVE.
Before I launch into the building and testing process, I would like some
advice on which accelerators or build configuration would likely be fastest
for my system:

Core i7-4790K @ 5GHz
32GB 3100MHz DDR3 RAM
4TB HDD, 10k RPM
NVIDIA GTX 980 @ 1500MHz, 8GHz RAM

i cannot talk about specific GPUs, since i haven't had an opportunity
to test any newer GPU for over two years. the Geforce Titan i have was
pretty much equivalent to the top-level Tesla card at the time (K20).
if possible, consider installing multiple GPUs, as long as each of
them can use the full 16-lane PCIe bandwidth.

i would put in more RAM (not for simulations, but for analysis and
visualization) and use multiple SSDs configured as RAID-1 or RAID-10.
with current disk sizes, i would consider any machine without
redundancy far too risky for research use; RAID-5 or RAID-6 wear out
SSDs faster, require a lot of compute power, and tend to be slower.
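
for illustration only, a two-SSD RAID-1 under linux could be
assembled roughly like this (device names are placeholders; this is a
sketch, not a tested recipe):

  # mirror two SSDs into one redundant block device, then format and mount
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.ext4 /dev/md0
  mount /dev/md0 /data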

I am also planning to use GCC 5.2 with its new profile-guided
auto-vectorization tuned for this processor to further boost performance.

for a huge and complex piece of software like LAMMPS, these advanced
compiler features tend to do more harm than good. it is not worth the
effort, so forget about it.
to get good vectorization with the regular code paths in LAMMPS, the
internal data structures would need to be changed; they currently are
not vectorization friendly. a compiler cannot change that, regardless
of how smart it is.

if you do want to benefit from vectorization, you have to use the
latest intel compilers and the USER-INTEL package, which utilizes
vectorization quite well.
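
a typical invocation would then look something like the following,
assuming a binary compiled with the intel compilers and the
USER-INTEL package included (the trailing 0 means no Xeon Phi
coprocessors; binary and input names are placeholders):

  # run with USER-INTEL accelerated styles on the CPU only
  mpirun -np 4 ./lmp_intel -sf intel -pk intel 0 -in in.melt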

axel.

Thanks for the info. For now I’m using the pre-built Windows binary you graciously provided (albeit without the Intel extension, since GCC is obviously not ICC).

It’s working with the GPU parameters just fine, although I would recommend following a few troubleshooting steps to get modern desktops with Intel chips working properly with their graphics cards (the Intel OpenCL platform is the default OCL platform for some maddening reason, and LAMMPS would not recognize any real graphics cards out of the box). The solution was to probe the registry under HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors and remove the other vendors such as Intel (only do this if you don’t actually need OpenCL on your processor/APU, which LAMMPS does not support anyway).
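
For anyone hitting the same issue, the inspection and removal can be done from an elevated command prompt roughly as follows (the exact value name of the Intel ICD varies by driver version; the DLL path below is just a placeholder):

  :: list the registered OpenCL vendor ICDs
  reg query "HKLM\SOFTWARE\Khronos\OpenCL\Vendors"

  :: remove the Intel entry (substitute the value name reported above)
  reg delete "HKLM\SOFTWARE\Khronos\OpenCL\Vendors" /v "C:\Windows\System32\IntelOpenCL64.dll"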

Anyway, I noticed the run reports that parameters for Fermi were used rather than Kepler or newer, as stated on the download page. Is there an easy way to grab the OpenCL parameters for a later GPU architecture? I’m using a 980 GTX FTW OC, which likely shares more architecturally with Kepler than Fermi.

With gratitude,

Gabe

Thanks for the info. For now I'm using the pre-built Windows binary you
graciously provided (albeit without the Intel extension, since GCC is
obviously not ICC).
It's working with the GPU parameters just fine, although I would recommend
following a few troubleshooting steps to get modern desktops with Intel
chips working properly with their graphics cards (the Intel OpenCL platform
is the default OCL platform for some maddening reason, and LAMMPS would not
recognize any real graphics cards out of the box). The solution was to
probe the registry under
HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors and remove the other
vendors such as Intel (only do this if you don't actually need OpenCL on
your processor/APU, which LAMMPS does not support anyway).

my windows knowledge stems primarily from the times of windows 3.0, 3.1
and a little bit of windows 95. the windows binaries are cross-compiled
from Linux. i occasionally do some light testing on a windows virtual
machine, but that is the extent of the support i can give for these. i
ported LAMMPS to the mingw compilers because i needed the binaries for a
tutorial, and the changes needed to keep the scripts working have been
minimal since then. mind you, that was before using virtual machines was
as easy as it is now. they mainly became the "official" LAMMPS binaries
for windows because nobody else has stepped up and offered to take over.

Anyway, I noticed the run reports that parameters for Fermi were used
rather than Kepler or newer, as stated on the download page. Is there an
easy way to grab the OpenCL parameters for a later GPU architecture? I'm
using a 980 GTX FTW OC, which likely shares more architecturally with
Kepler than Fermi.

the package command has a "device" option. perhaps that has some impact,
but i would be surprised if the difference was large.

http://lammps.sandia.gov/doc/package.html
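
for example, something along these lines should select the
kepler-tuned parameters, if your build supports them (a sketch based
on that doc page; check the screen output of the run to see which
parameters were actually used):

  # in the input script:
  package gpu 1 device kepler

  # or equivalently on the command line:
  lmp -sf gpu -pk gpu 1 device kepler -in in.melt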

axel.