> Reading the tutorial, I discovered the various LAMMPS accelerators and am
> eager to give them a try. I'm particularly intrigued by Axel's OpenMP
> implementation, which promises higher performance than MPI for single SMP
> processors like mine.
no, it doesn't. for LAMMPS, MPI parallelization is almost always
better, even inside a node. threads mostly make sense when MPI alone
leads to significant contention in inter-node communication. within a
node, MPI parallelization often wins because of better data locality:
MPI uses domain decomposition, while threading uses atom
decomposition.
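the easiest way to see this on your own machine is to time both modes
with the same binary. a minimal comparison might look like the
following (the binary name lmp_mpi and the input file in.script are
placeholders for your own build and input):

```shell
# on an 8-core node, compare pure MPI against MPI plus threads.
# same executable, selection happens entirely on the command line.
mpirun -np 8 lmp_mpi -in in.script                    # 8 MPI tasks, no threads
mpirun -np 2 lmp_mpi -sf omp -pk omp 4 -in in.script  # 2 MPI tasks x 4 threads
```

compare the "Loop time" lines in the two log files; on a single node
the pure MPI run usually comes out ahead.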
> But I am a little confused about whether to pursue the GPU or the CUDA
> packages, and about their synergy with the OMP build. I am planning to
you can compile a binary that contains all three packages (GPU,
USER-CUDA, USER-OMP) and select which to use at run time.
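a build along these lines is a sketch only; the library makefile names
(Makefile.linux) and the machine target (mpi) depend on your
installation, and lmp_mpi / in.script are placeholders:

```shell
# build the support libraries first (makefile names vary by system)
cd lib/gpu  && make -f Makefile.linux
cd ../cuda  && make

# enable all three packages in one binary
cd ../../src
make yes-gpu yes-user-cuda yes-user-omp
make mpi

# then pick the accelerator at run time, same executable:
lmp_mpi -sf omp -pk omp 4 -in in.script   # USER-OMP, 4 threads
lmp_mpi -sf gpu -pk gpu 1 -in in.script   # GPU package, 1 GPU
lmp_mpi -c on -sf cuda -in in.script      # USER-CUDA
```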
when using GPU acceleration, you first need to make sure you use the
GPU efficiently; everything else is secondary.
the GPU package allows multiple MPI tasks to share the same GPU, and
that is often more effective, since MPI parallelization is faster
than OpenMP parallelization. the best approach is to increase the
number of MPI tasks for as long as you get a speedup, and then add
threads for additional speed. you can also use USER-OMP styles with a
single thread. the USER-CUDA package is currently unmaintained; while
it is still usable, you have to double-check your results carefully.
since USER-CUDA has a different data model than the GPU package, you
cannot oversubscribe the GPU, and thus you will benefit more from
USER-OMP in combination with USER-CUDA. in any case, the word is:
tests and benchmarks. there is no single globally best solution for
all simulations.
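a benchmark scan for the GPU package could look like this (lmp_mpi and
in.script are placeholders; the task counts are only a starting point
for your own tests):

```shell
# oversubscribe one GPU with increasing numbers of MPI tasks;
# stop adding tasks when the speedup stalls, then add OpenMP threads.
mpirun -np 2 lmp_mpi -sf gpu -pk gpu 1 -in in.script
mpirun -np 4 lmp_mpi -sf gpu -pk gpu 1 -in in.script
mpirun -np 8 lmp_mpi -sf gpu -pk gpu 1 -in in.script
```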
> simulate a 1-10 million atom system in MD runs such as NPT and NVE.
> Before I launch into the building and testing process, I would like some
> advice on which accelerators or build configuration would likely be fastest
> for my system:
> Core i7-4790K @ 5GHz
> 32GB 3100MHz DDR3 RAM
> 4TB HDD, 10k RPM
> NVIDIA GTX 980 @ 1500MHz, 8GHz RAM
i cannot talk about specific GPUs, since i haven't had an opportunity
to test any newer GPU in over two years. the GeForce Titan i have was
pretty much equivalent to the top-level Tesla card at the time (K20).
if possible, consider installing multiple GPUs, as long as each can
use the full 16-lane PCIe bandwidth.
i would put in more RAM (not for simulations, but for analysis and
visualization), use multiple SSDs, and configure them as RAID-1 or
RAID-10. with current storage sizes, i would consider any machine
without redundancy far too risky for research use, and RAID-5 or
RAID-6 wears out SSDs faster, requires a lot of compute power, and
tends to be slower.
> I am also planning to use GCC 5.2 with its new profile-guided
> auto-vectorization tuned for this processor to further boost performance.
for a huge and complex piece of software like LAMMPS, these advanced
compiler features tend to do more harm than good. it is not worth the
effort, so forget about it.
to get good vectorization with the regular code paths in LAMMPS, the
internal data structures would need to be changed; they currently are
not vectorization friendly, and a compiler cannot change that,
regardless of how smart it is.
if you do want to benefit from vectorization, you have to use the
latest intel compilers and the USER-INTEL package, which utilizes
vectorization quite well.
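for completeness, a USER-INTEL build could be set up roughly like
this; the machine makefile name (intel_cpu) and the resulting binary
name are assumptions that depend on your LAMMPS version and compiler
setup, and in.script is a placeholder:

```shell
# requires the intel compilers; USER-OMP provides fallback styles
cd src
make yes-user-intel yes-user-omp
make intel_cpu

# run with the vectorized styles selected via the intel suffix
mpirun -np 8 lmp_intel_cpu -sf intel -in in.script
```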
axel.