LAMMPS on current AMD GPUs

Hi all,

Due to NVIDIA's new license regulations concerning datacenter usage of
their drivers, I am tempted to switch away from my GTX 1080 cards
towards AMD. Questions:

Has anybody ever tested LAMMPS on AMD's Vega 64 GPUs?

Is it easy to compile with OpenCL (as compared to CUDA; does
the Geryon library work well)?

Any experience with mGPU/CrossFire setups? Does it work?

How is the single (or mixed) precision performance compared to a GTX 1080
(Ti)?

One more question, as I was never able to directly compare CPU models: with
GPUs enabled for computing the pair interactions, is it generally better to
buy fewer CPUs at a faster clock rate, or more CPUs at a lower clock rate?

Thanks in advance for any replies! :-)

/Patrick

Has anybody ever tested LAMMPS on AMD's Vega 64 GPUs?

No idea. I had some donated FireGL GPUs, but that was almost 10 years ago.
I recall some people posting to the lammps-users list who were using
AMD GPUs on Windows.

Is it easy to compile with OpenCL (as compared to CUDA; does
the Geryon library work well)?

Yes. This is regularly tested with the LAMMPS Windows binaries: those
are compiled in OpenCL mode using a cross-compiler, which is not
supported by CUDA.
How well this works depends a lot on the OpenCL driver provided by the
GPU vendor.
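
For what it's worth, a minimal sketch of such an OpenCL build with the CMake
build system (assuming a recent LAMMPS source tree and a working OpenCL
runtime on the machine; adjust paths and options for your setup):

  cd lammps && mkdir build && cd build
  # enable the GPU package and select the OpenCL backend instead of CUDA
  cmake -D PKG_GPU=on -D GPU_API=opencl ../cmake
  make -j 8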

Any experience with mGPU/CrossFire setups? Does it work?

It is not used at all. Rather than attaching multiple GPUs to the same
MPI rank, the typical and most efficient setup for the GPU package is
to attach multiple MPI ranks to the same GPU. Please recall that the
GPU package only offloads selected parts of the calculation (neighbor
lists, pair interactions, part of kspace) to the GPU, where they accelerate
well, and runs the rest on the CPU, overlapping CPU and GPU work where
possible. Considering Amdahl's law, oversubscribing the GPU gives you a
performance boost, because the non-accelerated part is still parallelized
across the MPI ranks.
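
To illustrate, for a hypothetical node with 2 GPUs and 16 CPU cores
(assuming the LAMMPS binary is called lmp and the input file is in.melt),
the command line would look something like:

  # 16 MPI ranks share 2 GPUs, i.e. 8 ranks per GPU
  mpirun -np 16 lmp -sf gpu -pk gpu 2 -in in.melt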

How is the single (or mixed) precision performance compared to a GTX 1080
(Ti)?

Don't know. In the past, single (and mixed) precision tended to be
faster on NVIDIA GPUs with CUDA, while double precision was significantly
faster on AMD.
On NVIDIA, CUDA was usually better than OpenCL, but not by much. The
biggest issue with AMD was the driver support for multiple MPI ranks
attached to the same GPU: the AMD driver apparently had more locks and
was not as well multi-threaded, which limited the maximum utilization.
NVIDIA drivers used to have the same issue in the very beginning, but
that has been improved over time.
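
If you benchmark this yourself, keep in mind that the precision of the GPU
package is fixed at compile time; with the CMake build a mixed-precision
configuration would look roughly like this (a sketch, vendor independent):

  # mixed precision: forces in single precision, accumulation in double precision
  cmake -D PKG_GPU=on -D GPU_API=opencl -D GPU_PREC=mixed ../cmake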

One more question, as I was never able to directly compare CPU models: with
GPUs enabled for computing the pair interactions, is it generally better to
buy fewer CPUs at a faster clock rate, or more CPUs at a lower clock rate?

Impossible to say. There are far too many factors impacting
performance: available memory bandwidth, PCI bus layout and
performance, cache sizes, memory module layout, average turbo-boost
clock difference, thermal load, and utilization of vector instructions
(code that doesn't vectorize well runs faster without using AVX at
all, because this reduces the thermal load, which in turn raises the
average turbo-boost frequency).

axel.

Kokkos recently added support for AMD:

# AMD-GPUS: Kaveri,Carrizo,Fiji,Vega
# AMD-CPUS: AMDAVX,Ryzen,Epyc

So that is another option, though I've never tried running LAMMPS with Kokkos on an AMD GPU.
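
For completeness, a hypothetical run command using the KOKKOS package on a
single GPU (assuming a Kokkos build configured for the AMD device, an
executable named lmp, and an input file in.melt):

  # one MPI rank driving one GPU through the KOKKOS package
  mpirun -np 1 lmp -k on g 1 -sf kk -in in.melt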

Stan