LAMMPS GPU Compatibility/Recommendations

Dear LAMMPS Developer,

Hi, I’m Ke and I’m currently using LAMMPS to run simulations on the effects of polymer additives (like PLA and PCDTBT) on the morphology of a perovskite active layer.

I have a couple of questions about accelerating LAMMPS performance. I’m currently running my graphics cards on a system with Ubuntu 17.04. As someone fairly new to computational chemistry, I would really appreciate your answers, however brief.

  1. The GPUs I have are not specialized for computing; they are GeForce GPUs and thus have few FP64 cores (1/32 as many FP64 cores as FP32 cores), meaning that double-precision performance is pretty poor on them. Would running the simulation in single precision produce unacceptable amounts of error? The main goal of the simulation is to see what’s happening with the PLA around the perovskite crystals. Alternatively, if double precision is generally required, what’s the most cost-effective GPU accelerator for this purpose?

Also, from other threads I’ve seen that LAMMPS doesn’t play well with current-generation (Pascal) Nvidia graphics cards… is there something I need to tweak in the GPU package to make it work correctly?

  2. Even when running lmp_serial/lmp_mpi with the appropriate environment variable set so that it uses 12 threads (I have an old 12-core Xeon), the processor shows only ~10% utilization. On the other hand, LAMMPS reports ~99% CPU usage. Which of these should I trust, and am I not optimizing my system enough?

  3. How would I run LAMMPS at the granular level, e.g., work with groups of atoms (instead of individual ones) in order to simulate grain boundaries? I’ve also heard that Voronoi tessellation is necessary to achieve PLA woven in between the perovskite crystals, but I’m not sure whether that would be the best way to approach it.

Thank you so much for taking the time to reply,

Ke

Dear LAMMPS Developer,

Hi, I'm Ke and I'm currently using LAMMPS to run simulations on the
effects of polymer additives (like PLA and PCDTBT) on the morphology of a
perovskite active layer.

Please avoid using acronyms unless they are obvious to your audience.

I have a couple of questions about accelerating LAMMPS performance. I'm
currently running my graphics cards on a system with Ubuntu 17.04. As
someone fairly new to computational chemistry, I would really appreciate
your answers, however brief.

1) The GPUs I have are not specialized for computing; they are GeForce
GPUs and thus have few FP64 cores (1/32 as many FP64 cores as FP32 cores),
meaning that double-precision performance is pretty poor on them. Would
running the simulation in *single precision* produce unacceptable amounts
of error? The main goal of the simulation is to see what's happening with
the PLA around the perovskite crystals. Alternatively, if double precision
is generally required, what's the most cost-effective GPU accelerator for
this purpose?

Nobody can tell up front, without running tests, whether your simulation
will provide accurate results. Single-precision vs. double-precision math
is just one part of it. Please note that LAMMPS currently has two options
for GPU acceleration: the GPU package and the KOKKOS package. The GPU
package can be compiled for single precision, mixed precision (most
operations are done in single precision, but critical ones, like summing
the forces, in double precision), and double precision. The KOKKOS package,
as far as I remember, currently only supports double precision. Which of
the two, if any, is applicable to your system depends on the force field
you are using. Beyond that, if you use the GPU package, you will have to
run tests and see what kind of differences you get on different
observables. For example, the stress tensor (and thus the accuracy of
variable-cell simulations) typically has a much larger error than the
forces.
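
As a sketch of what selecting the precision mode looks like in practice
(file names and the CUDA_PRECISION variable are as in the lib/gpu
Makefiles of LAMMPS versions from this era; check your own version's
documentation before relying on them):

```shell
# Build the GPU support library in mixed precision. The precision is a
# compile-time choice made when building lib/gpu, not a run-time switch:
cd lammps/lib/gpu
make -f Makefile.linux CUDA_PRECISION="-D_SINGLE_DOUBLE"   # mixed precision
# other choices: -D_SINGLE_SINGLE (all single), -D_DOUBLE_DOUBLE (all double)

# Then enable the GPU package and build the LAMMPS executable:
cd ../../src
make yes-gpu
make mpi

# At run time, activate the GPU styles via the command-line switches
# (-sf gpu applies the gpu suffix, -pk gpu 1 uses one GPU per node):
mpirun -np 4 ./lmp_mpi -sf gpu -pk gpu 1 -in in.myscript
```

Rebuilding lib/gpu with a different CUDA_PRECISION value is how you would
compare single, mixed, and double precision on the same input.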

But I also have a feeling that you are putting the cart before the horse.
Before even considering GPUs, you should validate your force field choices
with small test simulations and learn how to reproduce published data on
the CPU that way.

In my personal experience, for current hardware, the most cost-effective
GPU with full double-precision support is called a CPU.

Also, from other threads I've seen that LAMMPS doesn't play well with
current-generation (Pascal) Nvidia graphics cards... is there something I
need to tweak in the GPU package to make it work correctly?

Most reported GPU problems can be tracked down to people making mistakes
when compiling LAMMPS or setting up their machines. Compiling for and
running on GPUs correctly *and* efficiently(!) requires significantly more
technical skill than running LAMMPS on the CPU.
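
As a quick sanity check before suspecting LAMMPS itself, it is worth
verifying that the driver and the GPU library both see the card (a sketch;
nvc_get_devices is a small probe utility built as a side product of
compiling lib/gpu in LAMMPS versions of this era):

```shell
# Driver-level check: the card should be listed along with driver version
# and memory; if this fails, no application will see the GPU.
nvidia-smi

# GPU-library-level check: reports what the LAMMPS GPU library detects
# (device name, compute capability, memory). Run it from lib/gpu after
# the library has been built.
cd lammps/lib/gpu
./nvc_get_devices
```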

2) Even when running lmp_serial/lmp_mpi with the appropriate environment
variable set so that it uses 12 threads (I have an old 12-core Xeon), the
processor shows only ~10% utilization. On the other hand, LAMMPS reports
~99% CPU usage. Which of these should I trust, and am I not optimizing my
system enough?

Impossible to say with such limited information. You are most likely not
using styles that are thread-enabled, or you have not compiled LAMMPS
correctly for that.
Please also note that for an MD code like LAMMPS, the MPI parallelization
is, by construction, usually more efficient than multi-threading, until you
saturate the memory or communication infrastructure with message-passing
data. E.g., on a dual-socket 6-core Xeon box, LAMMPS is often most
efficient with 4 MPI tasks per node plus 3 threads each, or 6 MPI tasks
plus 2 threads each. Again, the best choice depends a lot on your system
and your hardware, so there is no simple "do this, not that" type of
advice. Benchmarking is the best way to find out. And I have to repeat:
before worrying about performance, worry about the science. A fast-running
simulation that produces garbage results is useless.
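
For illustration, the hybrid MPI+OpenMP layouts mentioned above would be
launched like this with the thread-enabled (USER-OMP) styles; the rank and
thread counts are just the dual-socket 6-core example, so benchmark to
find your own sweet spot:

```shell
# 4 MPI ranks x 3 OpenMP threads each = 12 cores in total.
# -sf omp applies the omp suffix to styles; -pk omp 3 sets 3 threads.
export OMP_NUM_THREADS=3
mpirun -np 4 ./lmp_mpi -sf omp -pk omp 3 -in in.myscript

# Alternative layout: 6 MPI ranks x 2 threads each.
export OMP_NUM_THREADS=2
mpirun -np 6 ./lmp_mpi -sf omp -pk omp 2 -in in.myscript
```

Comparing the "Loop time" lines in the resulting log files for a short run
is the simplest way to pick between such layouts.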

3) How would I run LAMMPS at the granular level, e.g., work with groups of
atoms (instead of individual ones) in order to simulate grain boundaries?
I've also heard that Voronoi tessellation is necessary in order to achieve
PLA woven in between perovskite crystals but I'm not sure if that would be
the best way to approach it.

I cannot make any sense of this question. Overall, it looks to me like you
need a *lot* of help from your adviser/supervisor and should work on the
science first. You seem very eager to move on to advanced issues regarding
your simulations, but it looks like you are skipping over far too many
basic skills and exercises that you should be doing to learn the tool
(i.e., simulation) properly before applying it to what appears to be a
quite challenging and complex simulation task. Could it be that you are
underestimating the difficulty of performing good simulations? Running
simulations is the easiest part. Planning them so that you can extract
useful and dependable results from the analysis is what makes a good MD
study.

Axel.

Thank you for the detailed reply. In that case, I will flesh out the science first before worrying about the performance of the simulation.

Thank you Dr. Kohlmeyer for the advice.

I strongly recommend you try some test drives. We ran CPU benchmarks of our “typical jobs”, then went to two different vendors and tried out their GPUs. It was a totally positive experience with both setups.

https://exxactcorp.com/testdrive/

https://www.microway.com/take-a-test-drive/

-Henk

Thanks for the tip again!