LAMMPS-USERS: optimal configuration of HPC for running LAMMPS

Dear LAMMPS users,

We want to purchase a high performance computing cluster for running LAMMPS with GPUs. Our simulation system consists of nearly 70,000 atoms. Can someone shed some light on which of the latest processors suit LAMMPS, and what the optimal configuration of an HPC system with GPUs would be within $20,000 USD?

Thanks in advance.

as you may guess, people like me get asked questions like yours a
*LOT*. and it is quite difficult to give good advice, since good
advice needs to be tailored specifically to your particular needs,
local expertise, and environment. it requires doing research and
spending time and effort checking on hardware developments. while
there are no really big revolutionary changes, the hardware market
sees frequent incremental changes at a pace that makes a recommended
configuration nearly obsolete by the time you make the recommendation.
especially for people on a small budget, there is also the risk
factor: with little money, you cannot afford to take risks, or you
lose it all. that means you need tried and tested hardware
configurations and must not jump on the "latest greatest" bandwagon
that vendors keep pushing forward all the time.

in short, nobody can afford to give you a recommendation for an
_optimal_ configuration. such configurations need to be custom
tailored to your needs, and - like a custom tailored suit or dress -
that is expensive and time consuming to do. remember, each time you
add a degree of freedom (CPU, RAM, network, GPU, storage), the number
of possible permutations grows larger, and you cannot optimize each
dimension independently.

thus, here are some general comments and questions that will hopefully
help you find a suitable answer yourself.

- a budget of $20,000 is *very* small for building an HPC cluster. the
basic infrastructure for an HPC cluster, i.e. the hardware you need
but don't run calculations on, can easily consume up to half of that.
are you willing to give away that much money up front?

- systems of 70,000 atoms are not very large, so you won't get much
speedup across multiple nodes unless you have a very expensive
high-speed interconnect, which doesn't make much sense on a very small
budget.

- HPC clusters will need to be configured, housed in suitable
facilities (racks, power, cooling), run by people with proper
knowledge. do you have people with those skills on hand?

- how about your local experience with GPUs? did you do some
benchmarks? do the potentials you want to run properly support GPU
acceleration? vendors usually only support configurations with (very
expensive) dedicated "HPC GPUs", e.g. NVIDIA Tesla. running with
consumer grade GPUs is possible, but requires suitable cases and power
supplies. you also need the expertise to spec, configure, set up, and
operate those correctly, plus a contingency plan for hardware
failures, as warranties are typically more limited on consumer grade
hardware, and some of it is not designed to run under full load 24/7.
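
a cheap way to answer the benchmark question is with the small inputs
that ship with LAMMPS (e.g. bench/in.lj) and the command-line switches
of the GPU package. a minimal sketch, assuming a GPU-enabled LAMMPS
binary named lmp:

  # CPU-only reference run on 4 MPI ranks
  mpirun -np 4 lmp -in bench/in.lj -log log.cpu

  # same input offloaded to one GPU
  # (-sf gpu selects the /gpu styles, -pk gpu 1 uses one GPU)
  mpirun -np 4 lmp -sf gpu -pk gpu 1 -in bench/in.lj -log log.gpu

  # compare the timings
  grep 'Loop time' log.cpu log.gpu

and if the pair style you actually need has no /gpu variant listed in
the LAMMPS documentation, that alone answers the acceleration question
for your system.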

- consumer grade GPUs require use of single or mixed precision
floating point math or else your acceleration will be very limited.
this means that some operations have a larger error than when running
with CPUs. this becomes particularly noticeable when computing
stresses, which are very sensitive to floating point accuracy. so if
you need to do a lot of pressure/stress computation or run frequently
with fix npt or fix press/berendsen, using GPUs in single or mixed
precision may prove more troublesome than running on CPUs.
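
you can quantify this for your own system by running the same input
once with a double-precision CPU build and once with a GPU build
compiled for mixed (or single) precision, then comparing the pressure
in the thermo output. a minimal sketch; in.npt stands for a
hypothetical input of yours, and the GPU precision is a compile-time
choice (e.g. cmake -D GPU_PREC=mixed):

  # in the input script: print pressure often so differences are visible
  thermo          100
  thermo_style    custom step temp press etotal

  # then run both builds on the identical input and compare:
  #   mpirun -np 8 lmp -in in.npt -log log.double
  #   mpirun -np 8 lmp -sf gpu -pk gpu 1 -in in.npt -log log.mixed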

- keep in mind that accelerator devices will usually have to be shared
across multiple CPU cores, which will limit the available acceleration
capacity per CPU core. e.g. with 2 high-end GPUs for a 20 core node,
you may get less than a 2x speedup from the GPUs compared to running
on the CPU cores only.
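
you can measure that sharing penalty directly by fixing the number of
MPI ranks and toggling the GPUs, e.g. on a hypothetical 20-core node
with 2 GPUs:

  # all 20 cores, CPU only
  mpirun -np 20 lmp -in bench/in.lj -log log.cpu20

  # the same 20 ranks sharing 2 GPUs
  mpirun -np 20 lmp -sf gpu -pk gpu 2 -in bench/in.lj -log log.gpu20

  grep 'Loop time' log.cpu20 log.gpu20

if the ratio of the two loop times is well below 2x, the GPUs are
oversubscribed for that workload.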

- when getting a quote for a cluster, also get quotes for alternative
approaches (possibly also from multiple vendors): check out getting
several (dual-socket) workstations instead (with and without GPUs),
check how many consumer grade "gaming" PCs you could get, and compare.
keep in mind that consumer grade hardware typically needs
significantly more maintenance effort and has more hardware failures,
so you must not only compare the capability of the hardware, but also
resilience, potential downtime, and maintenance effort.

- keep in mind that vendors prefer to push next-generation hardware
(even if there is no benefit for you) and the most extreme
configurations (highest clock, most cores, cheapest components). the
optimum is usually at a point where things are well balanced.
unfortunately, the extremely confusing and huge number of hardware
variants makes finding a good combination harder rather than easier.
add to that the fact that some manufacturers arbitrarily cripple
certain hardware in order to make more extreme, higher-margin hardware
more attractive, and you'll see why finding an optimal hardware
configuration is such a large effort.

axel.

A great example of the mail list providing useful info.
Awesome answer with a lot of detail …

Steve

One item not included in Axel's excellent response is power consumption.
Clusters of servers tend to consume a large amount of power so, given the
rapidly increasing costs of electricity, you need to pay attention to power
consumption. That will also factor into the electrical provisioning for
your cluster.

You can't just plug everything into a 15 amp wall socket. You will need
some type of Uninterruptible Power Supply (UPS) and Power Distribution
Unit (PDU).
For a small cluster (around 10 servers or so) you will probably need at
least one 30 amp, dual pole service that is an independent circuit. So, you
need to factor in the installation of the proper electrical service as well.
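
As a rough, illustrative estimate (the ~400 W per server is an
assumption; actual draw varies widely, and GPU nodes can pull two to
three times that): 10 servers x 400 W = 4,000 W. On a 240 V dual pole
circuit that is 4,000 W / 240 V, or about 17 A of continuous draw.
Since continuous loads are usually limited to 80% of a breaker's
rating, a 30 amp circuit provides about 24 usable amps, so one such
circuit covers a 10-node CPU cluster with some headroom, but likely
not once GPUs are added.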

Jim

James Kress Ph.D., President
The KressWorks® Institute
An IRS Approved 501(c)(3) Charitable, Nonprofit Corporation
"ENGINEERING THE CURE" ©
(248) 573-5499

Learn More and Donate At:
Website: http://www.kressworks.org
Facebook: https://www.facebook.com/KressWorks.Institute/

Whenever a question like this appears in the GROMACS list, this link comes up in the discussion:

https://www.mpibpc.mpg.de/15070156/Kutzner_2015_JCC.pdf

Is there any similar work for LAMMPS?

[ ]'s

have you looked at how many different kinds of potentials LAMMPS
supports, with and without GPU acceleration? have you considered how
much wider a range of system sizes LAMMPS supports, and how many more
features and flags LAMMPS has that affect performance?

furthermore, none of these benchmark comparisons usually factors in
the (ever growing) computational cost of (ever more complex)
post-processing, which LAMMPS can often do in parallel during the
simulation. that may also impact parallel efficiency and GPU speedup,
but it reduces overall time and storage requirements.
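
as a concrete example of such in-situ analysis, a radial distribution
function can be accumulated while the simulation runs instead of
post-processing large dump files; the syntax below follows the LAMMPS
compute rdf documentation:

  # accumulate an RDF with 100 bins during the run
  compute  myRDF all rdf 100
  # time-average it and write it to a file every 100 steps
  fix      rdfout all ave/time 100 1 100 c_myRDF[*] file tmp.rdf mode vector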

in short, there are some general rules that can be applied, and there
are some benchmark data posted on the LAMMPS homepage, but a thorough
"buy-this-not-that" study would be a fool's errand. and due to the
many degrees of freedom, by the time you have some comprehensive data,
it will be obsolete.

with current generation hardware, *transferable* comparative
benchmarking is essentially impossible. just consider turbo boost, for
example: if you buy a 2.8 GHz CPU, you in fact get an 800 MHz to
3.5 GHz CPU, and you'll see different clock rates depending on the
number of concurrent processes, the load on the CPU cache, and the
efficiency of vectorization. depending on the individual simulation,
it may even be more efficient not to use all CPU cores.
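
the only reliable way to find the sweet spot on a given machine is to
scan it, e.g. with a small shell loop over core counts (bench/in.lj is
the bundled LJ benchmark; substitute your own input):

  for n in 1 2 4 8 12 16 20; do
      mpirun -np $n lmp -in bench/in.lj -log log.cores.$n
  done
  grep 'Loop time' log.cores.*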

but let me make a more provocative statement: if you *really* want to
get the most bang for the buck, don't worry so much about the
hardware; instead, make your users and developers smarter! train them
better! give them time and encourage them to learn things properly!
train them to always run some tests to identify decent performance
settings.

there is *by far* more money wasted by inexperienced and
insufficiently trained people running superfluous, badly designed, or
generally inefficient simulations. huge amounts of resources are
wasted because people don't care about or respect the effort it takes
to do things well. just have a look at the distribution of questions
asked here, and factor in that for every person with the courage to
ask, there are probably 20 more subscribers sitting there silently,
hoping somebody will answer the questions they have had on their minds
for a long time.

or let me express this differently: if you want to get good
performance, *you* (i.e. every individual) will have to work for it.
and keep in mind the 80:20 rule: even investing only 20% of the effort
gets you 80% of the benefit. most people today still live in the
mindset of the last 10-15 years, where it didn't matter that your code
was inefficient, badly written, poorly parallelized, or poorly used;
you just purchased the next generation of (more powerful) hardware,
and everything was alright again. it wasn't like that before this
period, when you had to fight the OS, the hardware, the compilers, and
the limitations of the software to get anything done well. and it
doesn't apply now, when performance improvements are going "sideways"
and having the most potent hardware matters much less than how well
you can use it.

in summary, if you don't have the local expertise and training, stick
with simple things, as those are what most people will still get
right. in terms of accelerators, one has to realize that at the
current state of affairs, they are still something of a niche product.
they are most effective in desktops, where you usually cannot easily
add more compute power otherwise, and in ultra-high-end
supercomputers, which cannot easily get more Linpack FLOPS for the
same money any other way. for everything in between, they are only
moderately cost effective and only really good for a subset of
applications.

axel.