C4130 configurations


(apologies for those who got this twice, I deleted the attached white
paper pdf with technical specs in my original post, because it was too
large to be accepted as attachment on the list)

Hi Axel,

Hello people,

Our group is considering buying a quad-K80 C4130 box from Dell. It comes
in various configurations. In one, the 4 K80s communicate via a PCIe
switch. This gives faster communication between gpus, but higher latency in
gpu-cpu communication. In another configuration, each cpu has a faster
direct connection to two gpus without the PCIe switch in between. However,
communication between gpus and between a cpu and 'the other two' gpus is
slower.

Since these are big machines that cost well over $20k, I'll be lucky to get
temporary access for testing to just one, so comparing them directly is not
an option.

Can anyone estimate which configuration would be best for the gpu package
(running eam calculations)? Do gpus need to communicate between themselves
a lot? And is gpu-cpu communication mainly between one cpu and 'its' gpus
or do cpus need to communicate a lot (the same amount?) with all gpus?

peter,

can you elaborate a bit on:

- what is the motivation for going for such a machine?

Doing lammps/eam calculations involving a dozen to several dozen million
atoms. In some cases the systems are very sparsely and irregularly
populated with material (< 15% of the volume), which makes using large
numbers of cpu cores very inefficient: in the limit of very many cores,
85% of them would be wasting their time on subboxes with no atoms in them.

LAMMPS has the recursive bisectioning communication style to alleviate
this, and using MPI plus OpenMP (the multi-threading is parallelized over
atoms, not space) in combination with load balancing can also reduce the
load imbalance. while your assessment that GPUs are even less impacted by
this is correct, your view of the CPU side isn't.
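
a minimal sketch of what that can look like in an input deck, assuming the
USER-OMP package and load balancing are compiled in (the fix ID, re-balance
interval, imbalance threshold, and thread count are just placeholder values;
package/suffix go near the top of the script, balance after the atoms exist):

  package     omp 4                         # 4 OpenMP threads per MPI rank (USER-OMP)
  suffix      omp                           # use the /omp style variants where available
  comm_style  tiled                         # allow non-brick (RCB-style) domain tiling
  balance     1.1 rcb                       # one-time recursive coordinate bisectioning
  fix         lb all balance 1000 1.1 rcb   # re-balance every 1000 timesteps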

- is it going to be a single machine? or multiple connected via high-speed network?

The C4130 is a single 1U box that can house 2 or 4 K80s and 1 or 2 cpus
(it's a pretty compact machine). So no latency at all due to communication
between nodes.

communication latency is really only a major concern when you are running
with a rather small number of atoms per node. for your budget and problem
size, bandwidth is much more important (although plain TCP/IP latency would
indeed be too large).

- what is the expected use? (single GPU runs, runs across all GPUs, mixed use)

Mixed. For probably 2/3 of the time, the machine would run 2-4 jobs, with
1 or 2 K80s per job. About 1/3 of the time, the machine would run a single
big job with as many atoms as would fit into the total gpu ram of 96 GB.

i would be very concerned about this usage pattern. it is awfully
complicated to figure out how to properly organize (optimal) access to
specific CPUs and GPUs for multiple concurrent jobs. it is usually
straightforward for a 1-job, 1-cpu (core), 1-gpu scenario (e.g. a code that
runs practically completely on the gpu) or for a 1-user, 1-node setup. in
mixed mode, chances are high that things go wrong and then your jobs run
*much* slower. this is another advantage of cpu-only hardware: even though
it is slower, it is easier to partition and harder to make mistakes that
hurt performance as massively.
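
if you do end up with that mixed-use pattern, here is a minimal sketch of
how one of two concurrent jobs could be confined to one socket and the two
dies of 'its' K80 (the device ids, rank count, and potential file are just
example values; check the actual topology first, e.g. with nvidia-smi topo -m):

  # launch (Open MPI syntax), pinned to NUMA node 0 and GPU devices 0,1:
  #   mpirun -np 6 -x CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 --membind=0 \
  #          lmp -in in.job1
  package     gpu 2                # use the 2 visible devices (one K80)
  suffix      gpu                  # eam -> eam/gpu, etc.
  pair_style  eam
  pair_coeff  1 1 Cu_u3.eam        # the stock LAMMPS Cu EAM benchmark potential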

- what is the system size that you expect to be running most of the time?

When running multiple jobs, 5-15 million atoms. When running a single job,
as much as will fit into those 96 GB of gpu ram. Based on extrapolations
from smaller gpu cards, some 40-50 million atoms maximum.

- is the machine supposed to be (significantly) used by other applications?

No, the only use for it so far is lammps/eam calculations, but that is the
workhorse code for a number of people in our computational materials
science group.

have you benchmarked HOOMD? it should support eam as well and is by
construction more suited (and more efficient) for the kind of use scenario
you describe. it may not have as many features as LAMMPS, but it may have
enough for your needs.

- what kind of CPUs are you looking at for this? e.g. what is the ratio of CPU cores vs. GPUs?

It can be configured with 2 Xeon E5-2690 v3 family cpus, so 6 or 8 cpu
cores to one K80 is possible. K80s are essentially 2 gpus on one board, so
a 3:1 or 4:1 ratio of cpu cores to gpus.

at 3:1 you should be close to optimal utilization.
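
as a concrete (made-up) example of that ratio on the full box: the 4 K80s
show up as 8 CUDA devices, so something like 24 MPI ranks gives 3 ranks per
device, and the GPU package shares each device among the ranks assigned to it:

  # single big job across all 4 K80s (8 GPU dies), 3 MPI ranks per die:
  #   mpirun -np 24 lmp -in in.big
  package     gpu 8                # the 24 ranks share the 8 devices, 3 per device
  suffix      gpu                  # switch pair styles to their /gpu variants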

also, what is missing in your review of the situation is a discussion of the
communication bandwidth (CPU vs. GPU) when all GPUs are used concurrently.

In the configuration with the internal PCIe switch, each card has an x16
PCIe link to the switch. The PCIe switch connects to one of the two CPUs,
and the two CPUs communicate with each other via QPI.

In the configuration without the internal PCIe switch, each cpu has a
direct connection to each of two K80s, without going through a switch. That
gives lower latency to the two cards to which that one cpu is connected,
but I presume higher latency to the other two (as well as higher latency
between gpus).

those extra PCIe switches are almost always a mistake. remember that for
concurrent operation (exactly what you want to do) the bandwidth has to be
shared. you already have a switch inside each K80, where its two GPUs have
to share the PCIe bandwidth (and thus effectively run with 8 lanes each
instead of 16). with the extra switch you break this down even further: if
all four K80s (i.e. eight GPU dies) sit behind a single x16 uplink to one
CPU, each die is left with roughly 2 lanes' worth of host bandwidth when
everything is busy. the switchless configuration is a must for your usage
scenario, and you have to make sure that you run either 1 or 2 jobs on the
node and that those are confined to the CPU that is driving the respective
GPUs. keep in mind that GPU-to-GPU transfers don't happen in LAMMPS with
the current code; what you need to make certain of is that the hardware
doesn't throttle the many concurrent CPU-to-GPU communications.

I attach the diagram for the two configurations from the Dell pdf:
configurations.jpg

there likely isn't a single simple answer to the question of whether it is
worth going with the C4130 box(es). as for many problems in science, the
answer will be "it depends".

I had feared that the answer might not be so simple. Still, looking at the
diagrams above of the 2 possible configurations I'm considering, the
question of which would be best depends to a large degree on how much the
gpus need to communicate directly. Is there a general insight into whether
gpus communicate mostly with cpus (and maybe even with just one cpu rather
than with both of them), or do they also do a lot of communication directly
with each other?

with the GPU package you have a simple offloading scheme, so communication
is only between the GPU and the driving process on the CPU.
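
for reference, the knobs that control that offload look roughly like this
(the values shown are only illustrative, not tuned recommendations):

  package  gpu 2 neigh yes split 1.0   # build neighbor lists on the GPU and
                                       # compute the full pair interaction there
  suffix   gpu                         # eam -> eam/gpu; fixes, bonded terms and
                                       # output stay on the CPU, so data only moves
                                       # between an MPI rank and the GPU it drives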

i understand the overall benefit of having a single specialized machine,
but because of the difficulties in partitioning it, you should also
consider a scenario of purchasing a collection of gaming workstations with
top-level consumer GPUs (there are usually 1 or 2 models that are
practically equivalent to Tesla models in terms of performance and double
precision support). in this kind of setup, you won't have more than 2 GPUs
per job, but you'll be able to afford more GPUs in total and thus support
more concurrent jobs. of course, maintaining consumer grade hardware under
constant high load is a bit trickier and requires more skill and
experience, but the benefits of tesla hardware in terms of operating and
maintaining it show up much more at (much) larger deployments.

i think in the end you will have to look as much at the kind of skill set
and manpower a possible solution will require as at the hardware itself.
all-cpu solutions have the benefit of requiring the least skill to operate
and being uncritical to use, but may require more effort from users in
setting up simulations. the one-big-integrated-box solution is attractive
from the management perspective, but could require more skill to set up and
utilize effectively, and it may not be perfectly suited for your specific
use scenario. so you basically have to choose between low-risk, low-reward
and high-risk, high-reward. building a pile of boxes with consumer grade
hardware is somewhere in the middle, but requires the most manpower and is
the least convenient to maintain.

for the sake of completeness, there is one more option: you can also get a
workstation class (and size) machine that supports 4 GPUs (when using two
CPUs), which you could then fill with consumer grade top-level GPUs. that
would allow you to run a larger system than a gaming class desktop with a
single cpu, since a dual-cpu mainboard doubles the number of full bandwidth
PCIe x16 slots.

in summary, none of the options is ideal, and the extra-PCIe-switch
configuration is the only one i would rule out completely.

axel.
