C4130 configurations

Hello people,

Our group is considering buying a quad-K80 C4130 box from Dell. It comes in various configurations. In one, the 4 K80s communicate via a PCIe switch. This gives faster communication between GPUs, but higher latency in GPU-CPU communication. In another configuration, each CPU has a faster direct connection to two GPUs without a PCIe switch in between. However, communication between GPUs, and between a CPU and 'the other two' GPUs, is slower.
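(Side note: whether two GPUs can talk to each other directly at all in a given topology can be checked by asking the CUDA runtime about peer access. A minimal sketch, assuming a standard CUDA toolkit install; device numbering is machine-dependent:)

/* p2p_check.cu - report which GPU pairs support direct peer-to-peer access.
 * compile: nvcc -o p2p_check p2p_check.cu
 * illustrative sketch only, no error checking. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    printf("found %d CUDA devices\n", ndev);

    for (int i = 0; i < ndev; ++i) {
        for (int j = 0; j < ndev; ++j) {
            if (i == j) continue;
            int can = 0;
            /* non-zero means GPU i can access GPU j's memory directly,
             * i.e. transfers can bypass the host (typically only for GPUs
             * behind a common PCIe switch/root complex, not across sockets) */
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n",
                   i, j, can ? "possible" : "NOT possible");
        }
    }
    return 0;
}

Running that on a test box would at least show directly which pairs can bypass the host.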

Since these are big machines that cost well over $20k, I'll be lucky to get temporary testing access to just one, so comparing the configurations directly is not an option.

Can anyone estimate which configuration would be best for the GPU package (running EAM calculations)? Do GPUs need to communicate among themselves a lot? And is GPU-CPU communication mainly between one CPU and 'its' GPUs, or do CPUs need to communicate a lot (the same amount?) with all GPUs?

greets,
Peter

peter,

can you elaborate a bit on:

- what is the motivation for going for such a machine?
- is it going to be a single machine? or multiple connected via
high-speed network?
- what is the expected use? (single GPU runs, runs across all GPUs, mixed use)
- what is the system size that you expect to be running most of the time?
- is the machine supposed to be (significantly) used by other applications?
- what kind of CPUs are you looking at for this? e.g. what is the
ratio of CPU cores vs. GPUs?

also, missing from your review of the situation is a discussion of the
communication bandwidth between CPU and GPU when all GPUs are used
concurrently.
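if you do get temporary access to one box, a crude way to probe this is to time host-to-device copies to all GPUs at once and compare with a run restricted to a single GPU (e.g. via CUDA_VISIBLE_DEVICES). a minimal sketch, not a proper benchmark; buffer size and the device limit are just placeholders:

/* bw_concurrent.cu - rough aggregate host-to-device bandwidth when all
 * GPUs copy at the same time.  compile: nvcc -o bw bw_concurrent.cu
 * illustrative sketch only; no error checking, sizes kept simple. */
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

#define NBYTES (256UL * 1024 * 1024)   /* 256 MiB per GPU */
#define MAXDEV 16

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > MAXDEV) ndev = MAXDEV;

    void *hbuf[MAXDEV], *dbuf[MAXDEV];
    cudaStream_t stream[MAXDEV];

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMallocHost(&hbuf[i], NBYTES);   /* pinned host memory */
        cudaMalloc(&dbuf[i], NBYTES);
        cudaStreamCreate(&stream[i]);
    }

    /* start all copies back-to-back so they overlap, then wait for all */
    double t0 = now();
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMemcpyAsync(dbuf[i], hbuf[i], NBYTES,
                        cudaMemcpyHostToDevice, stream[i]);
    }
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }
    double t1 = now();

    double gib = (double)NBYTES * ndev / (1024.0 * 1024.0 * 1024.0);
    printf("%d GPUs: %.1f GiB total in %.3f s -> %.2f GiB/s aggregate\n",
           ndev, gib, t1 - t0, gib / (t1 - t0));
    return 0;
}

if several GPUs have to share one uplink to the host (as behind a PCIe switch), the aggregate number will show it; with the direct-attach layout each K80 has its own link to one of the CPUs and should scale better.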

in general, i would caution against such extreme "GPU monster"
configurations. anything extreme is extremely good only under extreme
circumstances. :wink:
in the cases where i see this kind of configuration justified, the
alternatives are usually not viable due to a lack of technical skills.
also, at the price tag of such a machine in a reasonable configuration,
you can often get within a factor of 2 of its total LAMMPS performance
with a small cluster of CPU-only nodes connected through a small
infiniband switch. for the best utilization of such a GPU-centric
machine, you are probably better off using a GPU-focused code like
HOOMD (but then you can get the same throughput from a bunch of
standard PCs with high-end consumer-grade GPUs; they are just more
work to operate and maintain).

there likely isn't a single simple answer to the question of whether it
is worth going with the C4130 box(es). as with many problems in
science, the answer will be "it depends".

axel.