advice for buying NVIDIA Tesla workstation

Dear colleagues,

as part of a larger project proposal (deadline January 15th), we have decided to buy a workstation based on (probably 2) NVIDIA Tesla (K80 or P100) cards for MD simulations. We plan to use LAMMPS for MD simulations of defect production due to energetic ion irradiation (collision cascades, thermal spikes). I would appreciate advice on how to configure the workstation for this (apologies for such a newbie question). Given a budget of roughly 15,000-20,000 USD, here are my questions:

  1. K80 or P100? Two cards or four? I would like to keep the option of future upgrades (additional cards) open.

  2. Which Xeon processor is the best buy (number of cores, cache size, how many processors per GPU card)?

  3. Other issues: how much RAM? Would 64-96 GB be OK? HDD or SSD?

  4. I would prefer a tower instead of a rack; can it be placed inside an office (noise, temperature, power consumption)?

  5. any other issue? what are possible bottlenecks?

These are my questions, but any other advice is also most welcome.

Thanks, Marco

Marco,

One important factor to consider is whether the particular pair, fix, and compute styles you require for your cascade simulations are well accelerated on GPUs, via either the KOKKOS or GPU package. For some pair styles (e.g. class2) the acceleration on GPUs is quite poor and you are better off running these potentials (or force fields) on CPUs. On the other hand, short-range potentials accelerate quite well on GPUs. If you haven't benchmarked your entire input deck on a K80 or P100, I'd suggest you do so before committing the $20K.
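For reference, a first benchmark does not require touching the physics in your input deck; the accelerator packages can be switched on from the command line or with two extra lines in the deck. A sketch (the GPU counts and input file name are placeholders; check the LAMMPS manual for the exact package options):

```
# command line, GPU package, 4 MPI ranks sharing 2 GPUs (placeholder paths):
#   mpirun -np 4 lmp -sf gpu -pk gpu 2 -in in.cascade
# command line, KOKKOS package:
#   mpirun -np 2 lmp -k on g 2 -sf kk -in in.cascade
# or, equivalently, inside the input deck:
package gpu 2 neigh yes
suffix gpu
```

Comparing the wall time of the same deck with and without these switches gives you the speedup number before you commit to hardware.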

Replies to your other Qs:

  1. A K80 and a P100 cost roughly $4k and $6k, respectively. But the P100 is not available for tower mount until Q2 2017; if you insist on a tower mount NOW, the K80 is your only option. Speaking of performance, the P100 is usually 2x faster than the K80 for most potentials in LAMMPS.

  2. If your pair, fix and compute styles are well accelerated on GPUs, then you don’t need a powerful CPU – in this case a 4-8 core Haswell is sufficient. If not, then you need to consider more powerful CPUs. Both Kokkos and GPU packages support multiple MPI tasks per GPU card.

  3. I usually consider 2-4 GB of RAM per CPU core. The smallest amount of RAM I recommend is 32 GB. HDD vs. SSD doesn't really matter, unless your calculations are I/O intensive.

  4. Racks are cheaper, and the P100 is available for rackmount. Towers are better suited to small offices. That is for you to consider.

  5. The major risk is not having a very clear sense of the speedup you can obtain on various GPU cards for your simulation. You don't want to spend $20K on a workstation with the latest NVIDIA cards only to find out your input deck does not accelerate well on GPUs.

Dear colleagues,

as part of a larger project proposal (deadline January 15th), we have
decided to buy a workstation based on (probably 2) NVIDIA Tesla (K80 or P100)
cards for MD simulations. We plan to use LAMMPS for MD simulations of defect
production due to energetic ion irradiation (collision cascades, thermal
spikes). I would appreciate advice on how to configure the workstation for
this (apologies for such a newbie question). Given a budget of roughly
15,000-20,000 USD, here are my questions:

1. K80 or P100? Two cards or four? I would like to keep open option of
future upgrades (additional cards)?

the major limitation is the mainboard and the power supply. there are
mainboards that can host many GPUs, but those use PCIe bridges, and that
can have a significant negative impact on performance if
host-to-GPU communication overlaps. i recommend against planning
to add more cards: you would be paying for something you don't use. better
to replace hardware later or just buy additional machines. given the
quite high cost of tesla GPUs, the cost of the host machine is a
lesser concern.

2. which best buy for Xeon processor (number of cores, cache size, how
many processors per GPU card?)

depends quite a bit on whether everything you run will be GPU
accelerated and what else you'll be using the machine for. i would
look for CPUs with a moderate number of cores and instead a higher
clock. pay attention to memory speeds and bus speed. it may be
beneficial to have a dual socket machine, to have more bandwidth for
the PCIe bus, but with only two GPUs this is not really needed. fast
CPUs help to speed up the non-GPU-accelerated parts, and with very fast
GPUs this can be a limiting factor. however, overall good balance is
the most important thing, or else you'll be wasting money on the part of
your machine that is too extreme.

3. other issues: how much RAM? Would 64-96 GB be ok? HDD, SDD?

classical MD doesn't require much RAM, but it may be worth cranking up
the RAM for data analysis. often quick-n-dirty analysis tools are not
optimized for memory use and may require O(N**2) or worse memory, or
you may want to keep whole trajectories of a large system in RAM. for
the same reason, i would include some SSD storage (only SSDs are fast
enough to make sense as swap space these days). whether you want to go
all SSD or have some spinning disks depends on budget and storage
needs.
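to put rough numbers on the trajectory-in-RAM point, here is a
back-of-the-envelope sketch (the atom and frame counts are made-up
illustrative values, not a spec recommendation):

```python
# back-of-the-envelope estimate of the RAM needed to hold a whole
# trajectory in memory: 3 coordinates per atom per frame, stored as
# 8-byte doubles (analysis tools often keep everything resident).
def trajectory_ram_gb(n_atoms, n_frames, bytes_per_coord=8):
    return n_atoms * n_frames * 3 * bytes_per_coord / 1e9

# e.g. a 10-million-atom cascade box dumped 1000 times:
print(trajectory_ram_gb(10_000_000, 1000))  # -> 240.0 GB
```

which is why 64-96 GB disappears quickly once you start analyzing
large cascade trajectories.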

4. I would prefer a tower instead of a rack, can it be placed inside
office space (noise, temperature, power consumption)?

if you want to put such a high-end multi-GPU workstation in your
office, you'll need extra strong power outlets, cooling, and a
subscription to earplugs or noise-cancelling headphones.
under full load, such a machine can easily dissipate more heat than a
space heater and will create significant fan noise. during
summer, this can be unbearable, or the machine can crash due to
overheating, unless there is good air conditioning.

i would very much prefer to put a GPU monster machine somewhere in a
well maintained machine room, and perhaps carve out some of the budget
for a moderate size desktop workstation with good graphics (and lots
of RAM, see above) as a frontend.

5. any other issue? what are possible bottlenecks?

for the budget, you should also consider (and ideally benchmark)
getting a set of 4 (or more) dual socket CPU-only workstations and a
small infiniband "pocket switch" or a couple of 4-socket machines
which can be connected directly with a single infiniband cable. while
this is not competitive with a well-balanced GPU workstation for
applications that can make good use of GPUs, it is a killer setup for
applications that don't accelerate well, and it may also give good
throughput for many small calculations. given the increase in CPU
cores over recent years and the quite high prices for HPC GPUs, the
cost differential is not as large as you may think.

you also need to keep in mind that, to get good acceleration from
GPUs, your system has to have a certain size, so that there are
sufficient work units to keep the many GPU cores busy. the smaller the
system and the more GPUs you want to use at the same time, the smaller
the effective performance increase of GPUs over a CPU-only
configuration.

axel.

Dear Axel and Ray,

thanks a lot for your emails. Here are a few clarifications from my side:

1. it will be a rack (not tower) configuration.

2. I understand well that benchmarking is the only way to see the amount of acceleration. But at the moment, with the deadline so close, I only need to have some sensible configuration on paper to set aside the money, and when we go shopping (hopefully around next Christmas) we'll know better what exactly we need.

3. The simulation size will be large, because a high-energy (MeV) ion impact affects a lot of atoms.

4. Besides running LAMMPS, this machine will probably do other things like DFT calculations (Quantum ESPRESSO supports NVIDIA Tesla). I have also seen MATLAB support for parallel computing.

Now for a few more questions:

1. As Axel suggested, going for a CPU-only workstation also makes sense. I guess it is also easier to run LAMMPS on a machine like this. Could you please specify such a system (CPU, memory) in more detail?

2. Would it make sense to build some kind of hybrid CPU/GPU configuration, like 4xCPU + 1xGPU? Even only one node, if two nodes are out of the budget? What would be a sensible configuration in that case?

3. Axel also suggested a frontend desktop workstation (with good graphics and lots of RAM). Could you please also provide more details about this?

I will really appreciate your advice.
Best regards,
Marco

Hi Marco,

A few opinions (if this is a workstation for general use / graphics display, possibly in your office)...

1) If you plan to do DFT, buy as much RAM as you can afford and/or get SSDs for scratch space (Axel mentioned this). The scratch files for large DFT calculations can be huge (hundreds to thousands of GB), and not having them on a spinning disk can really help.

2) For displays, graphics, and VMD CUDA acceleration/rendering, it makes sense to use a separate GPU/video card from your compute GPUs. A high-end graphics card (as opposed to a dedicated compute GPU) works well for this... even a consumer model will work.

3) I have found that one GPU per socket seems to be the sweet spot for many things. If I were buying, I would try for 2 Xeon processors, 2 GPUs, and 1 graphics card.

4) If this is in your office, be aware of noise: this machine has the potential to sound like a small jet. You need to keep things cool, but many large fans in the chassis running at lower RPM can provide cooling equivalent to tiny fans at high speed, and will be much quieter. I would look for a chassis that accommodates larger fans.

Best regards, Spence

marco,

please understand that giving more specific advice than already given
requires significant research effort and more knowledge about the
details of what you want to do. e.g. adding quantum espresso and
matlab to the mix makes things *much* more complicated. for QE the
possible GPU acceleration is rather moderate compared to what you get
with classical MD. MATLAB is an entirely different animal. the fact
that some code advertises GPU support or parallel computing ability
doesn't say anything about how well this works (i see rather poor
scaling with MATLAB's parallel programming toolkit on our machines,
but that may be due to the clumsy implementation as well as users not
making good use of it).

below are some quick answers. i don't have the time and desire to dig
deeper into this, since i just completed a large scale HPC hardware
procurement (which took the better part of two months to explore,
research, tweak and negotiate with vendors to find the optimal
configuration for the diverse purposes of our users).

Dear Axel and Ray,

thanks a lot for your emails. Here are a few clarifications from my side:

1. it will be a rack (not tower) configuration.

2. I understand well that benchmarking is the only way to see the amount of acceleration. But at the moment, with the deadline so close, I only need to have some sensible configuration on paper to set aside the money, and when we go shopping (hopefully around next Christmas) we'll know better what exactly we need.

then just put "something approximately reasonable" in the proposal and
be done with it. most funding agencies will not check for details and
it is quite convincing to claim that good hardware configurations
change over the course of a year. there's no point wasting your time
optimizing a configuration that you won't buy right away, even more
so if you don't even know exactly what you will be doing, what your
options are, and how well individual software packages will perform
for your specific needs. performance (and GPU support) can change
with the smallest of details.

3. The simulation size will be large, because a high-energy (MeV) ion impact affects a lot of atoms.

"big" is not a good descriptor. you can check with published
benchmarks where GPUs scale out. as a rule of thumb from some time
ago, you should not have fewer than 10,000 atoms per GPU for good
utilization with most MD codes. however, the point of best
utilization changes a lot with the implementation (e.g. between the
GPU and KOKKOS packages in LAMMPS).
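as a concrete illustration of that rule of thumb (hypothetical
numbers: bcc iron, a common cascade test case, in a 50 nm cube split
over 2 GPUs):

```python
# rough atoms-per-GPU check for a cascade box. bcc iron: 2 atoms per
# unit cell, lattice constant ~0.2866 nm. the 10,000 atoms/GPU figure
# is the rule-of-thumb utilization threshold quoted above.
def atoms_in_bcc_cube(edge_nm, lattice_nm=0.2866):
    cells = (edge_nm / lattice_nm) ** 3
    return int(cells * 2)  # 2 atoms per bcc unit cell

n_atoms = atoms_in_bcc_cube(50.0)   # roughly 10 million atoms
per_gpu = n_atoms // 2              # split over 2 GPUs
print(per_gpu > 10_000)             # True: well above the threshold
```

so for cascade boxes of that size, keeping two GPUs busy is not the
problem; very small systems are.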

4. Besides running LAMMPS, this machine will probably do other things like DFT calculations (Quantum ESPRESSO supports NVIDIA Tesla). I have also seen MATLAB support for parallel computing.

Now for few more questions:

1. As Axel suggested, going for a CPU-only workstation also makes sense. I guess it is also easier to run LAMMPS on a machine like this. Could you please specify such a system (CPU, memory) in more detail?

there are far too many options for all components on the market and
there is no "one good solution(tm)" to recommend (like there was
20-30 years ago). finding the optimal solution for a small budget is
different than for a large budget, and the distribution of use cases
and the preference between capability and throughput also matter.

2. Would it make sense to build some kind of hybrid CPU/GPU configuration, like 4xCPU + 1xGPU? Even only one node, if two nodes are out of the budget? What would be a sensible configuration in that case?

unlikely. there are usage scenarios where this makes sense, but those
are unusual. compromises have the bad side effect that they combine
not only the benefits of multiple solutions but also the
shortcomings. keep in mind that pretty much any GPU machine these
days is a "hybrid" machine. as mentioned before, the main imperative
for GPU machines should be balance, not compromise.

3. Axel also suggested a frontend desktop workstation (with good graphics and lots of RAM). Could you please also provide more details about this?

again, too many options to choose from. if i had to buy a machine for
myself, i would more-or-less know what to get (or rather, certain key
specs not to miss), but that is because i know my typical workflows
and needs very well, and where the current performance bottlenecks
are. for that reason (and a generous budget), my current desktop
machine has given me good service for over seven (7!) years now with
only the occasional upgrade/tweak (GPU, HD/SSD).

axel.

Dear Axel and Spencer,

thanks once more for all of your emails, it is very helpful information.

With best wishes for 2017,
kind regards,
Marco