Hello,
I am looking for some general advice about a desktop workstation we plan to buy for LAMMPS simulations containing ca. 1M atoms or more.
The idea is to build a system around a Threadripper 9980X (64 cores), an NVIDIA RTX 5090 GPU, and 128 GB RAM (4x32 GB DDR5, 6400 MHz).
Any advice on what I should pay attention to, in order to avoid pitfalls when making the purchase?
Many thanks and kind regards
The most important thing is to check which acceleration packages are available for the styles in your simulation (KOKKOS, INTEL, GPU, etc.).
If KOKKOS is fully available, then GPU acceleration is likely a good idea. I would recommend a 4090 instead of a 5090, because the 5090's improvement is mostly in memory bandwidth, which is a minor factor in typical molecular simulations. In this case the CPU performance is not very important. Also note that you should use the latest version of LAMMPS and enable mixed precision for KOKKOS.
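For reference, a minimal KOKKOS GPU run command looks roughly like this (a sketch based on the LAMMPS command-line docs; `in.script` is a placeholder for your input, and the precision setting mentioned above is chosen at build time, so check the current KOKKOS package docs for that):

```
# one MPI rank driving one GPU, with the /kk suffix applied to all supported styles
mpirun -np 1 lmp -k on g 1 -sf kk -in in.script
```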
If the INTEL package is available, then the CPUs would be the main workhorse. In this case the best choice would be Intel Xeon CPUs released in recent years (e.g. Xeon Platinum 85xx or 84xx), which have solid AVX-512 support and which the Intel compiler is optimized for. Older Xeon Platinum/Gold processors are also fine (but with lower performance). Avoid AMD processors (which Intel compilers optimize poorly for) and consumer-grade CPUs (which frequently lack AVX-512 support).
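For comparison, a typical INTEL package invocation would look something like this (a sketch; the rank/thread counts are placeholders to tune for your machine):

```
# 32 MPI ranks with 2 OpenMP threads each; "0" means no coprocessor offload,
# and "mode mixed" selects the package's mixed-precision setting
mpirun -np 32 lmp -sf intel -pk intel 0 omp 2 mode mixed -in in.script
```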
In any case, I would recommend studying the acceleration options in LAMMPS carefully (it is quite complicated) and doing some performance tests on demo machines if that is possible.
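For such tests, the simplest apples-to-apples number is the performance summary LAMMPS prints at the end of every run; e.g. with the bundled Lennard-Jones benchmark (assuming a source checkout with the usual `bench/` directory):

```
cd lammps/bench
mpirun -np 16 lmp -in in.lj -log log.cpu
grep Performance: log.cpu    # reports timesteps/s for easy comparison
```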
If you want to use GPU acceleration, having a single GPU and that many (and powerful) CPU cores will result in a very unbalanced machine. If you use KOKKOS, there is no benefit from the extra CPU cores, since the data remains on the GPU as much as possible and transferring it to the CPU for non-GPU-accelerated features comes with a penalty. If you use the GPU package, there is some support for oversubscribing the GPU, but that is limited to about 4-6 MPI ranks per GPU. So you would need to build a machine with as many GPUs as possible to get significant GPU acceleration.

Also keep in mind that double-precision floating-point performance is crippled on consumer GPUs, and that building a machine with multiple GPUs is challenging because of the large amount of power to manage and the cooling it requires. Under full load such a machine will consume as much power as a space heater.
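To illustrate the oversubscription point above, a GPU package run sharing one GPU among several MPI ranks would look something like this (a sketch; the rank count is a placeholder within the 4-6 range mentioned):

```
# 6 MPI ranks sharing a single GPU via the GPU package
mpirun -np 6 lmp -sf gpu -pk gpu 1 -in in.script
```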
I use a 64-core Threadripper machine as a desktop, but it is used primarily for development work. It can compile LAMMPS from scratch in record time.
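For the curious, a from-scratch CMake build that uses all of those cores is just (a sketch; the package selection here is only an example):

```
cd lammps && mkdir build && cd build
cmake ../cmake -D PKG_KOKKOS=on -D Kokkos_ENABLE_OPENMP=on
cmake --build . -j 64    # compile on all 64 cores
```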
This is not true. You just have to be explicit about which vector instructions you want the compiler to use.
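For example (the flags I would reach for; double-check them against your compiler version's manual):

```
g++  -O3 -march=znver4          # GCC: AVX-512 on AMD Zen 4
g++  -O3 -march=skylake-avx512  # GCC: AVX-512 on Intel Skylake-SP
icpx -O3 -xCORE-AVX512          # Intel oneAPI: AVX-512 on Intel CPUs
```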
I don’t know why this rumor is being propagated. There was some anti-AMD code detected in the Intel compilers and MKL in the early 2000s, but that was a long time ago.
I once tried running sw/intel on an AMD EPYC 9654 machine, which has AVX-512 and in my tests performs similarly to Intel Xeon 84xx CPUs with a similar number of cores on quantum mechanics codes (e.g. CP2K and VASP). I tried both the Intel classic and LLVM compilers and translated the compiler arguments using the table released by AMD (https://www.amd.com/content/dam/amd/en/documents/developer/compiler-options-quick-ref-guide-amd-epyc-9xx4-series-processors.pdf). The result I got is that either the performance on AMD is far worse (~2x slower) or the binary does not run on AMD at all.
If you have test results showing that the INTEL styles can run on AMD with performance similar to Intel processors of the same class, I would be glad to hear them. I am particularly interested in runs with AVX-512, i.e. 4th-gen EPYC or later (without AVX-512 it is indeed not too difficult to modify the INTEL package code so that it enables SIMD with GCC).
Honestly, I don’t care. For the most part, the problems I deal with are of the kind where the efficiency of the MPI parallelization is dominant and the performance of individual compute kernels becomes nearly irrelevant. In that case AVX can even be detrimental to performance, since the additional power it consumes reduces the boost frequency of the CPU and thus slows down all non-vectorized operations. The AMD AVX units are implemented differently, so a 2x speed difference at similar clocks is quite possible. One has to look at the hardware details for each generation. There are too many CPU variants to make this worth investigating unless you have a large high-throughput problem.
Add to that that both the code in the INTEL package and anything compiled with Intel compilers needs to be carefully checked for missing features or miscompilation, and I have given up trying.
When I started in the business of high-performance computing and MD simulations some 30 years ago, it was worth fighting for these margins, but now I worry more about correctness and parallel scaling, since there are far more resources available relative to the complexity of the problems worth investigating.
@flywheel74 I strongly recommend benchmarking LAMMPS on the GPU you are considering before purchasing, e.g. using a cloud service or borrowing one from someone else. You want to know exactly what you are going to get and not be disappointed. Also note that not all of LAMMPS supports GPU acceleration, so you need to make sure that the styles you need do.
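A quick way to check the latter is the style listing in the help output of a binary built with the relevant packages; styles with a /gpu or /kk suffix have accelerated variants:

```
lmp -h | grep lj/cut    # e.g. lists lj/cut/gpu, lj/cut/kk, ... if installed
```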