LAMMPS parallel computing: speedup problem when exceeding 16 cores (32 threads)

Dear LAMMPS users, I fear that this subject might be off topic, so let me apologize in advance. So far I have not been able to find help elsewhere, so I decided to submit my problem to your attention, hoping that somebody will be kind enough to give me some advice.

I have been using LAMMPS for about four years. I run my simulations on a PC equipped as follows:

CPU: AMD Ryzen Threadripper 3970X, 32 cores (3.7–4.5 GHz, 147 MB cache)

Motherboard: Gigabyte TRX40 AORUS PRO Wi-Fi 6 (ATX, USB 3.2, ARGB ready)

RAM: 128 GB Corsair Vengeance DDR4, 3000 MHz

GPU: AMD Radeon RX 580, 8 GB (HDMI, DP, DX 12)

I am using the 24 Dec 2020 version of LAMMPS on a Windows 10 machine.

Up to now I have been simulating systems with fewer than about 100,000 atoms and runs of up to 30–40 ns. The simulation time could reach about one week, which was acceptable for me.

Recently I started studying systems with more than 100,000 atoms, so the simulation time went above one week. I started considering a more powerful machine, i.e. one with more than 32 cores. Before embarking in this direction, I decided to measure the speedup vs. core (thread) number curve of my machine.

To my surprise, I found that the speedup curve peaked at about 16 cores (32 threads) and then started to decline. Here are the figures:

# cores    Speedup
      1       1.00
      2       3.33
      4       6.10
      6       8.53
     12      13.03
     16      14.08
     24      13.51
     32      11.48

The issue seems to be related exclusively to the CPU: RAM usage is limited, and the CPU temperature never exceeds 60 °C.

In summary, I am not able to identify the root cause of this strange behavior.

If preferable, do not hesitate to contact me at my personal address: [email protected]

This is not at all strange. You really only have 16 full CPU cores. The 32 number comes from a “trick” that makes each CPU core look like two cores. But they are not. For certain applications that use many threads, but not continuously (e.g. GUIs, operating systems, games), these extra threads can improve performance. But for simulations, where all cores are doing the same thing and are busy all the time, there is rarely a gain. That is why on HPC clusters this simultaneous multithreading (SMT) or hyper-threading (HT) is usually turned off in the BIOS.
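
If you want to check what Windows itself reports for your CPU, something like the following in a command prompt should print the number of physical cores vs. logical processors (hardware threads); the exact output depends on your Windows version:

    wmic cpu get NumberOfCores,NumberOfLogicalProcessors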

Dear Axel, thanks a lot for your patience. I am only a chemist, not a computer specialist, so let me ask a silly question:

Is there a way of “turning on” simultaneous multi- or hyper-threading in the BIOS?
If “no”, I will stay where I am.
If “yes”, I will try to find somebody who can do it on my PC.

It is already turned on.

Bottom line, you have 16 cores. That is the maximum you can use in a meaningful way with LAMMPS. As you have seen, if you try to use more processors it will be less efficient since those are not real CPU cores.

That is no excuse. You are using computers as tools, so you need to learn about them. You would not let me do experiments in your chemistry lab without me having the proper expertise and training, right?

OK I got it.
Thanks again

More fundamentally, this is a reality of high-performance computing: as you increasingly parallelize an operation, the serial “overhead” takes over (Amdahl’s Law), so you lose parallel efficiency.

Going from 2 to 6 processors you only get about a 2.6x speedup (8.53/3.33).
Going from 4 to 16 processors you get only about a 2.3x speedup (14.08/6.10).

Without knowing anything about your simulation system, this isn’t too unexpected.
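
For reference, the usual statement of Amdahl’s Law, assuming a fraction $p$ of the work parallelizes perfectly over $N$ processors, is

$$ S(N) = \frac{1}{(1 - p) + p/N}, $$

which approaches the ceiling $1/(1-p)$ as $N \to \infty$.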

srtee, thanks for taking the time to answer my question. Notwithstanding my lack of competence in computers, Amdahl’s Law is known to me, so I would expect the speedup vs. core number curve to approach an upper limit as the core number → ∞, which is precisely what you are describing in your example.
What I found “strange” is a curve that decays after reaching a maximum. That is not described at all by Amdahl’s Law.
Following up on Axel’s information, which points more precisely to the root cause of my problem, I tried to dig up from the web something about the “trick that makes the CPU cores look like two cores”. My search was unsuccessful.
Then I consulted the supplier of my machine, who flatly denied having supplied me with anything other than a CPU with 32 “physical” cores.
Please understand that I am not disputing what Axel wrote (“Bottom line, you have 16 cores”). I am just curious and trying to understand the difference between “real” cores and “tricky” cores, as part of the learning process Axel strongly recommends.

But I think I am really going off topic, so please accept my apologies.

It would be helpful if you were to upload the input and output files (the 16- and 24-core cases should suffice) from the tests you performed.

A quick Google search for the processor you posted above does indicate it is a 32-core model. The general nomenclature for real vs. “fake” processors is cores vs. threads.

Using some other software, I have encountered something similar because my problem was too small for the full range of processors. Making the problem larger allowed me to see the benefit of using more processors again. Perhaps this is the case here? How large is your test simulation, i.e. how many atoms per processor are you running?

Amdahl’s Law is a theoretical guide, and real computer programs on real computers are not guaranteed to obey it.

After all, your LAMMPS run got a 3.3x speedup going from one core to two – that doesn’t obey Amdahl’s Law either!

Understanding your system’s performance would require much more analysis than we are discussing here. As a simple example, you would want to look at the section breakdown of the LAMMPS performance summary (Pair, Neigh, Comm, etc.) and see whether, for example, communication takes up a larger and larger percentage of the run time compared to the other parts. Furthermore, the performance drop at higher core counts is overdetermined, as any (or all!) of these causes could contribute:

  • Can’t hyperthread effectively (as Axel and you discussed)
  • Communication work grows with the number of processes, so each process does more communication work as nprocs increases; this can not just diminish but overtake the parallelization speedup
  • Similarly, each proc has to “process” and communicate its boundary particles as ghosts; as nprocs increases, the proportion of ghost to local particles increases (more surface area “between procs”) and so does the overall work (see the rough scaling sketch after this list)
  • The more procs you run your system on, the more pronounced an effect inhomogeneities will have
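
To put a rough number on the ghost-atom point: assume a cubic box of side $L$ at fixed atom count, split into $P$ equal cubic subdomains of side $l = L/P^{1/3}$, with a communication cutoff $r_c \ll l$. Then, per subdomain,

$$ \frac{N_\text{ghost}}{N_\text{local}} \approx \frac{(l + 2 r_c)^3 - l^3}{l^3} \approx \frac{6\, r_c}{l} = \frac{6\, r_c}{L}\, P^{1/3}, $$

so at fixed system size the ghost-to-local ratio (and with it the per-process communication and bookkeeping overhead) grows roughly like $P^{1/3}$.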

Thankfully, while understanding your system’s performance is a complicated task, optimizing it isn’t. Simply determine the number of processes at which you have a reasonable tradeoff between speed and efficiency.

Then run your MD simulations embarrassingly parallel by collecting data from multiple short, uncorrelated runs (starting from different initial conditions) for statistical efficiency. For example, I can get the same amount of data running four 8-proc runs of duration 100 ns as I can running one 32-proc run of duration 400 ns, but the former will almost certainly parallelize much better than the latter.

Thank you baerb. Actually my CPU is specified to have 32 cores and 64 threads.

I find it difficult to believe that my system (about 120,000 atoms) is too small for my range of cores. It takes about one week of wall time to complete a 40 ns run. That is why I am considering moving to a machine with a higher number of cores.

But I have another silly question.
I use the command “set OMP_NUM_THREADS=Nt” to specify the number of threads to be used in a given simulation.
Then I use the command “lmp_serial -pk omp Nt -sf omp -in myfile” to start the LAMMPS simulation.

The question is: given a CPU with Nc cores, what is the maximum recommendable number of threads Nt I can specify? More simply: if Nc = 32, should Nt be 32 or 64?

Once again thanks a lot for your patience

If Yoda were here, he might say: try, or try not. There is no “recommended”.

Empirically, I would be very surprised if you got good parallelism with OpenMP alone. From my understanding, LAMMPS parallelizes most efficiently across intra-node MPI, then across (intra-node) OpenMP, and least efficiently across multi-node MPI (although still pretty damn well considering the complexity of MD!).
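
As a sketch of what intra-node MPI would look like on your machine (this assumes your Windows LAMMPS package includes an MPI-enabled executable, often named lmp_mpi, and that MS-MPI is installed; the names may differ on your setup):

    mpiexec -n 16 lmp_mpi -in myfile

i.e. 16 MPI ranks, one per physical core, instead of one process with many OpenMP threads.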

Why not parallelize across multiple simulations like I suggested?
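
As a minimal sketch with your current serial-plus-OpenMP executable (the input and log file names here are made up for illustration, and each input script would use its own random seed or starting configuration), you could open four command prompts and launch one 8-thread run in each:

    set OMP_NUM_THREADS=8
    lmp_serial -pk omp 8 -sf omp -in in.run1 -log log.run1

and likewise with in.run2, in.run3, and in.run4. The four runs are independent, so there is no parallel overhead between them, and together they keep the whole machine busy.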
