Estimating CPU Requirements for Scaling Your Supercomputer Model

Dear Lammps experts,

I’m simulating a silicon cantilever vibrating in a gas environment to study the influence of the gas on damping. The model consists of a large silicon box spanning 200,000 unit cells (about 100 microns) along the X and Y axes and 2 unit cells along Z. The cantilever sits in the middle, surrounded by gas in the hollow region, with a load applied to its free end. The pair_style “hybrid tersoff lj/cut” is used for the interactions between the silicon and nitrogen atoms.
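
For reference, a minimal sketch of how such a hybrid setup is typically written (the potential file name, LJ parameters, and cutoff below are placeholders, not the values actually used):

    # type 1 = Si (cantilever), type 2 = N (gas); numbers are placeholders
    pair_style hybrid tersoff lj/cut 10.0
    pair_coeff * * tersoff Si.tersoff Si NULL
    pair_coeff 1 2 lj/cut 0.005 3.3      # Si-N epsilon/sigma (placeholder)
    pair_coeff 2 2 lj/cut 0.003 3.3      # N-N epsilon/sigma (placeholder)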

Given the size of the model, a scaling analysis becomes essential, but I’m unsure how to begin. How many MPI tasks should I use? How many CPUs per task? Preliminary runs revealed load imbalance, with memory requirements exceeding 245 GB (the memory limit per node) on 1 or 2 nodes. To address this, I applied optimizations such as “processors * * 1,” “comm_style tiled,” “neighbor 0.3 bin,” and “neigh_modify every 1 delay 0 check yes page 1000000 one 20000,” and monitored the CPU load balance using “timer full sync.”
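
For context, these settings appear in my input script roughly as follows (order and surrounding commands omitted):

    processors   * * 1          # no domain decomposition along the thin Z direction
    comm_style   tiled          # allows non-brick sub-domains (used by balance/fix balance rcb)
    neighbor     0.3 bin
    neigh_modify every 1 delay 0 check yes page 1000000 one 20000
    timer        full sync      # detailed, synchronized per-section timings in the log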

Any insights would be greatly appreciated.

Best regards, Bahaman

As a start, I would not begin these tests with such a mega-ultra-extra-over-gigantic system, but with a much smaller, scaled-down system (say 1/100th or even 1/1000th of the size). With this smaller system you can do a strong scaling test, i.e., a series of runs with 1, 2, 4, 8, 16, 32, 64, 128, etc. MPI tasks in total (how many MPI tasks per node you may use depends on the hardware you are running on). In those runs you should watch for two properties:

  • Per MPI rank memory allocation, it should go down
  • Loop time, it should go down

The number of MPI tasks beyond which the calculation no longer gets significantly faster (remember to do multiple runs for each setting and average them to minimize fluctuations) is your limit of scaling for the small system size.
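
For illustration, here is a self-contained sketch of such a strong scaling benchmark on a small silicon box (geometry, temperature, and step count are arbitrary choices, not taken from the original model; Si.tersoff is the parameter file shipped with LAMMPS):

    # in.scaling -- strong-scaling test on a scaled-down Si box (illustrative sketch)
    # run as, e.g.:  mpirun -np N lmp -in in.scaling   for N = 1, 2, 4, 8, 16, ...
    units        metal
    atom_style   atomic
    lattice      diamond 5.431
    region       box block 0 50 0 50 0 2
    create_box   1 box
    create_atoms 1 box
    mass         1 28.0855
    pair_style   tersoff
    pair_coeff   * * Si.tersoff Si
    velocity     all create 300.0 12345
    neighbor     0.3 bin
    timer        full sync
    fix          1 all nve
    run          1000
    # for each rank count N, compare in the log output:
    #   "Loop time of ... on N procs ..."                    -> should go down
    #   "Per MPI rank memory allocation (min/avg/max) = ..." -> should go down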

Now, to a zeroth-order approximation, the memory requirements will double if you have twice the number of atoms, and the simulation will take twice as long with the same number of MPI tasks. Conversely, the per-rank memory and loop time should stay about the same if you also double the number of MPI ranks. This is not perfectly true: the memory consumption has a fixed minimum, and the limit of scaling may be reached sooner. But you can use the test numbers to get a first estimate for the full-size system. Since you won’t have perfect weak scaling, the limit of scaling will come sooner than in the idealized case. So after running with the estimated number of MPI tasks, test again with half, then a quarter, and so on until you locate the limit of scaling for the big system.
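
As a purely illustrative example (invented numbers, not measurements): suppose the 1/1000th-size system needs 2 GB per rank and a loop time of 500 s on 32 MPI ranks, and that 32 ranks is still within its scaling limit. The idealized extrapolation to the full-size system at 1000 times the atom count would then be

$$ N_{\mathrm{ranks}} \approx 1000 \times 32 = 32000, \qquad M_{\mathrm{per\,rank}} \approx 2\ \mathrm{GB}, \qquad t_{\mathrm{loop}} \approx 500\ \mathrm{s}, $$

i.e., unchanged per-rank memory and loop time at 1000 times as many ranks. In practice this will be optimistic, which is why the estimate should then be refined by testing downward from that rank count.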

Once you have this settled (and only after this) you can see if there is some load imbalance, or whether you would achieve additional speedup from using MPI + OpenMP or MPI + GPU. But that comes at the end, not at the beginning.
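
If you do get to that stage, the relevant knobs look roughly like the sketch below (the input file name and the numeric values are placeholders):

    # load balancing (rcb requires comm_style tiled):
    balance      1.1 rcb                        # one-time rebalance before the production run
    fix          lb all balance 1000 1.1 rcb    # or rebalance every 1000 steps during the run
    # MPI + OpenMP (OPENMP package), selected on the command line, e.g.:
    #   mpirun -np 16 lmp -sf omp -pk omp 4 -in in.cantilever
    # MPI + GPU (GPU package), e.g.:
    #   mpirun -np 16 lmp -sf gpu -pk gpu 2 -in in.cantilever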

None of the settings you quoted leads to any significant performance improvement. In fact, they may degrade performance.

Hi Axel, thank you very much. I was really confused, but now I know how to begin the scaling analysis.

… surely at such length scales a continuum description of both cantilever and gas will give results of similar accuracy in much shorter time.

That is also reasonable. 100 microns is large enough!