Optimal number of CPUs per node

Hello,
I would like to know more about parallelizing LAMMPS on one node with multiple CPUs. Is there an optimal number of CPUs?
I ran the same input on different numbers of CPUs (same node, Intel Xeon X5670) 10 times each and averaged the timings; the results are below. It seems that 12 CPUs gives a better time than 16 and 20. May I ask why?
Not all of the nodes I can use have 24 CPUs (most have 16), so is it better to always use 12 CPUs?

N_CPUs  Total_time (s)  Error (s)
1 2320.6 23.56
2 1361.1 7.46
4 960.1 17.25
8 533.8 6.18
12 419.0 3.74
16 466.7 5.66
20 436.2 5.67
24 392.8 6.08

Yes, there is. But it crucially depends on several factors: the kind of hardware you have, the size of the system, the force field, the runtime settings (e.g., how you parallelize with MPI, OpenMP, or a mix of both), what additional LAMMPS features you use, and some compilation settings.
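As an illustration of the MPI/OpenMP point, here is a minimal sketch of the two launch modes, assuming the OPENMP package was included in the build and using placeholder names for the executable and input file:

# pure MPI: 12 ranks on one node
mpirun -np 12 lmp -in in.file

# hybrid MPI + OpenMP: 4 MPI ranks with 3 OpenMP threads each
mpirun -np 4 lmp -sf omp -pk omp 3 -in in.file

Which split (if any) is faster again depends on the system and the hardware, so it has to be measured.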

That is impossible to say without knowing more about your simulation, settings, and command line. Typically, the dominant reasons are either that there are too few atoms per process, so that distributing the work cannot save more time than the added communication overhead, or that there are too many processes for the 3d FFTs (if long-range electrostatics are used) to parallelize well.

As explained above, the optimum depends on many factors. There is a discussion in the LAMMPS manual about how to get the optimal performance for a given system.

Two tips:

  1. Calculate the resources needed for your job in terms of CPU-hours. For example, 2 CPUs × 1361 s ≈ 0.75 CPU-hours, while 20 CPUs × 436 s ≈ 2.4 CPU-hours, i.e. roughly three times the cost for the same work. You have a fixed number of CPU-hours in your allocation, so within the range where scaling is still reasonable you can choose between using more CPUs to get results faster in real time, or using fewer CPUs to get results more slowly but more efficiently.

  2. As much as possible, plan both benchmarks and production runs to fully utilise one node. Depending on the system setup, when computationally intensive programs (including LAMMPS) share a node, one can affect the other’s performance. For example, I use a cluster which has 128 CPUs per node. I ran benchmarks using a script that was set up to use 5 partitions and assigned different CPU numbers using -p 8 8 16 32 64. Once I had determined that my runs had reasonable speed on 8 CPUs, I set up another script to use 16 partitions (16 different conditions) and ran them in parallel using -p 16x4. (Example launch commands are sketched after this list.)
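To illustrate the -p (i.e. -partition) switch used above, here is a minimal sketch of how such runs are typically launched; the executable and input names are placeholders, and the input has to be written so that the partitions actually do different things (for example via world- or uloop-style variables):

# 5 partitions of 8, 8, 16, 32 and 64 MPI ranks (128 ranks in total)
mpirun -np 128 lmp -p 8 8 16 32 64 -in in.file

# 16 partitions of 4 ranks each (64 ranks in total)
mpirun -np 64 lmp -p 16x4 -in in.file

By default each partition writes its own log.lammps.N and screen.N file.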

Depending on whether the cluster in use has TurboBoost/SpeedStep (or whatever the clock-rate adjustment to thermal conditions is called for your CPU) and on the maximum memory bandwidth your machine provides, it may give better overall throughput to not utilize all CPUs.
We operate a cluster with dual-socket nodes that offer 28 CPU cores per node. Depending on the application, on the size of the problem, and on the number of parallel nodes used, the optimal usage (for either performance or CPU allocation) can vary between using 16 CPU cores per node (for applications that require the most memory bandwidth and cannot use the CPU cache efficiently) and using all 28 CPU cores (for applications that do not use the vector units much and can use the CPU cache efficiently). The higher clock rate when using fewer cores per node can overcompensate for running fewer parallel tasks, especially when the strong scaling of the application for the given system size is already limited in that range.
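As a hedged example of underpopulating a node with Open MPI (the binding options differ for other MPI libraries, and the executable and input names are placeholders):

# launch only 16 MPI ranks on a 28-core node, spread evenly over both sockets
mpirun -np 16 --map-by socket --bind-to core lmp -in in.file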

If you want the optimum, there is just one thing to do: benchmarks.
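As a sketch of how such a single-node scaling benchmark could be scripted (the executable and input names are placeholders; for benchmarking you would also shorten the run to a few thousand steps):

# run the same input on different numbers of MPI ranks and keep the logs
for n in 1 2 4 8 12 16 20 24; do
    mpirun -np $n lmp -in in.file -log log.np$n
done

The timing summary at the end of each log file then provides the numbers for a scaling table like the one you posted.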


Thank you for the reply. I will go through the discussion in the manual to learn more and try to find the optimal settings.
I also think that I might have too few atoms per process, but why do 12 and 24 CPUs give similar times, both less than 16 and 20?

These are the main LAMMPS settings:

pair_style lj/smooth/linear ${rcut}                 # LJ interaction with energy and force smoothed to zero at rcut
pair_coeff * * ${e_epsilon} 1.0 ${rcut}
pair_modify shift yes

set atom * dipole/random ${seed} 1 #0 0

fix step all brownian/sphere ${temp} ${seed} gamma_t ${gamma_t} gamma_r ${gamma_r} rng ${rng}   # overdamped (Brownian) translation + rotation

fix align_field all efield  0 ${ef} 0               # THIS IS ZERO FOR NOW
fix kick all addforce 0.0 ${fe} 0.0              # THIS IS ZERO FOR NOW
fix active all propel/self dipole ${fp}             # self-propulsion force of magnitude fp along each dipole

I run for 800 million timesteps with timestep = 1e-6.

The command used to run them is

mpirun -np $N_cpus  lmp_clus_m -in in2d.lmp               # N_cpus is the number of CPUs, 1 to 24

The nodes available at the time of the runs differed from each other, although no other job was sharing a node during a run.

in2d.lmp (1.7 KB)

That is impossible to say from remote and without making benchmarks of my own and monitoring the machine in question. It could be due to load imbalance from having different particle densities in different subdomains when LAMMPS partitions the system. The effective clock rate under TurboBoost may also differ. If the different runs were on different machines with different CPUs, then that can be a significant factor in performance (different per-core performance, different memory bandwidth).
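If load imbalance turns out to be the issue (that is an assumption, not something visible from the posted input), LAMMPS has the balance command and fix balance for exactly this case, and the MPI task timing breakdown at the end of the log file shows how much time goes into pair computation versus communication. A minimal sketch for a 2d system:

balance 1.1 shift xy 10 1.1                    # one-time rebalancing of the domain decomposition
fix lb all balance 1000 1.1 shift xy 10 1.1    # re-balance every 1000 steps during the run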

How many physical processors are in this node? The Xeon X5670 has only 6 cores, so you would need a 4-socket motherboard to run your tests on 24 physical cores. If it is actually a dual-socket node (12 physical cores plus hyperthreading), then the runs with more than 12 processes would be using hyperthreads rather than additional cores, which typically gains LAMMPS little, if anything, and would be consistent with the timings you posted.