About a parallel efficiency difference

Dear Lammps developer:

I am currently working on hydrocarbons with two different potentials: AIREBO-M and OPLS-UA. LAMMPS shows vastly different parallel efficiency for the two potentials:

For a ~48K-atom system with AIREBO-M, I get 6.392 timesteps/s on a single node and 11.587 timesteps/s on 4 nodes. This is roughly what I expected, considering the cluster I am working with.

However, for a ~8K-UA system with OPLS-UA, I got the following speeds:
Node(s)   timesteps/s   comm time%
1         551.193       15.34
2         343.020       46.24
4          74.668       44.32
which is quite weird and not what I expected.

The cluster I am using has AMD-like CPUs with 64 cores per node. However, unlike most clusters, the nodes communicate over plain TCP/IP (MPI launched via ssh), not RDMA. The LAMMPS version is 5Jun2019, as the cluster is only reachable from the intranet, which makes it hard to compile a recent version. Hyperthreading is disabled on every node, and only MPI (not OpenMP) is used.

I wonder why there is such a dramatic difference in parallel efficiency, and how I could improve the latter case.

Best regards,
Jiawei Zhao

Please note that you are comparing two different system sizes and two potentials with very different computational complexity. The computational cost per atom of AIREBO is almost 15x that of your OPLS system (assuming perfect weak scaling and no parallel overhead for in-node communication).
That means you have much more time to spend computing before you have to communicate again.
Add to that that your AIREBO system is 6x larger, and you have an explanation for getting some speedup from multiple nodes (even though not a good speedup). With 4 nodes, i.e. 256 MPI processes, you have only about 188 atoms per MPI process, and that is not a lot, even for an “expensive” potential; you are reaching the limit of strong parallel scaling.
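To make the arithmetic explicit, here is a quick back-of-the-envelope sketch in Python; the 15x per-atom cost ratio and the system sizes are the numbers quoted above, everything else is simple division:

```python
# Back-of-the-envelope comparison of the two runs described above.
# Numbers come from the thread: ~48K atoms (AIREBO-M), ~8K UAs (OPLS-UA),
# 64 cores per node, and an assumed ~15x per-atom cost ratio for AIREBO.

cores_per_node = 64

airebo_atoms = 48_000
opls_atoms = 8_000
airebo_cost_per_atom = 15.0  # relative cost, with OPLS = 1.0

# Atoms per MPI process: AIREBO on 4 nodes, OPLS on 1 node
airebo_per_proc = airebo_atoms / (4 * cores_per_node)   # ~187.5
opls_per_proc = opls_atoms / (1 * cores_per_node)       # 125.0

# Relative compute work per process between communication steps
airebo_work = airebo_per_proc * airebo_cost_per_atom
opls_work = opls_per_proc * 1.0

print(f"AIREBO: {airebo_per_proc:.0f} atoms/process, relative work {airebo_work:.0f}")
print(f"OPLS:   {opls_per_proc:.0f} atoms/process, relative work {opls_work:.0f}")
# The AIREBO run does ~22x more compute per process between communications,
# which is why it can still gain some speedup despite the slow interconnect.
```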

If your AIREBO system was smaller, you would likely see less speedup.
For your OPLS system, you are already at the limit (if not past it) with a single node, at only 125 atoms per MPI process. Due to the lower computational complexity of OPLS, your strong-scaling limit is reached at a larger number of atoms per process than for AIREBO. And this is assuming you are not using a PPPM Kspace solver. If you do (to treat long-range electrostatics), then you have the additional problem that you have to do six 3d FFTs per step (3x reverse and 3x forward) to get the forces, and those cause a particularly high communication overhead, which is much increased on your system because TCP/IP has latencies orders of magnitude larger than InfiniBand.

In summary, your observations are not weird at all, and there is not much you can do. It would be interesting to see the performance within a single node for a sequence of 1, 2, 4, 8, 16, 32, and 64 MPI processes. That would let you better judge where the strong parallel scaling limit of your two systems lies. Similarly, I would be curious to see how much performance you get, and thus how much communication overhead you have, on the AIREBO system with 2 nodes.
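For reference, the strong-scaling efficiency implied by the OPLS timings already posted in this thread can be computed directly; a small Python sketch using those numbers:

```python
# Parallel efficiency E(N) = rate(N) / (N * rate(1)) for the OPLS-UA run,
# using the timesteps/s figures from the table above (1, 2, and 4 nodes).

rates = {1: 551.193, 2: 343.020, 4: 74.668}

def efficiency(nodes: int) -> float:
    """Strong-scaling efficiency relative to the 1-node run."""
    return rates[nodes] / (nodes * rates[1])

for n in sorted(rates):
    print(f"{n} node(s): speedup {rates[n] / rates[1]:.2f}x, "
          f"efficiency {efficiency(n):.1%}")
# 2 nodes run at ~31% efficiency and 4 nodes at ~3%: the 4-node run is
# over 7x slower than a single node, a clear sign of communication-bound scaling.
```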

Dear Prof. Axel Kohlmeyer,

Thank you for your detailed answer, as usual. I did manage to try those two systems on the cluster with smaller MPI process counts on one node, as only one node is free at the moment. Here are the results in timesteps/s:

It seems to me that the parallel efficiency within a single node is fairly OK.

As for the small size of the OPLS-UA system: this is a trial system just to validate that the parameters I put in are correct. We intend to vastly increase the volume and time scale of the simulation, which is why we are trying to drop the complex cutoff function of AIREBO and combine the hydrogens into united atoms.

Just judging from the tests and your comments, is it safe to say that if the system is vastly increased in volume, for example to 8M UAs (which is probably close to the target…), I could expect much better parallel efficiency on the current cluster? However, if I simply want to simulate for a longer time, more nodes will not help :frowning: . Please correct me if I am wrong.

Another thing I would like to ask: can OpenMP make a difference?

Best regards,
Jiawei Zhao

There is a significant drop in parallel efficiency when going from 32 processes to 64 processes.
It is difficult to judge whether this is due to the variable clock rate in modern CPUs (which will be limited when using all cores and thus producing more heat) or due to reaching the “algorithmic” scaling limit.

That is impossible to say remotely and without seeing all the details of the runs. There is some hope, but specifically if you are using long-range electrostatics, you also have to contend with the problem that the 3d FFTs can only be parallelized in 2d, and thus the larger the system, the lower the performance at the limit of strong scaling.
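The FFT communication problem can be illustrated with a toy model: the parallel transposes inside a distributed 3d FFT are all-to-all exchanges, so the number of messages grows roughly quadratically with the process count while the per-process compute shrinks. All numbers below are illustrative, not measurements from the cluster in this thread:

```python
# Toy model of why distributed 3d FFTs (as in PPPM) scale poorly on a
# high-latency network. Each FFT needs transposes implemented as all-to-all
# exchanges: every process sends a message to every other process, so the
# number of messages per transpose grows ~ P^2 while per-process compute
# shrinks ~ 1/P. Illustrative only, not measured on this cluster.

def messages_per_transpose(procs: int) -> int:
    """Messages exchanged in one naive all-to-all among `procs` processes."""
    return procs * (procs - 1)

for p in (64, 128, 256):
    print(f"P={p:3d}: ~{messages_per_transpose(p):6d} messages per transpose")
# On TCP/IP each message pays tens to hundreds of microseconds of latency,
# versus roughly a microsecond on InfiniBand, so this overhead quickly
# dominates as the number of processes grows.
```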

That is a tricky question. The major issue is that you don’t have a low-latency network, so your main strategy to improve performance would be reducing the impact of communication overhead. LAMMPS is not well set up for that, which is a consequence of its design for flexibility. It may be worth looking into NAMD, which is built around efficient load balancing and latency hiding at the expense of flexibility. A cluster with 64 cores per node and only an Ethernet interconnect is a very unbalanced machine to begin with, and running multi-node calculations on it is asking for trouble. I would rather apply for time on a better machine at a regional or national supercomputing center than waste a lot of time optimizing performance on a badly balanced machine. I would not want to use that local machine for jobs that extend beyond a single node; too much CPU time is wasted on the unsuitable interconnect.

Thanks again for your time and all your advice!

Best regards,
Jiawei Zhao