Poor performance of Kokkos OpenMP for multi-core CPU (Skylake AVX 512)

Hi,

I was running LAMMPS for the LJ system using Kokkos (OpenMP) as the accelerator on a multi-core CPU (Skylake AVX-512, 40 physical cores).

First, I ran a job with no accelerator (40 MPI tasks only). It takes 129 sec walltime (128.455 sec loop time).

Next, I ran several runs with different settings of Kokkos (OpenMP).

The option “-pk kokkos neigh half newton on comm device binsize 2.8” appears to be the most efficient with 40 MPI tasks and 1 OpenMP thread per MPI process, and it takes about 158 sec walltime (153.624 sec loop time).

All other MPI/OpenMP combinations (e.g. 20 MPI/2 OMP, 10 MPI/4 OMP, 8 MPI/5 OMP, 5 MPI/8 OMP, 4 MPI/10 OMP, 2 MPI/20 OMP, 1 MPI/40 OMP) require more time to complete the job when compared to the no-Kokkos run with the 40 MPI/1 OMP setting.
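
For reference, a typical launch line for one of these combinations (here 10 MPI tasks with 4 OpenMP threads each) looked roughly like the following; the mpirun binding options and the input file name are placeholders, not copied from my actual script:

mpirun -np 10 --bind-to socket --map-by socket lmp -k on t 4 -sf kk -pk kokkos neigh half newton on comm device binsize 2.8 -i in.lj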

So, when we compare performance for one full node of an HPC system, I am losing performance by using Kokkos (OpenMP).

Then, why should I use Kokkos (other than portability) when I am losing performance?

(By the way, I have also tried Kokkos-serial to eliminate OpenMP backend overhead while using 1 OpenMP thread, and this didn’t help!)

Or, is there a special trick to gain acceleration when using the Kokkos OpenMP feature?

I am copying the LJ-input and my submission file below:

So, when we compare performance for one full node of an HPC system, I am losing performance by using Kokkos (OpenMP).

this is expected behavior. thread parallelization in USER-OMP and USER-INTEL will get similar parallel scaling results. in USER-INTEL, you can gain a little by using single or mixed precision, but you would still be running faster with all-MPI than with MPI+OpenMP.

This happens because you have a dense, homogeneous, well-behaved system with a sufficient number of atoms, so the MPI parallelization can be at its most efficient. LAMMPS has been constructed from the ground up to do MPI-parallel calculations very efficiently using domain decomposition. All other parallelization (whether on a device or on the host CPU) cannot be as efficient unless you run under conditions where the domain decomposition is no longer as efficient.

Then, why should I use Kokkos (other than portability) when I am losing performance?

because not all systems are as well behaved as this one, and because you are misunderstanding the kind of performance gain that can be achieved.
using domain decomposition for MD promotes cache efficiency (due to better cache locality with fewer atoms per process) at the expense of more communication. for multi-thread parallelization, on the other hand, you parallelize over particles, which is easy to add and compatible with domain decomposition (since it is an orthogonal scheme). however, for multi-thread parallelism you need to avoid race conditions and false sharing. this can be done by keeping extra copies of data, using atomic operations, or not using newton’s third law; all of these add overhead or make the calculation less efficient.

this relation between MPI parallelism and multi-thread parallelism will change when the MPI parallelization scheme becomes less efficient. this happens, e.g., when you have too few atoms per domain. at some point LAMMPS will scale out and not run faster, or even run slower, if you use more processors via MPI only. with a pair style like lj/cut this will happen only at a rather small number of atoms, but if you have long-range electrostatics (via ewald or pppm), then the scaling of kspace will reach its limit much earlier and you will lose performance overall, as the extra overhead from the 3d FFTs in pppm or the bad O(N^(3/2)) scaling of ewald will drag you down (despite the O(N) scaling of the pair style). at that point, using MPI plus some threads will give you better performance and in particular allows you to scale (i.e. improve total speed) to more total CPUs.

the same will also happen if your domain decomposition is not very well behaved, i.e. you have significant load imbalance. the balance command can help some, but not always, as some systems are just pathological. in some cases you may change the communication pattern to use recursive bisectioning, but using MPI plus thread parallelization can also help, as this results in larger subdomains for the same total number of processors and usually then reduces the load imbalance (especially with the help of the balance command). this works because parallelizing over individual atoms in the multi-thread scheme does not cause (much) load imbalance. that is not to say that the situation cannot be improved; e.g. it would benefit from alternative, per-thread neighbor lists optimized to reduce the overhead of the current multi-thread schemes.
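
as a sketch of the load-balancing options mentioned above (the 1.1 imbalance threshold and the 1000-step rebalancing interval are just example values), the relevant input commands could look like this:

comm_style tiled                    # needed for recursive bisectioning (rcb)
balance 1.1 rcb                     # rebalance once at setup
fix lb all balance 1000 1.1 rcb     # rebalance every 1000 steps during the run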

(By the way, I have also tried Kokkos-serial to eliminate OpenMP backend overhead while using 1 OpenMP thread, and this didn’t help!)

there is next to no overhead with just one OpenMP thread when using per-thread copies of data. that is why this scheme is used by default in Kokkos (and is the only choice in USER-OMP).

Or, is there a special trick to gain acceleration when using the Kokkos OpenMP feature?

you are misunderstanding what kind of “acceleration” you can gain. this should all be explained in the manual, perhaps with less detail, since there it has to be done in a more general way.

You are also misunderstanding the purpose of Kokkos. The primary reason for Kokkos being developed is that it allows you to write a single pair style in C++, without (much) understanding of GPU programming, that will then work on both GPUs (or Xeon Phi) and CPUs, with or without multi-threading. It cannot fully reach the performance of USER-INTEL or the GPU package on CPUs and GPUs, but adding support for a pair style for which something similar already exists is significantly less effort with Kokkos than with those packages, and it requires significantly less programming expertise in vectorization and directive-based SIMD programming or GPU computing. Also, support for a new kind of computing hardware will primarily need additional code in the Kokkos library and just a little bit of setup/management programming in LAMMPS.

In conclusion: if you run on CPUs only, you are almost always best off running with MPI only, or with just a small number of OpenMP threads on top of mostly MPI.

You should repeat your experiment on a system with far fewer atoms, so you can see the limit of MPI-only strong scaling.

And to understand what I like to call “the curse of the kspace (tm)”, you should repeat your experiment with the rhodo benchmark example (also for a large and a not-so-large number of atoms). You may also want to check out some of the granular media examples, which often have significant load imbalance unless special measures are taken to reduce it.
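
for example, assuming the stock inputs from the LAMMPS bench directory (which take x/y/z size/replication variables), you could repeat the runs at different system sizes roughly like this:

mpirun -np 40 lmp -in in.lj -var x 1 -var y 1 -var z 1                # default-size LJ benchmark
mpirun -np 40 lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2      # rhodo replicated 2x2x2 (8x more atoms)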

Axel.

Please let me add that I just noticed in my e-mail archive that I already gave you some explanation of the situation on January 30th.

Hi Axel,
Thanks so much for such a detailed response. Yes, you are correct that I sent an email earlier too, on different aspects of using Kokkos in LAMMPS (as a black box).
I understand that the main purpose of Kokkos is portability, as you mentioned: to port an existing code to upcoming architectures without investing many ‘man-years’ of coding.
The confusion arises because Kokkos is grouped with the accelerator packages, and I had the misunderstanding that Kokkos can provide more acceleration than that offered by the GPU package or USER-OMP or USER-INTEL.
My earlier email was related to an issue with Kokkos/GPU, and in the end I could reproduce the timings that were mentioned by Stan. I found comparable speedups for an LJ system with the GPU package and the Kokkos/GPU package.

Unfortunately, with Kokkos OpenMP the runs are slower than the non-Kokkos runs. I was studying an LJ system, which exhibits very good MPI scalability, and that is why OpenMP is less efficient. Thanks for your explanation.

As you suggested, I did runs for the rhodopsin system (32000 atoms) on core counts ranging from 1 to 320 (1, 4, 8, 16, 24, 32, 40, 80, 120, 160, 200, 240, 280, 320 cores). I also plotted the speed-up factor for these runs. The plot is attached here.

This is interesting. The Pair and Bond parts show perfect linear scaling, whereas the Neigh and Kspace parts scale poorly, and the total walltime also suffers from this poor scalability when running on a larger number of cores. Again, thank you for leading me to this.

Okay, so I thought that I had a good test case for OpenMP acceleration. I ran it on 4 nodes with 2 settings (shown below), and in both cases the total walltime is more than for the non-Kokkos version.
mpirun -np 80 -ppn 20 --bind-to socket --map-by socket lmp -k on t 2 -sf kk -pk kokkos neigh half newton on comm host binsize 2.8 -i in.rhodo.scaled

mpirun -np 40 -ppn 10 --bind-to socket --map-by socket lmp -k on t 4 -sf kk -pk kokkos neigh half newton on comm host binsize 2.8 -i in.rhodo.scaled

What it appears to me is that, since the Pair part scales perfectly linearly, I lose quite a bit of performance by using fewer MPI tasks, while the OpenMP threads do not improve the Kspace or Neigh parts either, and the Pair part takes most of the walltime. So the performance lost by using fewer MPI tasks in exchange for more OpenMP threads cannot even be recovered by a gain in Kspace performance! Therefore, it might be difficult to see OpenMP acceleration for this example too.

I just need a potential test case for seeing OpenMP thread-acceleration in LAMMPS.

Sorry for bothering you.
Best regards,
Prithwish

[Attachment: Screenshot 2020-05-18 at 21.53.15.png — speed-up plot for the rhodopsin runs]

Hi,
I did a thorough benchmark test for the rhodopsin system (with 32000 atoms) using both USER-OMP and Kokkos OpenMP.
Plots of the parallel efficiency are attached herewith. The benchmark was done on Intel Xeon Gold 6148 (Skylake) processors (2x20 cores per node at 2.4 GHz) with up to 10 nodes.
LAMMPS was compiled with Intel 2019u5, GCC 8.2.0, and MKL.

The runs were done for MPI-only and MPI+OpenMP (details are in the plots).
Parallel efficiency = (1/Np) * (Ts/Tp), where Ts is the single-core time and Tp is the time on Np cores.
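
For example (with made-up numbers): if Ts = 1600 sec on 1 core and Tp = 50 sec on Np = 40 cores, the parallel efficiency is 1600/(40*50) = 0.8, i.e. 80%.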

I am getting about a 10-15% speedup compared to pure-MPI runs when using USER-OMP, but with Kokkos OpenMP I am losing performance.

Is that expected?

One more thing that I noted: OMP_PROC_BIND and OMP_PLACES make a significant difference for the USER-OMP package, but for Kokkos OpenMP they do not make any difference.
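
For reference, the thread binding settings for the USER-OMP runs were along these lines (the values and the launch line below are illustrative, not copied verbatim from my job script):

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
mpirun -np 20 -ppn 20 lmp -sf omp -pk omp 2 -i in.rhodo.scaled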

Any comments will be highly appreciated.

Best regards,
Prithwish

[Attachment: scaling_rhodo_user_omp.png — parallel efficiency, USER-OMP]

[Attachment: scaling_rhodo_kokkos_omp.png — parallel efficiency, Kokkos OpenMP]

In LAMMPS, using MPI will almost always beat OpenMP until you get to, say, thousands of MPI ranks, where the MPI overhead becomes large. There are overheads to making the kernels thread-safe. If USER-OMP is 10-15% faster than pure MPI for small problems, then it is most likely NOT due to threading but rather due to better SIMD vectorization. The Kokkos package in LAMMPS currently doesn’t vectorize well. USER-INTEL should be even better than USER-OMP at vectorizing, if the styles are supported in that package.
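
For example, a USER-INTEL run in mixed precision could be launched roughly like this (assuming the package was compiled in; the rank/thread counts and input name are just placeholders):

mpirun -np 40 lmp -sf intel -pk intel 0 omp 1 mode mixed -i in.rhodo.scaled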

Stan

Hi Stan,
Thanks for the explanation.
I’ll try to arrange 1000-2000 cores to check whether the scenario changes from what is shown in the plots.

One more question that I would like to ask:
If you look at the plots that I sent in my earlier emails, I find that the pure-MPI runs are a bit slower than MPI+OpenMP with 1 OpenMP thread per MPI rank, e.g. a run with 40 MPI ranks (= 1 node) is a little slower than a run with 40 MPI ranks plus 1 OpenMP thread per rank. With 1 OpenMP thread this should be identical to having no threading.
Shouldn’t it be the opposite, since a mixed run with even 1 OpenMP thread could incur some overhead from making the code thread-safe?

Thanks,
Prithwish

not automatically. the code in USER-OMP was written in a way that keeps the overhead for a few threads minimal, at the expense of a higher overhead for many threads. the original intent was not to have this as a separate package, but to integrate it into the core of LAMMPS and thus give OpenMP-style multi-threading the same level of support as MPI parallelization. this, unfortunately, has not happened, so we now have to deal with extra code duplication and maintenance effort. when compiled without -fopenmp, the code in USER-OMP should have next to no overhead, as all extra operations are not just skipped but not even compiled in.

in addition to multi-thread support, the code in USER-OMP has optimizations not present in the plain versions of the same styles. it uses several techniques to signal to the compiler where there are additional opportunities for optimization (e.g. pointers that won’t alias) and to reduce pointer chasing (a known problem of the core data structures in LAMMPS for positions and forces, stemming from the original design of LAMMPS in Fortran - where such data structures can be better optimized due to the nature of Fortran - and the desire to have similar-looking code when converting it to C++). most of these are based on methods used in the OPT package, but implemented in a portable way (OPT uses features that are not officially part of the C++ standard). depending on the pair style and system, the serial performance of USER-OMP pair styles is often 10-15% better than the corresponding plain styles, and in some favorable cases up to 40% better.
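
as a sketch, with the traditional make-based build of that LAMMPS version (the machine makefile names may differ on your installation), the with/without OpenMP comparison looks like this:

cd src
make yes-user-omp     # enable the USER-OMP package
make omp              # build with -fopenmp: threading active
make serial           # build without -fopenmp: the extra thread code is compiled out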

axel.