Re: [lammps-users] trouble building LAMMPS with Kokkos

there are several issues here. overall your numbers do not make sense to me.

first off, neither pair style lj/sf (or more correctly lj/smooth/linear) nor pair style gauss/cut is supported by the GPU or the KOKKOS package, so those runs fall back to the plain CPU implementation. that makes your numbers strange: a CPU run on all cores should not give the same performance as the KOKKOS/GPU runs with only 1 MPI rank.
it would be better to post the entire log files, so that we have more information about where the time is spent and what else is happening.

then, your OMP_NUM_THREADS variable appears to be set to 16. you should have it set to 1 unless you explicitly want to use threads (which does not seem to be the case).
it can also negatively affect performance under some circumstances. this is why, if OMP_NUM_THREADS is not set, LAMMPS defaults to 1 thread instead of the usual OpenMP implementation default of using all available cores. with hyperthreading enabled, you should always compare the performance of "16 MPI + 1 OMP", "8 MPI + 2 OMP" and "8 MPI + 1 OMP": depending on the specifics of the pair styles and your hardware, you may benefit more from the extra per-MPI-rank CPU cache than from the extra compute capability (which is limited to roughly 5-20% for hyperthreading vs. real CPU cores, and often at the lower end of that range). but I have never seen a performance penalty from oversubscribing CPUs with threads as huge as the one you would have to have had here.
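
as a sketch of how to launch those three combinations on a single node (assuming an executable named lmp on the PATH, and the OPENMP package installed for the multi-threaded case):

export OMP_NUM_THREADS=1
mpirun -np 16 lmp -in in.lmp          # 16 MPI x 1 OMP
mpirun -np 8 lmp -in in.lmp           # 8 MPI x 1 OMP
export OMP_NUM_THREADS=2
mpirun -np 8 lmp -sf omp -in in.lmp   # 8 MPI x 2 OMP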

you should also try the GPU package with 4 or 8 MPI ranks, for reasons explained in the manual and discussed many times here on the mailing list.
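
e.g. something along these lines (just a sketch, assuming a single GPU in the machine and a GPU-enabled lmp binary):

mpirun -np 8 lmp -sf gpu -pk gpu 1 -in in.lmp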

another important question is whether you compiled the GPU package with single, mixed, or double precision floating point support. for consumer-grade GPUs there is usually a significant benefit to using mixed precision, if that is acceptable.
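
with a CMake build that choice is made at configure time, e.g. (a sketch for a CUDA build, run from a build directory next to the cmake folder; adjust GPU_ARCH to your card):

cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_PREC=mixed -D GPU_ARCH=sm_75 ../cmake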

TL;DR the numbers you provide make no sense.

axel.

Thanks again for more input.

Since gauss/cut (not the simpler version) is essential for my project, GPU and Kokkos are off the table for now.

I have now set OMP_NUM_THREADS explicitly to 1. The runtime is about the same. But what I have discovered is that if I watch the Ubuntu system monitor, only one CPU is active for every single test run. So naturally the timings all come out the same.
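
(One way to double-check that the variable actually reaches the MPI ranks is something like

mpiexec -np 2 sh -c 'echo OMP_NUM_THREADS=$OMP_NUM_THREADS'

which should print the value once per rank.)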

Here is the output from cmake regarding MPI:

-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")

What follows is the full log file (minus the output from a dump command) for

mpiexec -np 16 lmp -in in.lmp

It used about 6% of the 16 CPUs - about 1 CPU.
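
(In case this is a mismatch between the MPI that lmp was built against and the mpiexec on the PATH, a quick check along the lines of

ldd $(which lmp) | grep -i mpi
mpiexec --version

should show the same Open MPI installation in both places.)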

Quite obviously I’m missing something yet again.

There must be something else running on your machine that is occupying all CPUs.
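
a quick way to see what it is would be to sort the process list by CPU usage, e.g.

top -o %CPU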

Not sure what happened, but all cores are now running at 100%. I will set OMP_NUM_THREADS to 1 globally and wait for some other application before trying out GPU support. Thanks for your help.
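
(One way to make that global for my shell, assuming bash:

echo 'export OMP_NUM_THREADS=1' >> ~/.bashrc

so that new terminals pick it up automatically.)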