there are several issues here. overall your numbers do not make sense to me.
first off, neither pair style lj/sf (or more correctly lj/smooth/linear) nor pair style gauss/cut is supported by either the GPU or the KOKKOS package, so you should be seeing CPU performance in all cases. that makes your results strange: a CPU run on all cores should not give the same performance as the KOKKOS/GPU runs with only 1 MPI rank.
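you can verify this yourself from your binary: the `-h` flag prints all styles compiled into the executable, including accelerated variants. a quick check (here `lmp` is a placeholder for whatever your LAMMPS executable is called):

```shell
# list all compiled-in variants of the two pair styles in question;
# accelerated versions would show up with a /gpu or /kk suffix
lmp -h | grep -E 'lj/smooth/linear|gauss/cut'
```

if no /gpu or /kk suffixed versions appear, those styles run on the CPU regardless of which accelerator package flags you pass.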
it would be better to post the entire log files, so that we have more information about where the time is spent and what else is going on.
then, your OMP_NUM_THREADS environment variable appears to be set to 16. it should be set to 1 unless you explicitly want to use OpenMP threads (which does not seem to be the case here).
oversubscribing the CPU with threads can hurt performance under some circumstances, which is why LAMMPS defaults to 1 thread when OMP_NUM_THREADS is not set, instead of the usual OpenMP implementation default of using all available cores. with hyperthreading enabled, you should always compare the performance of "16 MPI x 1 OMP", "8 MPI x 2 OMP", and "8 MPI x 1 OMP": depending on the specifics of the pair styles and your hardware, you may benefit more from the larger per-rank CPU cache than from the extra compute capability of hyperthreading (which is limited to roughly 5-20% over real CPU cores, and often at the lower end of that range). that said, I have never seen a performance penalty from thread oversubscription as huge as the one your numbers imply.
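as a sketch, the three benchmark runs above would look something like this (assuming an 8-core/16-thread CPU, an executable named `lmp`, and an input file `in.benchmark`; adjust names to your setup):

```shell
# pin OpenMP to one thread per rank unless threading is wanted explicitly
export OMP_NUM_THREADS=1

# 16 MPI x 1 OMP: one rank per hardware thread (uses hyperthreading)
mpirun -np 16 lmp -in in.benchmark

# 8 MPI x 2 OMP: one rank per physical core, 2 threads each
OMP_NUM_THREADS=2 mpirun -np 8 lmp -sf omp -in in.benchmark

# 8 MPI x 1 OMP: one rank per physical core, no threading
OMP_NUM_THREADS=1 mpirun -np 8 lmp -in in.benchmark
```

compare the loop times reported at the end of each log file to see which combination wins on your machine.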
you should try the GPU package with 4 or 8 MPI ranks, for the reasons explained in the manual and many times over here on the mailing list, too.
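since the GPU package (unlike KOKKOS) is designed to have multiple MPI ranks share one GPU, such a run would look roughly like this (again assuming an executable named `lmp` and an input file `in.benchmark`):

```shell
# 8 MPI ranks sharing 1 GPU: -sf gpu applies the /gpu suffix to supported
# styles, -pk gpu 1 tells the GPU package to use one GPU
mpirun -np 8 lmp -sf gpu -pk gpu 1 -in in.benchmark
```

note that with your current pair styles this will only accelerate whatever parts of the calculation do have /gpu variants; the unsupported pair styles still run on the CPU.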
another important question is whether you compiled the GPU package with single, mixed, or double precision floating-point support. on consumer-grade GPUs there is usually a significant performance benefit from mixed precision, if the (small) loss of accuracy is acceptable for your application.
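with the CMake build system, the precision is selected at configure time via the GPU_PREC option (shown here for a CUDA build from a build directory next to the cmake/ tree; adjust paths to your source layout):

```shell
# configure the GPU package for CUDA with mixed precision,
# then build; GPU_PREC accepts single, mixed, or double
cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_PREC=mixed ../cmake
cmake --build .
```

the precision the binary was built with is also echoed in the GPU package initialization output in the log file, so you can check it there for your existing build.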
TL;DR: the numbers you provide make no sense.