Please note the following.
In your “normal” run you have:
Loop time of 2.57551 on 1 procs for 100 steps with 32000 atoms
Performance: 16773.409 tau/day, 38.827 timesteps/s
16.4% CPU use with 1 MPI tasks x 1 OpenMP threads
This means that there are other processes running on your computer consuming most of the CPU time, so only about 1/6th of a core is available to your LAMMPS process.
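To make the "1/6th" concrete, here is a minimal arithmetic sketch, assuming (as LAMMPS does for this line) that the reported CPU use is the ratio of CPU time to wall time over the loop:

```python
# Numbers taken from the "normal" run log above.
wall_time = 2.57551        # "Loop time" in seconds (wall clock)
cpu_percent = 16.4         # reported "% CPU use"

# CPU seconds actually spent in LAMMPS during the loop
cpu_time = wall_time * cpu_percent / 100.0
print(round(cpu_time, 3))            # ~0.422 s of 2.576 s wall time

# Fraction of one core that went to LAMMPS: ~1/6
print(round(cpu_percent / 100.0, 3))
```

So roughly 5/6 of the wall time was spent waiting while other processes held the CPU.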
When I run on my (slow, old Windows 10) machine, I get instead (note the 100% CPU usage!):
Loop time of 3.67333 on 1 procs for 100 steps with 32000 atoms
Performance: 11760.434 tau/day, 27.223 timesteps/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
Your MPI parallel logs support my suspicion that they were not run in parallel, because either the executable is not MPI enabled, or you are using the wrong MPI runtime installation. Despite requesting two MPI processes, the log reports only one, which means you are running the same calculation twice concurrently.
Loop time of 3.34212 on 1 procs for 100 steps with 32000 atoms
Performance: 12925.947 tau/day, 29.921 timesteps/s
19.6% CPU use with 1 MPI tasks x 1 OpenMP threads
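A quick way to verify this is to check the rank count that LAMMPS itself reports, rather than what was requested on the command line. A minimal sketch (the log line is copied from the output above):

```python
import re

# If mpirun was asked for 2 ranks but LAMMPS still prints "on 1 procs",
# each rank ran the whole problem independently: the binary is not MPI
# enabled, or it was launched with a mismatched MPI runtime.
log_line = "Loop time of 3.34212 on 1 procs for 100 steps with 32000 atoms"
nprocs = int(re.search(r"on (\d+) procs", log_line).group(1))
print(nprocs)  # 1, even though two MPI processes were requested
```

A correctly launched two-rank run would print "on 2 procs" and "2 MPI tasks" instead, as in the output below.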
The corresponding output on my machine is (note the 2 MPI tasks and the nearly 100% CPU):
Loop time of 1.92714 on 2 procs for 100 steps with 32000 atoms
Performance: 22416.626 tau/day, 51.890 timesteps/s
99.7% CPU use with 2 MPI tasks x 1 OpenMP threads
With OpenMP threading there should be significantly more than 100% CPU use reported. Here is my output for 2 OpenMP threads:
Loop time of 1.84196 on 2 procs for 100 steps with 32000 atoms
Performance: 23453.258 tau/day, 54.290 timesteps/s
184.9% CPU use with 1 MPI tasks x 2 OpenMP threads
That you get less is another indication that your machine is very busy with other processes, and for as long as that is the case, your calculations will always be slow.
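For comparison, the timings from my machine show what healthy threaded scaling looks like. A minimal sketch using the serial and 2-thread loop times quoted above:

```python
# Loop times from my (unloaded) machine, taken from the logs above.
serial_time = 3.67333     # 1 MPI task x 1 OpenMP thread
threaded_time = 1.84196   # 1 MPI task x 2 OpenMP threads

speedup = serial_time / threaded_time   # close to the ideal 2x
efficiency = speedup / 2 * 100          # percent of ideal 2-thread scaling
print(round(speedup, 2), round(efficiency, 1))
```

On a busy machine the threads compete with other processes for cores, so both the reported % CPU and the effective speedup drop well below these values.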