I ran your benchmark for 1000 timesteps on a single V100 GPU+Power9 with NVLINK. I got 151,437 atom-steps/s. Your system size of 5k atoms is pretty small, which makes it difficult to fully utilize the GPU. Profiling shows most of the time is spent in PairReaxComputeTorsion. This is known to have a high amount of warp divergence, hoping to improve in the future. Packing/unpacking comm buffers for fix/qeq is the next most expensive. For a system this small, it is probably better to pack/unpack on the host, I can add an option for that. But this should give you an idea if you are running well on the V100.
Loop time of 32.9645 on 1 procs for 1000 steps with 4992 atoms
Performance: 0.655 ns/day, 36.627 hours/ns, 30.336 timesteps/s
81.4% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total