Benchmarking new hardware


The tests you were running had three issues.
(i) Without -npersocket you might have gotten core binding (i.e. each process is pinned to a single core), in which case all of your threads fight for that one core and the run almost deadlocks. If you compile with hwloc support, Kokkos can detect this issue and will stop with a runtime error.

(ii) Running such a small simulation over 4 GPUs will not work well. You will want something like 100k atoms per GPU for reasonable performance.
(iii) By default (though we want to change that) I think communication runs on the host, which will again trap you in situation (i) and also adds data transfers between host and GPU. What you want is a setting where the communication algorithms are performed on the GPU. Add this to your input script:

package kokkos neigh full comm/forward device comm/exchange device
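
Issues (i) and (ii) can be sanity-checked before launching a long run. The following is only a rough Python sketch, not part of LAMMPS or Kokkos: the function names are made up for illustration, the 100k threshold is the rule of thumb from (ii), and the affinity query relies on os.sched_getaffinity, which is available on Linux only. To see the binding that MPI actually applies, the script would have to be started through the same mpirun command line as the simulation.

```python
import os

# Rule of thumb from point (ii): roughly 100k atoms per GPU.
MIN_ATOMS_PER_GPU = 100_000


def enough_work(total_atoms, n_gpus, minimum=MIN_ATOMS_PER_GPU):
    """Return True if each GPU gets at least `minimum` atoms on average."""
    return total_atoms / n_gpus >= minimum


def cores_available():
    """Number of cores this process is allowed to run on.

    Uses the Linux-only os.sched_getaffinity; falls back to the total
    core count elsewhere, which does NOT reflect any binding.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def threads_oversubscribed(n_threads, n_cores):
    """Return True if more threads are requested than cores are available."""
    return n_threads > n_cores


if __name__ == "__main__":
    # OMP_NUM_THREADS is assumed to hold the per-process thread count.
    n_threads = int(os.environ.get("OMP_NUM_THREADS", "1"))
    n_cores = cores_available()
    if threads_oversubscribed(n_threads, n_cores):
        print(f"warning: {n_threads} threads share {n_cores} core(s)")

    # Hypothetical problem size, only to show the check from point (ii).
    if not enough_work(total_atoms=32_000, n_gpus=4):
        print("warning: fewer than 100k atoms per GPU")
```

If the first warning appears when the script is launched under mpirun, the process is bound to fewer cores than it has threads, which is the situation described in (i).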