The reason I want to try MPS is that I don't see good scaling when I increase the number of GPUs and map multiple MPI processes to a GPU.
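For context, this is roughly how I intend to enable MPS before launching the job. This is only a sketch: the pipe/log directories, the binary name lmp_gpu, and the input file in.lj are placeholders, not my exact setup.

```shell
# Sketch of enabling CUDA MPS for an MPI job.
# Directory paths and the launch line below are illustrative placeholders.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d           # start the MPS control daemon

mpirun -np 16 ./lmp_gpu -in in.lj    # ranks now share the GPU through MPS

echo quit | nvidia-cuda-mps-control  # shut the daemon down afterwards
```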
For example with the LJ benchmark (scaled), these are the performance numbers I get on a DGX-1 station.
32 MPI processes (CPU only), 1M atoms -> 36 timesteps/sec
16 MPI processes (CPU + 1 V100 GPU), 1M atoms -> 125 timesteps/sec -> nice speedup of ~3.5x
24 MPI processes (CPU + 2 V100 GPUs), 1M atoms -> 142 timesteps/sec -> poor scaling
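In case the rank-to-GPU mapping matters here: for the 2-GPU run, the assignment amounts to a small round-robin wrapper like the one below. This is a sketch; it assumes OpenMPI (which exports OMPI_COMM_WORLD_LOCAL_RANK), and the GPU count is hard-coded.

```shell
# gpu_bind.sh -- hypothetical wrapper that pins each local MPI rank to one
# of the node's GPUs round-robin before running the real command.
# Assumes OpenMPI sets OMPI_COMM_WORLD_LOCAL_RANK; NGPUS is hard-coded.
NGPUS=2
local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$((local_rank % NGPUS))
echo "local rank $local_rank -> GPU $CUDA_VISIBLE_DEVICES"
# exec "$@"   # usage: mpirun -np 24 ./gpu_bind.sh ./lmp_gpu -in in.lj
```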
nvprof (the NVIDIA profiler) indicates that considerably more time is spent on memory copies between host and device in the latter case. For example, running the LJ benchmark with 32K atoms (the default size) and 2 MPI processes mapped to 1 V100 shows the following top-5 GPU activities on each rank:
Rank 0:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  47.19%   162.88ms  121    1.3461ms  1.5040us  4.3085ms  [CUDA memcpy DtoH]
                 24.24%   83.658ms  114    733.84us  1.3760us  869.98us  [CUDA memcpy HtoD]
                 17.92%   61.858ms  101    612.46us  592.00us  766.75us  k_lj_fast
                  9.97%   34.402ms    6    5.7336ms  5.6087ms  6.0567ms  calc_neigh_list_cell
                  0.04%   154.14us   48    3.2110us  3.1360us  3.4880us  void vectorAddUniform4

Rank 1:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  54.64%   237.65ms  121    1.9641ms  1.5360us  4.3147ms  [CUDA memcpy DtoH]
                 19.10%   83.090ms  114    728.86us  1.3760us  866.97us  [CUDA memcpy HtoD]
                 14.69%   63.887ms  101    632.54us  592.80us  766.43us  k_lj_fast
                 10.87%   47.298ms    6    7.8830ms  7.6667ms  8.5328ms  calc_neigh_list_cell
                  0.07%   306.82us   48    6.3920us  4.9600us  11.744us  void scan4
while running 2 MPI processes mapped to 2 V100s gives the following profile on each rank:
Rank 0:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  51.34%   258.81ms  121    2.1390ms  1.5360us  4.3919ms  [CUDA memcpy DtoH]
                 32.45%   163.58ms  114    1.4349ms  1.4080us  1.7810ms  [CUDA memcpy HtoD]
                 12.23%   61.632ms  101    610.22us  589.47us  765.08us  k_lj_fast
                  3.51%   17.706ms    6    2.9510ms  2.8554ms  3.2880ms  calc_neigh_list_cell
                  0.03%   155.81us   48    3.2450us  3.2000us  3.6800us  void vectorAddUniform4

Rank 1:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  51.35%   259.03ms  121    2.1407ms  1.5360us  4.3650ms  [CUDA memcpy DtoH]
                 32.42%   163.54ms  114    1.4345ms  1.4080us  1.7742ms  [CUDA memcpy HtoD]
                 12.22%   61.662ms  101    610.51us  591.20us  763.99us  k_lj_fast
                  3.53%   17.827ms    6    2.9712ms  2.8820ms  3.2960ms  calc_neigh_list_cell
                  0.03%   156.22us   48    3.2540us  3.1990us  3.6800us  void vectorAddUniform4
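In case someone wants to reproduce the per-rank profiles above: one nvprof log per rank can be produced with nvprof's %q{ENV} filename substitution (assuming OpenMPI, which exports OMPI_COMM_WORLD_RANK; the binary and input names are placeholders):

```shell
# Write one nvprof log per MPI rank via the %q{ENV} filename substitution.
# Assumes OpenMPI exports OMPI_COMM_WORLD_RANK; lmp_gpu/in.lj are placeholders.
mpirun -np 2 nvprof --log-file rank%q{OMPI_COMM_WORLD_RANK}.log \
    ./lmp_gpu -in in.lj
```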
The time spent in memcpy on rank 0 has increased from ~245 ms to ~420 ms, even though some of the kernel times have decreased as expected.
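One thing I have not ruled out is the PCIe/NVLink topology between the CPU cores the ranks run on and the two GPUs, which could affect host-device copy bandwidth; that layout can be inspected with nvidia-smi:

```shell
# Print the interconnect topology matrix (PCIe switches, NVLink, NUMA
# affinity) between GPUs and CPUs on this node.
nvidia-smi topo -m
```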
Could you please help me identify why this is the case?