Hi all,
This isn’t a problem so much as a question about how LAMMPS works with CUDA. One line in the documentation that confuses me is: “However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way”. In practice I have found this to be true as well; I’m just having trouble understanding why.
My understanding is that CUDA kernels from different application contexts are serialized on the GPU, running one after another in the order they were submitted. If so, I would expect a small delay each time the GPU switches contexts, and therefore I would expect multiple MPI processes sharing a single GPU to be slightly slower, not faster.
I have also read part of the paper here:
and it sounds like there might be some sort of multitasking application context switching (see “10x Faster Application Context Switching”), but I can’t find any details about it anywhere else (the same goes for Hyper-Q). I have also read many forum posts where people say there is no multitasking application context switching in CUDA, so I’m not sure what to believe. Even if it does exist, I would still expect GPU sharing to be slower because of the context-switch delays.
The only explanation I can think of is that enough work is being done on the CPU that dividing it among multiple MPI processes more than makes up for the CUDA context-switching delays.
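To make that hypothesis concrete, here is a back-of-envelope timing model (my own toy sketch, not anything from the LAMMPS documentation; the work and overhead numbers are made-up assumptions). If the CPU portion of a timestep shrinks with the number of ranks while the serialized GPU portion stays fixed, sharing the GPU can still win even with a per-rank context-switch cost:

```python
def step_time(ranks, cpu_work=10.0, gpu_work=2.0, ctx_switch=0.05):
    """Estimated wall time per timestep, in arbitrary units.

    cpu_work   -- CPU portion of the step; divided across MPI ranks
    gpu_work   -- GPU portion; kernels from different contexts are
                  assumed to serialize, so this does not shrink
    ctx_switch -- hypothetical per-rank context-switch overhead
    """
    return cpu_work / ranks + gpu_work + ranks * ctx_switch

# One rank owning the GPU vs. four ranks sharing it:
print(step_time(1))   # 10.0 + 2.0 + 0.05 = 12.05
print(step_time(4))   # 2.5  + 2.0 + 0.20 = 4.70
```

Under these assumed numbers the CPU-bound part dominates, so four ranks sharing the GPU come out well ahead despite paying four times the switching overhead; only if `ctx_switch` were very large or `cpu_work` very small would one rank per GPU win.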
Hopefully someone can enlighten me!
Thanks,
Steve