Sharing a GPU Between Multiple MPI Processes

Hi all,

This isn’t a problem, but more a question about how LAMMPS works with CUDA. One line in the documentation that is confusing me is “However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way”. In practice, I have found this to be true as well. I’m just having trouble understanding why.

I also believe that CUDA kernels from different CUDA application contexts will run in series, one after another, in the order they were called. If so, I would expect a small delay while switching contexts, so I would expect multiple MPI processes using a single GPU to be slightly slower.

I have also read part of the paper here:

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

and it sounds like there might be some sort of multitasking application context switching (see “10x Faster Application Context Switching”), but I can’t find any details about it anywhere else (same with details on Hyper-Q). I have also read a lot of forum posts online where people say that there is no multitasking application context switching with CUDA, so I’m not sure what to believe. Even if it does exist, though, I would still expect the run to be slower because of delays while switching contexts.

The only possible reason I can think of is that enough work is being done on the CPU that, when it is divided up between MPI processes, it makes up for the delays in CUDA context switching.

Hopefully someone can enlighten me!

Thanks,

Steve

please note that the following comments refer mainly to the GPU package in LAMMPS.

> From my understanding, CUDA arch >= 2.0 has the ability to run multiple
> kernels at the same time as long as they’re running from the same CUDA
> context (assuming the kernels are small enough). Since each MPI process has
> a different CUDA context, I don’t believe this is what’s happening here.

how the CUDA side of things is handled is of little relevance, as long
as the GPU is kept well occupied. this is more likely to happen when
multiple kernels (or the same kernel on multiple data sets) can execute
concurrently. in principle, the total workload that can be offloaded to
the GPU doesn't change much whether you run with one or several MPI
tasks, same as for CPU-only execution. however, since the Neighbor and
Pair parts of the calculation (i.e. what is offloaded to the GPU)
usually take the majority of the total time, the question becomes how
much time is spent on the rest of the calculation: the GPU kernels
themselves (without the surrounding MD code) can easily provide a 50x
to 100x acceleration, i.e. they turn this part of the calculation into
a tiny fraction of the total time.

furthermore, the GPU kernels are launched asynchronously, i.e. the
computation on the CPU continues while the GPU is busy, and only after
the other force contributions are computed will LAMMPS poll the GPU for
the results of the GPU kernels (it is a bit more complex when you use
GPU acceleration for Kspace, but when using MPI and oversubscription,
that is rarely an efficient choice anyway). that will hide the cost of
the Pair part practically entirely, if the rest (Bond and Kspace) takes
longer. which in turn means that running with more MPI tasks will fully
exploit the MPI parallelism in those *non-accelerated* parts and thus
provide additional speedup. also, the entire rest of the calculation
(e.g. time integration, computes, etc.) runs on the CPU and thus can
benefit from MPI parallelization.
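
to make that launch pattern a bit more concrete, here is a minimal CUDA
sketch of the overlap. the pair_forces kernel and the cpu-side routine
are invented stand-ins for illustration, not the actual GPU package
code:

#include <cuda_runtime.h>
#include <cstdio>

/* stand-in for the offloaded pair interaction kernel */
__global__ void pair_forces(const float *x, float *f, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) f[i] = -x[i];
}

/* stand-in for the non-accelerated work (bonds, kspace, integration, ...) */
void compute_rest_on_cpu() {}

int main()
{
  const int n = 1 << 20;
  float *x, *f;
  cudaMalloc(&x, n * sizeof(float));
  cudaMalloc(&f, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  /* 1. the launch returns immediately; the GPU works in the background */
  pair_forces<<<(n + 255) / 256, 256, 0, stream>>>(x, f, n);

  /* 2. meanwhile the CPU computes the non-offloaded force contributions */
  compute_rest_on_cpu();

  /* 3. only now do we wait for ("poll") the GPU results */
  cudaStreamSynchronize(stream);

  printf("time step done\n");
  cudaFree(x);
  cudaFree(f);
  cudaStreamDestroy(stream);
  return 0;
}

if the cpu work in step 2 takes longer than the kernel from step 1, the
GPU time is hidden almost completely; splitting step 2 across more MPI
tasks is exactly where the additional speedup comes from.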

now, *how many times* you can oversubscribe the GPU and still see a
speedup depends on both the GPU hardware generation and its driver
support. that is where all the nvidia gimmicks can give you an extra
edge, e.g. on high-end machines like ORNL's Titan. but the fundamental
benefit comes from being able to parallelize the non-accelerated parts
of the code. so basically you are looking at a version of Amdahl's law:
it is the serial part of a program that limits how well you can
parallelize a code, not how efficient the parallelization is; the
latter is only a secondary concern.
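
as a back-of-the-envelope illustration of that Amdahl argument (the
fraction below is invented, not a measured LAMMPS timing), assume a
share s of the time step does not get faster with more MPI tasks (e.g.
the already saturated GPU kernels) while the rest splits evenly across
the tasks:

#include <cstdio>

int main()
{
  const double s = 0.2;  /* hypothetical non-scaling fraction of a time step */
  for (int n = 1; n <= 8; n *= 2) {
    /* classic Amdahl estimate: speedup = 1 / (s + (1 - s) / n) */
    double speedup = 1.0 / (s + (1.0 - s) / n);
    printf("%d MPI task(s) sharing the GPU -> estimated speedup %.2fx\n",
           n, speedup);
  }
  return 0;
}

with s = 0.2 the speedup can never exceed 1/s = 5x, no matter how many
tasks share the GPU; it is the serial fraction, not the parallel
efficiency, that sets the limit.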

HTH,
     axel.

Hi Axel,

Thanks, that was what I was looking for! I appreciate your help.

- Steve

Hi Steve,

there’s indeed overhead due to context switching when multiple MPI processes share the GPU, particularly when the device is in the default compute mode.

The Kepler GPUs on ORNL’s Titan are set to exclusive-process mode by default, which allows only one CUDA context to be created on the device. In this case, the CUDA 5.0 proxy server on Titan should be enabled so that multiple MPI processes can share the GPU; more importantly, they will then share the same context, so the context-switching overhead in this particular setting is removed.
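
For reference, here is a small sketch (assuming device 0 and using only the CUDA runtime API) that reports which compute mode a node’s GPU is currently in:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* device 0 assumed for illustration */

    const char *mode = "unknown";
    switch (prop.computeMode) {
        case cudaComputeModeDefault:          mode = "default";           break;
        case cudaComputeModeExclusive:        mode = "exclusive thread";  break;
        case cudaComputeModeProhibited:       mode = "prohibited";        break;
        case cudaComputeModeExclusiveProcess: mode = "exclusive process"; break;
    }
    printf("GPU 0 (%s) compute mode: %s\n", prop.name, mode);
    return 0;
}

In exclusive-process mode only one context can exist on the device, so several MPI ranks can share it only through the proxy server.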

For Fermi cards, because there’s only a single hardware work queue, there are false dependencies between kernels launched from multiple MPI processes sharing the GPU. Kepler cards support Hyper-Q, which provides multiple hardware work queues and thereby eliminates those false dependencies. The device can then overlap host-device data transfers from one MPI process with kernel execution from other MPI processes. This improved pipelining helps maximize device utilization, i.e. keeps as many SMs busy at a time as possible. That explains why oversubscribing the GPU can be beneficial with the GPU package, before Amdahl’s law kicks in.
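
To illustrate the kind of pipelining Hyper-Q enables, here is a small sketch that uses independent streams as stand-ins for work submitted by different MPI ranks through the proxy server (the kernel and the sizes are made up for illustration):

#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int nstreams = 4, n = 1 << 20;
    cudaStream_t streams[nstreams];
    float *host[nstreams], *dev[nstreams];

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost(&host[s], n * sizeof(float));  /* pinned, so copies can be asynchronous */
        cudaMalloc(&dev[s], n * sizeof(float));
    }

    /* Each stream issues an independent copy + kernel. With the single
       hardware work queue on Fermi these can end up serialized behind each
       other (false dependencies); with Hyper-Q on Kepler the copy of one
       stream can overlap the kernel of another. */
    for (int s = 0; s < nstreams; ++s) {
        cudaMemcpyAsync(dev[s], host[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(dev[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(host[s]);
        cudaFree(dev[s]);
    }
    return 0;
}

With the proxy server, kernels from separate MPI processes are in a similar situation to these streams: they share one context and can be distributed over the multiple hardware work queues.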

I agree with Axel’s comment that the speedup you can get from oversubscribing the GPU (with the GPU package) will depend on the GPU hardware, driver and toolkit. You can find more details on Hyper-Q in the CUDA documentation, and on multiple MPI processes sharing a GPU in a recent paper by Mike Brown (DOI: 10.1016/j.cpc.2013.08.002).

Best,

-Trung

Hi Trung,

Thanks! These things are good to know. I think that’s a good overview of how it works on the different architectures of CUDA-enabled GPUs, which is what I was trying to find. Hopefully these emails will help other people in the future as well.

- Steve