Use 8CPU+2GPU with GPU compute mode "exclusive process"

Dear list, dear developers,

I am running LAMMPS with the GPU package on a cluster that has 8 CPUs
and 2 GPUs (Tesla S2050, CUDA 5.0) per node. I compiled it with
mvapich2/1.9__intel-2013. nvidia-smi tells me that the compute mode of
the GPUs is "Exclusive Process".

So far, I managed to run 8 CPU + 0 GPU; 1 CPU + 1 GPU; 1 CPU + 2 GPU.
More than 1 CPU with a non-zero number of GPUs results in

LAMMPS (30 Oct 2013)
package gpu force/neigh 0 0 1
ERROR: Accelerator sharing is not currently supported on system
(../gpu_extra.h:47)

From previous conversations on this list, it seemed to me that running
several CPUs with GPU support should be possible if the compute mode
of the GPUs was "Default". So I asked the cluster people if there was
a way to run using all CPUs and GPUs of a node to which they replied

... The limitation is from LAMMPS itself -- you are restricted to
one MPI process per GPU. ... The only way to utilise [more CPUs
than GPUs] would be if the LAMMPS processes could multi-thread.
...

Could someone please advise me how to proceed?

Thank you,
Sebastian.

--
Dr. Sebastian Busch
University of Oxford
Department of Biochemistry
Laboratory of Molecular Biophysics
South Parks Road, Oxford, OX1 3QU, UK
+44 1865 61 33 11
[email protected]...
http://www2.bioch.ox.ac.uk/mclaingroup/Sebastian.html

Dear list, dear developers,

I am running LAMMPS with the GPU package on a cluster that has 8 CPUs
and 2 GPUs (Tesla S2050, CUDA 5.0) per node. I compiled it with
mvapich2/1.9__intel-2013. nvidia-smi tells me that the compute mode of
the GPUs is "Exclusive Process".

So far, I managed to run 8 CPU + 0 GPU; 1 CPU + 1 GPU; 1 CPU + 2 GPU.
More than 1 CPU with a non-zero number of GPUs results in

LAMMPS (30 Oct 2013)
package gpu force/neigh 0 0 1
ERROR: Accelerator sharing is not currently supported on system
(../gpu_extra.h:47)

2 CPU and 2 GPU is also possible, but you need to properly configure
it in the "package gpu" command. it defaults to using only the first
GPU. 1 CPU and 2 GPU is not happening. the second GPU is ignored.
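
for example, a minimal sketch (the binary name lmp_gpu and input file
in.melt are placeholders, not from your setup):

mpirun -np 2 lmp_gpu -in in.melt

# in the LAMMPS input file, before the pair_style line:
# GPU IDs 0 through 1, split -1 = balance CPU/GPU work dynamically
package gpu force/neigh 0 1 -1

the first two numbers are the first and last GPU ID to use, so "0 1"
makes both GPUs available to the two MPI ranks.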

From previous conversations on this list, it seemed to me that running
several CPUs with GPU support should be possible if the compute mode

it is. it is done regularly on various machines, including the #2
machine in the top500 list hosted at ORNL.

of the GPUs was "Default". So I asked the cluster people if there was
a way to run using all CPUs and GPUs of a node to which they replied

... The limitation is from LAMMPS itself -- you are restricted to
one MPI process per GPU. ... The only way to utilise [more CPUs
than GPUs] would be if the LAMMPS processes could multi-thread.
...

Could someone please advise me how to proceed?

this reply you got is incorrect and is a "myth-understanding" that is
percolating through the sysadmin circles. even nvidia experts
occasionally spread this. however, this is a misunderstanding of a
message stating that using multi-threading is *more efficient* than
attaching multiple processes. other parallel classical MD packages use
oversubscribing via multiple processes as well to increase efficiency.
for a Tesla S2050 you should see a significant speedup at least until
using 2x as many MPI processes as GPUs. beyond that the speedup will
be slow (with K20 you can probably go up to 4 or even 6 depending on
the input). also when using kspace you may test if running pppm on
the CPU while doing pair on the GPU is faster.
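
a hedged sketch of that last suggestion (the pair style, cutoff, and
tolerance are placeholders, not taken from your input):

# 4 MPI ranks sharing 2 GPUs; needs "Default" compute mode (or the
# CUDA 5.5 daemon mentioned below)
mpirun -np 4 lmp_gpu -in in.melt

# in the input: append /gpu only to the pair style, keep pppm on the CPU
package gpu force/neigh 0 1 -1
pair_style lj/cut/coul/long/gpu 10.0
kspace_style pppm 1.0e-4

leaving out the global "-suffix gpu" switch is what keeps kspace on
the host while the pair computation runs on the GPUs.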

with CUDA 5.5 there is a (multi-threaded) "GPU-daemon" that should
allow using the GPUs even in dedicated mode, as this somehow combines
the benefits of using MPI for parallelization on the CPU with
multi-threading inside a single GPU context for more efficient
concurrent GPU utilization.
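
this "GPU-daemon" is what later CUDA releases call the Multi-Process
Service (MPS); roughly, treating the exact binary name and setup as an
assumption to check against your CUDA version:

# started once per node (by the admin or a job prologue)
export CUDA_VISIBLE_DEVICES=0,1
nvidia-cuda-mps-control -d

# MPI ranks launched afterwards share the GPUs through the daemon
mpirun -np 8 lmp_gpu -in in.melt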

in short, you'll have to tell your cluster people that they have to
get their facts straight.

axel.

Hello,

Dear list, dear developers,

I am running LAMMPS with the GPU package on a cluster that has 8
CPUs and 2 GPUs (Tesla S2050, CUDA 5.0) per node. I compiled it
with mvapich2/1.9__intel-2013. nvidia-smi tells me that the
compute mode of the GPUs is "Exclusive Process".

So far, I managed to run 8 CPU + 0 GPU; 1 CPU + 1 GPU; 1 CPU + 2
GPU. More than 1 CPU with a non-zero number of GPUs results in

LAMMPS (30 Oct 2013)
package gpu force/neigh 0 0 1
ERROR: Accelerator sharing is not currently supported on system
(../gpu_extra.h:47)

2 CPU and 2 GPU is also possible, but you need to properly
configure it in the "package gpu" command. it defaults to using
only the first GPU. 1 CPU and 2 GPU is not happening. the second GPU
is ignored.

Thanks. For the record: I didn't get 2 CPU + 2 GPU to work initially
because I had set
package gpu force/neigh 0 1 -1
in the LAMMPS input file but had also used the "-suffix gpu"
command-line argument, which defaults to
package gpu force/neigh 0 0 1
and crashes before it gets to process the input file.
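
For anyone hitting the same thing, a sketch of the two invocations
(binary and input file names are placeholders):

# crashes in "Exclusive Process" mode: -suffix gpu implies the default
# "package gpu force/neigh 0 0 1" before the input file is read
mpirun -np 2 lmp_gpu -suffix gpu -in in.melt

# works: no -suffix gpu; the input file itself contains
# "package gpu force/neigh 0 1 -1" and explicit /gpu styles
mpirun -np 2 lmp_gpu -in in.melt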

From previous conversations on this list, it seemed to me that
running several CPUs with GPU support should be possible if the
compute mode

it is. it is done regularly on various machines, including the #2
machine in the top500 list hosted at ORNL.

of the GPUs was "Default". So I asked the cluster people if there
was a way to run using all CPUs and GPUs of a node to which they
replied

... The limitation is from LAMMPS itself -- you are restricted
to one MPI process per GPU. ... The only way to utilise [more
CPUs than GPUs] would be if the LAMMPS processes could
multi-thread. ...

Could someone please advise me how to proceed?

this reply you got is incorrect and is a "myth-understanding" that
is percolating through the sysadmin circles. even nvidia experts
occasionally spread this. however, this is a misunderstanding of a
message stating that using multi-threading is *more efficient*
than attaching multiple processes. other parallel classical MD
packages use oversubscribing via multiple processes as well to
increase efficiency. for a Tesla S2050 you should see a significant
speedup at least until using 2x as many MPI processes as GPUs.
beyond that the speedup will be slow (with K20 you can probably go up
to 4 or even 6 depending on the input). also when using kspace you
may test if running pppm on the CPU while doing pair on the GPU is
faster.

I will test that.

with CUDA 5.5 there is a (multi-threaded) "GPU-daemon" that should
allow using the GPUs even in dedicated mode, as this somehow
combines the benefits of using MPI for parallelization on the CPU
with multi-threading inside a single GPU context for more
efficient concurrent GPU utilization.

in short, you'll have to tell your cluster people that they have
to get their facts straight.

axel.

Thank you, Sebastian.

I reported this back to the cluster people and it seems they had
chosen "Exclusive Process" because it works best for the majority of
programs. At the same time, they'll set a node back to "Default"
and we will experiment with LAMMPS on that one. Also, they are
planning to upgrade to CUDA 5.5 so maybe we'll test that as well at a
later stage.
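
For reference, the compute mode can be checked (and, with root
privileges, changed) via nvidia-smi; a sketch, assuming GPU ID 0:

# query the current compute mode of all GPUs
nvidia-smi -q | grep -i "compute mode"

# set GPU 0 back to Default (0 = Default, 3 = Exclusive Process)
nvidia-smi -i 0 -c 0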

Thanks again for your help and best regards,
Sebastian.
