K20 - understanding the cuda-proxy

Dear all,

I compiled LAMMPS with the GPU package on an Intel-based system with
16 cores and four K20 cards, using the compile options listed below.

The K20 cards are in compute mode 0/Default.
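
(As a quick check outside of LAMMPS, a small CUDA test program along
these lines, my own sketch and not part of LAMMPS, prints the compute
mode per card; mode 0 is Default, which lets several host processes
share one card.)

#include <cstdio>
#include <cuda_runtime.h>

// Print the compute mode of every visible GPU.
// 0 = Default: multiple host processes may share the card.
int main() {
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < ndev; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d (%s): computeMode = %d\n", i, prop.name, prop.computeMode);
    }
    return 0;
}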

I can start an nvidia-cuda-proxy-control daemon and then speed up a
single serial LAMMPS process about 10 times(!) by using one GPU card.
I can also run 16 MPI processes: nvidia-smi then shows 16 GPU
processes, with each GPU card handling four of them (one GPU process
per MPI rank).
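
(For clarity, the rank-to-card mapping I mean is the usual round-robin
one; the sketch below is my own MPI+CUDA illustration of it, not the
actual GPU package code.)

#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Each MPI rank picks a GPU round-robin, so 16 ranks on a node with
// 4 cards come out as 4 ranks per card.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) {
        fprintf(stderr, "rank %d: no GPUs visible\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int dev = rank % ndev;   // round-robin assignment
    cudaSetDevice(dev);
    printf("rank %d -> GPU %d of %d\n", rank, dev, ndev);

    MPI_Finalize();
    return 0;
}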

What I can't manage is to use Hyper-Q via the CUDA proxy with more
than one GPU card: the cuda-proxy always loads on GPU 0. When I start
two or more MPI processes while the proxy daemon is running,
nvidia-smi shows only GPU 0 using processing power and memory.
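
(To check this independently of LAMMPS, a tiny test like the sketch
below, again my own code, could be started once per card with a
different device ID while the proxy daemon is running; nvidia-smi then
shows which physical card actually does the work.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Spin kernel: keeps the selected GPU busy for a while so the load
// is clearly visible in nvidia-smi.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy wait */ }
}

int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // requested device ID
    if (cudaSetDevice(dev) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed\n", dev);
        return 1;
    }
    // ~20e9 cycles is roughly half a minute at a ~700 MHz K20 core clock.
    spin<<<1, 1>>>(20000000000LL);
    cudaDeviceSynchronize();
    printf("finished on requested device %d\n", dev);
    return 0;
}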

Is this the expected behaviour? It would be great if I could use all
13 Hyper-Q connections per card on all four cards simultaneously while
running, e.g., a 16-process MPI job on the CPU. Otherwise, on
mainboards with fewer cores than available Hyper-Q connections, all
but single-core calculations would never reach optimal speed-up.
Has anyone managed to get the cuda-proxy to spread multiple CPU
processes over multiple GPU cards?

Greetings, Pim

Here are my GPU compile options:

CUDA_HOME = /usr/local/cuda
NVCC = nvcc

CUDA_ARCH = -arch=sm_35

CUDA_PRECISION = -D_SINGLE_SINGLE
CUDA_INCLUDE = -I$(CUDA_HOME)/include
CUDA_LIB = -L$(CUDA_HOME)/lib64
CUDA_OPTS = -DUNIX -O3 -Xptxas -v --use_fast_math

CUDR_CPP = mpic++ -DCUDA_PROXY -DMPI_GERYON -DUCL_NO_EXIT -DMPICH_IGNORE_CXX_SEEK
CUDR_OPTS = -O2 -ftree-vectorize
# -march=bdver1

BIN_DIR = ./
OBJ_DIR = ./
LIB_DIR = ./
AR = ar
BSH = /bin/sh

CUDPP_OPT = -DUSE_CUDPP -Icudpp_mini

include Nvidia.makefile

Hi Pim,

The proxy documentation that I have (Oct 2012) states:

"Currently, proxy is only supported on single-GPU machines. This is a
limitation that
may be removed in a future release."

I think that, for now, you will need to use default mode (without
Hyper-Q) in order to use all of the cards with 16 MPI tasks. Depending
on your simulation, the performance impact can be small.

- Mike