I compiled lammps with the GPU package on an intel-based K20 system
with 16 cores and 4 K20 cards. I used the options mentioned below.
The K20 cards are in state 0/Default.
I can start an nvidia-cuda-proxy-control daemon and then speed up a
single thread of lammps about 10 times(!) by using one gpu card.
I can also run 16 mpi processes and see in nvidia-smi that there are
16 gpu processes, with each gpu card handling 4 of them, one per MPI rank.
What I don't seem to manage is to use Hyper-Q via the cuda proxy on
more than one gpu card: the cuda proxy always loads on gpu 0.
Also, when I start two or more processes while the cuda proxy daemon is
running, only gpu 0 shows up in nvidia-smi as using processing power
and memory.
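One possible explanation (an assumption on my part, based on how later MPS releases behave): a single proxy daemon only serves the device(s) visible at the moment it starts, so everything funnels to gpu 0. If that is the case, a workaround would be one daemon per card, each restricted to its own device and pipe/log directories. This is only a sketch; the directory paths are placeholders, and the environment variable names CUDA_MPS_PIPE_DIRECTORY / CUDA_MPS_LOG_DIRECTORY are from later MPS releases, so the CUDA 5.0-era proxy may use different names:

```shell
# Start one proxy daemon per K20; each daemon sees exactly one card
# and gets its own pipe and log directories (paths are placeholders).
for GPU in 0 1 2 3; do
    mkdir -p /tmp/proxy_pipe_$GPU /tmp/proxy_log_$GPU
    CUDA_VISIBLE_DEVICES=$GPU \
    CUDA_MPS_PIPE_DIRECTORY=/tmp/proxy_pipe_$GPU \
    CUDA_MPS_LOG_DIRECTORY=/tmp/proxy_log_$GPU \
    nvidia-cuda-proxy-control -d
done
```

Each client process would then have to export the pipe directory of the daemon it should attach to before launching lammps.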
Is this the expected behaviour? It would be great if I could use all
13 Hyper-Q threads per card on all 4 cards simultaneously while
running, e.g., a 16-rank mpi job on the cpu.
Otherwise, on mainboards with fewer cores than available Hyper-Q
threads, all but single-core calculations could never make full use of
the Hyper-Q capacity.
Did anyone manage to get the cuda-proxy to fully spread multiple cpu
threads over multiple gpu cards?
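If one proxy daemon were running per card (each with its own pipe directory), every MPI rank would still need to select the card matching its local rank. A minimal sketch of the round-robin mapping, to be run per rank before starting lammps; the rank variable OMPI_COMM_WORLD_LOCAL_RANK is Open MPI-specific and the pipe directory path is a hypothetical placeholder:

```shell
# Map a local MPI rank onto one of NGPUS cards, round-robin.
gpu_for_rank() {
    echo $(( $1 % $2 ))   # $1 = local rank, $2 = number of gpus
}

NGPUS=4
RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
GPU=$(gpu_for_rank "$RANK" "$NGPUS")

# Attach this rank to the proxy instance serving its card
# (hypothetical per-gpu pipe directory).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/proxy_pipe_$GPU
echo "rank $RANK -> gpu $GPU"
```

With 4 cards and 16 ranks this spreads the ranks 0,1,2,3,0,1,... across the gpus, 4 ranks per card.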
Here are my gpu compile options:
CUDA_HOME = /usr/local/cuda
NVCC = nvcc
CUDA_ARCH = -arch=sm_35
CUDA_PRECISION = -D_SINGLE_SINGLE
CUDA_INCLUDE = -I$(CUDA_HOME)/include
CUDA_LIB = -L$(CUDA_HOME)/lib64
CUDA_OPTS = -DUNIX -O3 -Xptxas -v --use_fast_math
CUDR_CPP = mpic++ -DCUDA_PROXY -DMPI_GERYON -DUCL_NO_EXIT
CUDR_OPTS = -O2 -ftree-vectorize
BIN_DIR = ./
OBJ_DIR = ./
LIB_DIR = ./
AR = ar
BSH = /bin/sh
CUDPP_OPT = -DUSE_CUDPP -Icudpp_mini