How to use all GPUs to run a single LAMMPS job with the GPU package

Dear Sir,

My name is Yang Yang. I am trying to run LAMMPS (compiled with the
GPU package) on Tesla GPUs with CUDA. We have 4 nodes with 2 GPUs each.

I've tried using "fix 0 all gpu force/neigh 0 1 -1" to enable two GPUs,
and './lmp_g++3 < in.melt_gpu_2.5' to run the job.
The output on the screen looks fine, but running on 2 GPUs gives the
same speed as running on 1 GPU, which does not seem correct, and we
need your help.

how do you conclude that using more GPUs has to be faster?
how much acceleration you get from GPUs depends a _lot_
on the problem that you are simulating, and a Lennard-Jones
potential with no bonds and a short 2.5 sigma cutoff is about
the worst case scenario for GPU-accelerated MD.

in order to get decent acceleration from GPUs, you need a
sufficient workload for each GPU, with enough arithmetic
intensity and enough concurrency, so that the compute
units of the GPUs are well utilized. if that is not the case,
GPUs can quickly turn into _DE_celerators.
BTW: the same is true for CPUs, but it is less obvious,
since you need _much_ less concurrency to utilize a CPU.
CPUs are designed to handle complexity well; GPUs are
optimized for concurrency. as for performance, have a look
at the benchmark numbers that mike brown posted here.
those should give you a handle on what is possible.

http://users.nccs.gov/~wb8/gpu/keeneland.htm
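
as a rough sketch, here is what a scaled-up melt-style input for the
GPU package of that era might look like. the box size and run length
below are illustrative numbers, not tuned values; the point is that a
larger box gives each GPU enough concurrent work:

units        lj
atom_style   atomic
lattice      fcc 0.8442
region       box block 0 40 0 40 0 40   # larger box -> more atoms per GPU
create_box   1 box
create_atoms 1 box
mass         1 1.0
velocity     all create 3.0 87287 loop geom
pair_style   lj/cut/gpu 2.5             # plain gpu pair style, no multi/gpu
pair_coeff   1 1 1.0 1.0 2.5
neighbor     0.3 bin
fix          0 all gpu force/neigh 0 1 -1   # use GPUs 0 and 1
fix          1 all nve
run          100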

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
- with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla M2090, 512 cores, 5.2/5.2 GB, 1.3 GHZ (Double Precision)
GPU 1: Tesla M2090, 512 cores, 5.2/5.2 GB, 1.3 GHZ (Double Precision)
--------------------------------------------------------------------------
Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
--------------------------------------------------------------------------
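
BTW: a quick way to verify that both GPUs actually get work during a
run is to watch their utilization from a second shell (this assumes
the NVIDIA driver tools are installed on the node):

# refresh the GPU utilization display once per second
watch -n 1 nvidia-smi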

Do we want to add more arguments in the input file, such as
"pair_style lj/cut/gpu multi/gpu 2 2.5", or do we need to run the job with

this is obsolete syntax and no longer supported.

mpirun?

yes, absolutely. the GPU package will only utilize one GPU
per MPI task. multiple MPI tasks can share one GPU, but one
MPI task cannot drive multiple GPUs; there would currently be
no benefit to that anyway.
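
for example, to use both GPUs on one node, you would launch one MPI
task per GPU. a minimal sketch, using the binary and input file names
from your post (assuming lmp_g++3 was built against MPI):

# two MPI tasks, one per GPU; the fix gpu line in the input
# (fix 0 all gpu force/neigh 0 1 -1) assigns GPUs 0 and 1
mpirun -np 2 ./lmp_g++3 < in.melt_gpu_2.5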

axel.

Dear Axel,

Thank you for your detailed reply; I've learned a lot from your email!

Best wishes,
Yang