LAMMPS GPU package: max CPU threads per GPU

Dear LAMMPS,
I have LAMMPS installed with the GPU package. My system has 2 GPUs (M2090) and 12 CPU cores per node. The manual suggests creating as many CPU threads per GPU as possible for maximum performance, but I observed that LAMMPS cannot run if the number of CPU threads per GPU is more than 4.
Is that really the case, or am I missing something here?
Thanks

> Dear LAMMPS,
> I have LAMMPS installed with the GPU package. My system has 2 GPUs (M2090) and
> 12 CPU cores per node. The manual suggests creating as many CPU threads
> per GPU as possible for maximum performance, but I observed that LAMMPS

That is not exactly what the manual says; you have to experiment.
Efficient use of GPUs cannot be achieved by such a simple rule.

More CPU tasks per GPU also means more overhead, and
once a GPU is saturated, you cannot gain additional performance
except through factors that have nothing to do with the GPUs.
Also, the efficiency of (current) GPUs depends quite a bit on the
number of atoms per process. If you go below a certain number,
the efficiency will drop until the GPU code becomes slower than
running on the CPU.

> cannot run if the number of CPU threads per GPU is more than 4.

Please provide more details; there is no problem in principle.
I have been running up to 16 CPU processes on a single GPU
(not very efficient, but technically possible).
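For reference, oversubscribing a GPU with several MPI ranks is just a matter of the launch command plus the `package gpu` line in the input script. A minimal sketch, assuming the executable name `lmp_openmpi`, the script name `in.lj`, and the GPU-package syntax of that LAMMPS era (all placeholders, not a tested recipe):

```
# 12 MPI ranks shared across GPUs 0 and 1 (6 ranks per GPU)
mpirun -np 12 lmp_openmpi -sf gpu -in in.lj

# inside in.lj: use GPUs 0..1, let LAMMPS pick the CPU/GPU work split
package gpu force/neigh 0 1 -1
```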

> Is that really the case, or am I missing something here?

Using GPUs well does require some technical knowledge
and common sense.

axel.

Thanks Axel,
I am playing around with various combinations of CPU and GPU cores to get optimized performance, although for my current system size I don't get a significant performance enhancement by increasing the number of CPU cores per GPU beyond 2 cores.
We have just got this machine and are using the GPU (CUDA) version for the first time, which is why I want to make sure everything is correct. Here are the relevant details:
System size: ~15000 atoms
pair_style lj/charmm/coul/long/gpu, kspace_style pppm/gpu

Intel® Xeon® CPU X5650 (12 CPU cores), two M2090 Tesla cards
CUDA version 4.0
compiled with OpenMPI 1.4.3 (Intel compilers version 12)
using KISS FFT.
Thanks for your help.
As far as common sense is concerned, people have plenty of that; you just don't seem to get that. Since you work on LAMMPS a lot, other people's questions may seem 'common senseless' to you.


> Thanks Axel,
> I am playing around with various combinations of CPU and GPU cores to
> get optimized performance, although for my current system size I don't get
> a significant performance enhancement by increasing the number of CPU cores
> per GPU beyond 2 cores.
> We have just got this machine and are using the GPU (CUDA) version for the
> first time, which is why I want to make sure everything is correct. Here are
> the relevant details:
> System size: ~15000 atoms

That is rather little to get good GPU acceleration
across multiple GPUs with many CPU cores.

> pair_style lj/charmm/coul/long/gpu, kspace_style pppm/gpu

Try running with PPPM on the CPU. You have rather
little work to distribute, and the pair styles are more
easily accelerated than PPPM. If you run PPPM on
the CPU, it can run concurrently with the pair
style and thus may be faster overall. You can also
play around with the Coulomb cutoff to adjust the
balance between pair and k-space.
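Concretely, the split Axel describes only requires keeping the GPU suffix on the pair style while leaving the kspace_style unsuffixed. A minimal input sketch (the cutoffs and PPPM accuracy are illustrative values, not tuned for this system):

```
# pair interactions on the GPU, long-range PPPM on the CPU
pair_style   lj/charmm/coul/long/gpu 8.0 10.0   # inner/outer cutoffs are examples
kspace_style pppm 1.0e-4                        # plain pppm -> runs on the CPU
```

Lengthening the Coulomb cutoff then shifts work from the (CPU) k-space computation onto the (GPU) pair computation, and vice versa.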

axel

To add to Axel's comment: with only 15K atoms
and 12 CPU cores (6 per GPU, I think you said),
each CPU is only giving ~1.25K atoms to
the GPU to work on in its turn, which may be too
small a chunk of work to get good GPU performance.

So you could try a larger problem to test scalability,
speed-up, etc.
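One easy way to run that larger test is the LAMMPS `replicate` command, which tiles the existing simulation box. A sketch, assuming the input already has a read_data line (the 2x2x2 factors are arbitrary):

```
# after read_data: make the system 8x larger (~120K atoms from ~15K)
replicate 2 2 2
```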

Steve

As has been stated, the optimal number of CPU cores per GPU varies with the simulation, the number of particles, the GPU, and the NVIDIA driver. Future drivers and Kepler GPUs (Hyper-Q) might allow more processes per GPU. In most cases, with fewer particles you need fewer MPI processes per GPU.

I will add, however, that when using long-range electrostatics with PPPM, the verlet/split run style with GPU acceleration can significantly improve performance. For CPU-only runs, I typically get the best performance with one PPPM process per NUMA node. For accelerated runs, I can get better performance with more than one PPPM process per NUMA node. Unfortunately, I find this option a little difficult to use in LAMMPS right now.
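For context, verlet/split needs a two-partition run: a larger partition does the pair (and GPU) work while a smaller one does k-space. A hedged sketch, assuming a placeholder executable `lmp_openmpi` and script `in.lj`; the 12/4 rank split is illustrative (the first partition's size must be an integer multiple of the second's):

```
# 16 ranks split 12/4: partition 1 runs pair + GPU, partition 2 runs PPPM
mpirun -np 16 lmp_openmpi -partition 12 4 -in in.lj

# in the input script:
run_style verlet/split
```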

There is some limited evaluation of this presented for Cray XK6 nodes in this paper:

http://www.sciencedirect.com/science/article/pii/S187705091200141X

- Mike