USER-CUDA: "package cuda gpu/node 3" doesn't use 3 gpus if run standalone

Hi,

I have observed a strange behavior. It is probably something trivial
that I am just not seeing. I am using the standard input script
in.melt_2.5.cuda from examples/USER/cuda/. One of the first
instructions configures the number of GPUs to use. Since I have three
of them, I instruct LAMMPS to use 3 GPUs:

package cuda gpu/node 3
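
For reference, the top of the script looks roughly like this (trimmed;
apart from the package line, the values below are only illustrative
placeholders, not copied verbatim from the example):

package         cuda gpu/node 3
units           lj
atom_style      atomic
lattice         fcc 0.8442
pair_style      lj/cut 2.5
...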

Now, when I run LAMMPS with this command:

~/opt/lammps/lmp_kid_my_mv18c32f321_nocufft -sf cuda < ../lammps-input/in.melt_2.5.3.cuda

LAMMPS uses only 1 GPU. This is confirmed by the LAMMPS output:

$ ~/opt/lammps/lmp_kid_my_mv18c32f321_nocufft -sf cuda < ../lammps-input/in.melt_2.5.3.cuda
LAMMPS (1 Jul 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:396)
# CUDA: Activate GPU
# Using device 0: Tesla M2090
Lattice spacing in x,y,z = 1.16961 1.16961 1.16961
Created orthogonal box = (0 0 0) to (46.7843 46.7843 46.7
....

However, if I run LAMMPS with mpiexec on the same input file, it uses
3 GPUs as expected.

$ mpiexec -np 3 -npernode 3 ~/opt/lammps/lmp_kid_my_mv18c32f321_nocufft -sf cuda < ../lammps-input/in.melt_2.5.3.cuda
LAMMPS (1 Jul 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:396)
# CUDA: Activate GPU
# Using device 0: Tesla M2090
Lattice spacing in x,y,z = 1.16961 1.16961 1.16961
# Using device 1: Tesla M2090
Created orthogonal box = (0 0 0) to (46.7843 46.7843 46.7843)
# Using device 2: Tesla M2090
  1 by 1 by 3 MPI processor grid

Do I have to specify other options for a run that is not controlled
by mpiexec? I thought it was sufficient to request 3 GPUs in the input
script. Do I have to state this requirement anywhere else for a
standalone execution?

Best,
Magda

Sorry, please just forget about the mpiexec case; the answer there is
obvious: I launch 3 LAMMPS processes and each of them uses 1 GPU. The
question remains why the standalone run does not use all 3.

Magda

Hi Magda

To use more than one GPU from a single MPI process, one would have to
handle separate synchronization streams, switch the GPU context very
often, and so on. It is a lot of hassle and would probably incur
overhead that makes it slower than just using one MPI process per GPU.
Hence you can only use one GPU per MPI process. The gpu/node keyword
just tells the USER-CUDA package how many GPUs are available per node,
not how many a single process should use.
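
So, to actually drive all three GPUs, keep "package cuda gpu/node 3"
in the script and launch one MPI rank per GPU, exactly as in your
second run:

mpiexec -np 3 -npernode 3 ~/opt/lammps/lmp_kid_my_mv18c32f321_nocufft -sf cuda < ../lammps-input/in.melt_2.5.3.cuda

Each rank then picks up its own device (0, 1 and 2), which is what
your second log shows.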

Christian
