package user/cuda hangs on more than one node

Hi,

I'm interested in doing a performance study of LAMMPS, and in particular of the user/cuda package. I did not encounter any problems building LAMMPS, and running the examples in examples/USER/cuda works fine if I request only a single-node allocation. However, if I attempt to run with MPI ranks spread across two or more nodes, the simulation hangs at "Setting up run ...".

The specifics are:
Machine: Keeneland with 3 Tesla M2090 per node.
Allocation: 2 nodes with 3 processes per node.

%> mpirun -np 6 /lustre/medusa/biersdor/mylammps/src/lmp_keeneland -cuda on -sf cuda < in.melt_2.5.cuda
LAMMPS (5 Mar 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:396)
# CUDA: Activate GPU
# Using device 1: Tesla M2090
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
# Using device 0: Tesla M2090
# Using device 1: Tesla M2090
# Using device 2: Tesla M2090
   1 by 2 by 3 MPI processor grid
Created 256000 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 50000 atoms...
# CUDA: Using precision: Global: 4 X: 4 V: 4 F: 4 PPPM: 4
Setting up run ...
<<<process hangs>>>

Attaching gdb to the running jobs suggests the program might be stuck in MPI_Send, though I'm not sure. Has anyone encountered this problem before, or does anyone know of a workaround?
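(For reference, I attached roughly like this on one of the compute nodes, where <pid> is the pid of one hung lmp_keeneland rank; the backtraces are where MPI_Send showed up:

gdb -p <pid>
(gdb) bt
)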

Thanks,

- Scott Biersdorff

scott,

first of all, try without oversubscribing the GPUs,
i.e. use one MPI task per GPU not two.
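on keeneland that means at most 3 MPI tasks per node, e.g. something
like (untested, binary path taken from your command line):

mpirun -npernode 3 -np 6 /lustre/medusa/biersdor/mylammps/src/lmp_keeneland -cuda on -sf cuda < in.melt_2.5.cuda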

axel.

The GPU package will allow two (or more) MPI
tasks per GPU, but USER-CUDA will not.

Christian may want to comment on whether the
hang is a bug or whether there should be an
error message.

Steve

Hi Scott

actually I encountered a similar problem on Keeneland, but was not able to reproduce it on any other machine. Unfortunately I have not had much time to look into it recently, since I am about to move to the USA to start working at Sandia (my flights leave tomorrow). I would appreciate it if you could send me as small a reproduction case as possible.

Also, one thing to try would be different MPI versions. I had the impression that using MVAPICH instead of OpenMPI worked more reliably. In any case the issue is high on my priority list, and I hope to have time to solve it in two weeks or so.

Best regards
Christian

-------- Original Message --------

Hi Christian,

This is fairly easy to reproduce. Here is what I did to run LAMMPS on Keeneland:

1. build LAMMPS:

cd src
make yes-user-cuda
cd ../lib/cuda
(modify Makefile.common and Makefile.lammps, setting the path to CUDA to '/sw/keeneland/cuda/4.1/linux_binary'; see the note after these commands)
make lib
cd ../../src
make keeneland
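
For what it's worth, the makefile change basically just points the CUDA install path at the Keeneland CUDA tree; from memory the line looks something like this (double-check the variable name against your Makefile.common):

CUDA_INSTALL_PATH = /sw/keeneland/cuda/4.1/linux_binary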

2. Run

qlogin -V -l nodes=2:ppn=3 (request two nodes)
cd examples/USER/cuda/
mpirun -np 3 /lustre/medusa/biersdor/mylammps/src/lmp_keeneland_noinst -sf cuda -cuda on < in.melt_2.5.cuda
(works fine)

mpirun -np 6 /lustre/medusa/biersdor/mylammps/src/lmp_keeneland_noinst -sf cuda -cuda on < in.melt_2.5.cuda
(hangs at "Setting up run ...")

I'm not trying to oversubscribe the GPUs; the 6 MPI processes should be divided evenly between both nodes. Currently I'm using OpenMPI 1.5.1; I can try MPICH2 to see if I encounter the same problem.

Thanks,

- Scott

I'm not trying to oversubscribe the GPUs; the 6 MPI processes should be divided
evenly between both nodes.

your quoted output indicated it, though.
i would expect that the scheduler will
allocate entire nodes for you in any case.

Currently I'm using OpenMPI 1.5.1,

if openmpi is properly installed, you need
not use -np, but you have to use -npernode 3.

I can try MPICH2 to see if I encounter the same problem.

"MPICH is evil" :wink:

axel.

Maybe so, but using it on Keeneland does resolve this issue and allows you to run on more than one node. Christian, if you want to do this yourself, you can follow these steps to set up MPICH2:

qlogin -V -l nodes=2:ppn=3 (same as with OpenMPI)
mpdboot --totalnum=2 --ncpus=3 -f mpich_machines.txt

where mpich_machines.txt is a list of your hostnames with ':3' appended, something like:

kid008.nics.utk.edu:3
kid009.nics.utk.edu:3

mpirun -np 6 /lustre/medusa/biersdor/mylammps/src/lmp_keeneland -sf cuda -cuda on < in.melt_2.5.cuda
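
(In case it's useful: you can sanity-check the mpd ring before running and shut it down when you're done; these are the standard mpd commands, nothing Keeneland-specific.)

mpdtrace     # should list both allocated nodes
mpdallexit   # tear down the mpd ring afterwards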

- Scott

"MPICH is evil" :wink:

axel.

Maybe so, but using it on Keeneland does resolve this issue and allow you to

for the sake of completeness, would you mind checking whether the
following OpenMPI command line works as well?

mpirun --mca btl_openib_flags 1 --mca mpi_leave_pinned 0 \
       --mca btl_openib_warn_default_gid_prefix 0 \
       -np 6 -npernode 3 --hostfile ${PBS_NODEFILE}
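
(in case you don't want to type those flags every time: openmpi also
picks up MCA parameters from the environment or from
$HOME/.openmpi/mca-params.conf, e.g.

export OMPI_MCA_mpi_leave_pinned=0

should be equivalent to the --mca flag above. untested on keeneland, though.)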

thanks,
    axel.

Hi,

Yes, with those options openmpi will work as well. In particular '--mca mpi_leave_pinned 0' appears to be the key flag that allows multi-node runs.

Thanks,

- Scott

Hi

Oh wow, that should be helpful in finding the issue. Maybe there is some workaround I can use to make it run out of the box. If anyone has a suggestion as to what might be going wrong and why that option solves the issue, please feel encouraged to share any bit of wisdom now ...

Cheers
Christian
