MPI_Send: Internal MPI error with USER-CUDA

Hi all,

I'm trying to run LAMMPS with the USER-CUDA package on a new GPU cluster. When I run it on a single node everything works fine, but when I submit a job spanning multiple nodes I start getting errors like these:

time mpirun -lsf -prot /home/ucl/eisuc011/public/lmp_emerald -cuda on -in in.cuda.rhodo
[... skipped ...]
Host 0 -- ip xx.xx.xx.xx -- ranks 0 - 2
Host 1 -- ip xx.xx.xx.xx -- ranks 3 - 5
[... skipped ...]

  Prot - All Intra-node communication is: SHM
  Prot - All Inter-node communication is: IBV

LAMMPS (29 Jun 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:396)
# CUDA: Activate GPU
Scanning data file ...
   4 = max bonds/atom
   18 = max angles/atom
# Using device 0: Tesla M2090
   40 = max dihedrals/atom
   4 = max impropers/atom
Reading data file ...
   orthogonal box = (-27.5 -38.5 -36.2676) to (27.5 38.5 36.2645)
   1 by 3 by 2 MPI processor grid
# Using device 2: Tesla M2090
# Using device 1: Tesla M2090
[... skipped ...]
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 46933 atoms...
# CUDA: Using precision: Global: 4 X: 4 V: 4 F: 4 PPPM: 4
Setting up run ...
lmp_emerald: Rank 0:5: MPI_Wait: ibv_reg_mr() failed: addr 0x7fbaf3a77000, len 780864
lmp_emerald: Rank 0:5: MPI_Wait: Internal MPI error
lmp_emerald: Rank 0:4: MPI_Send: ibv_reg_mr() failed: addr 0x7f13cf8af000, len 780864
lmp_emerald: Rank 0:4: MPI_Send: Internal MPI error
MPI Application rank 4 exited before MPI_Finalize() with status 16

Any ideas what can be wrong?

Cheers,

Lev.

Hi all

someone else ran into the same problem recently and one thing seemed to help: adding "pinned=0" to the "package cuda" command in your script.

Could you try whether that helps for you as well? If it does, I have the terrible feeling that the way GPU-Direct used to work is now broken.

Also, please provide a full set of information regarding your hardware and software, in particular: exact MPI version, InfiniBand adapter vendor and driver version, Linux kernel, GPU driver, and CUDA version.

Bit of background: both GPUs and IB cards can do DMA access to memory. For that, the memory has to be pinned. Unfortunately, only one device can get pinned access to a given piece of memory. That's where GPU-Direct (the first version, not GPU-Direct 2, which is all about intra-node communication) comes into play. NVIDIA, Mellanox, QLogic and the MPI vendors (mainly MVAPICH and then OpenMPI) came up with a way to allow both the GPU and the IB adapter to access the same pinned memory. This allows data to be sent from a GPU on one node to a GPU on another node without that data being buffered by the CPU. Unfortunately, a kernel patch is (was?) required for that.
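
To make the conflict concrete, here is a rough sketch of the pattern (not the actual USER-CUDA code; the buffer names are made up and the size is just taken from the error message above for illustration). A host staging buffer is pinned by the CUDA driver and then handed directly to MPI, which in turn tries to register the same pages with ibv_reg_mr() for RDMA:

#include <cuda_runtime.h>
#include <mpi.h>

/* Sketch only: a CUDA-pinned host staging buffer used directly as an MPI
 * send/receive buffer. The ibv_reg_mr() failure in the log above happens
 * inside MPI when it tries to register (pin) this same buffer for RDMA. */
int main(int argc, char **argv)
{
    const size_t nbytes = 780864;   /* length reported by ibv_reg_mr() above */
    int rank;
    char *d_buf, *h_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, nbytes);
    /* CUDA pins these host pages so the GPU's DMA engine can access them. */
    cudaHostAlloc((void **)&h_buf, nbytes, cudaHostAllocDefault);

    if (rank == 0) {
        cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);
        /* Inter-node send: MPI now wants to pin h_buf for RDMA as well. */
        MPI_Send(h_buf, (int)nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}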

If GPU-Direct was not available, it used to work anyway. If I remember correctly, when the GPU had already pinned the communication buffer, the IB adapter would fall back to regular memory access instead of its DMA engine.
Since that was slower than letting the IB adapter pin the memory and using standard non-DMA transfers on the host-GPU side, I added an option to avoid GPU pinning of memory: "pinned=0".
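
In code terms, the difference between the two settings is roughly the following (again only a sketch of the idea, not the actual USER-CUDA implementation):

#include <cuda_runtime.h>
#include <stdlib.h>

/* pinned=1 (default): CUDA pins the staging buffer, so host<->GPU copies can
 * use the DMA engine, but the IB adapter may then fail to register the same
 * pages for RDMA (the ibv_reg_mr() errors above). */
static void *alloc_staging_pinned(size_t nbytes)
{
    void *buf = NULL;
    cudaHostAlloc(&buf, nbytes, cudaHostAllocDefault);
    return buf;
}

/* pinned=0: plain pageable memory; host<->GPU copies are slower because the
 * CUDA driver has to stage them internally, but the MPI/IB layer is free to
 * pin the pages itself for RDMA. */
static void *alloc_staging_pageable(size_t nbytes)
{
    return malloc(nbytes);
}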

If this really is the solution, then either something is wrong with some combination of IB driver, GPU-Direct implementation, and MPI version, or they changed the way this has to be implemented.

Otherwise, internal MPI errors are a bit hard for me to debug, but maybe we could cooperate with some of the MPI developers on that.

Regards
Christian

-------- Original Message --------

Hi again

I was finally able to reproduce the issue on one of the machines I have access to. It was just upgraded to CUDA 4.2 and the latest drivers.
I only see this issue with OpenMPI right now; with MVAPICH2 1.8 it works.

I am going to investigate further.

Christian

-------- Original Message --------

Hi,

someone else ran into the same problem recently and one thing seemed to help:
adding "pinned=0" to the "package cuda" command in your script.

Could you try whether that helps for you as well? If it does, I have the
terrible feeling that the way GPU-Direct used to work is now broken.

This does help, thanks.

Also, please provide a full set of information regarding your hardware and
software, in particular: exact MPI version, InfiniBand adapter vendor and
driver version, Linux kernel, GPU driver, and CUDA version.

MPI version: Platform MPI 08.01.01.00 [9535] Linux x86-64
InfiniBand adapter: Mellanox Technologies MT26438 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR /
10GigE Virtualization+] (rev b0)
Linux kernel: 2.6.32-279.el6.x86_64 (vanilla Red Hat kernel, I believe)
GPU driver: NVIDIA driver version 295.33 with NVIDIA Corporation Tesla M2090 (rev a1) boards
CUDA version: 4.2.9

If this really is the solution, then either something is wrong with some
combination of IB driver, GPU-Direct implementation, and MPI version, or they
changed the way this has to be implemented.

Can you please point me in the direction of the required patch? I'm not the
person running this cluster, but I'll probably ask the admins to apply the
fixes, since everybody seems to be interested in this.

Thanks a lot!

Lev.

Hi

An issue with LAMMPS stalling or crashing at the beginning of a run has now turned up on a number of clusters when using the USER-CUDA package. Two people on the LAMMPS list reported it, and I have now seen it on a system at Sandia and on one other system I have access to.
The common factor is that all of them were running 29x.xx NVIDIA drivers on InfiniBand clusters with OpenMPI. The issue goes away if one does not use pinned memory for the communication buffers. This can be done by adding the option "pinned 0" to the package cuda command, e.g.:

"package cuda gpu/node 2 pinned 0"

Also, the problem does not seem to occur when using MVAPICH2 1.8.

I am going to look into it, but I will probably need help from the OpenMPI developers and from NVIDIA to figure out what is going on.
While it might be some latent bug in my code, it could also be an issue within OpenMPI when one tries to use GPU-Direct features (as I do for the USER-CUDA package). The GPU package is not affected, since its communication buffers are constructed on the host side.

If someone else has the same problem, please let me know. Also, please let me know whether the two workarounds mentioned here ("pinned 0" and MVAPICH2) work for you.

best regards
Christian

-------- Original Message --------

Hi Chris,

I found the following useful post from Massimiliano Fatica:
http://cudamusing.blogspot.it/2011/08/cuda-mpi-and-infiniband.html

... unfortunately, setting the CUDA_NIC_INTEROP=1 environment variable
did NOT solve the problem on our cluster. The problem is present with OpenMPI
1.4.x, 1.5.x, and 1.6, both with CUDA 3.2 and 4.x, and with the InfiniBand adapter
Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE].

Any good news about it?

Thanks in advance,

luca

Hi

according to an OpenMPI developer this is expected behaviour, since OpenMPI is not supposed to work on memory pinned by some other device. I am not sure how reliable that information is. I'd recommend sticking to MVAPICH for now.

Best regards
Christian

-------- Original Message --------

Hi all,

I've found a workaround for Platform MPI. It appears that if you set these environment variables:

export PMPI_GPU_AWARE=1
export MPI_RDMA_REPIN=1

then everything starts working. The first one is probably not strictly required; for the second one, the Platform MPI release notes say this:

"Controls the MPI buffer optimization for RDMA messaging. If set, Platform MPI checks if the buffer can be pinned for RDMA messaging. If the buffer cannot be pinned (for example, for GPU usage), PMPI will use an RDMA protocol that will not attempt to "repin" the MPI buffer."

However, on my trivial test problems, LAMMPS appears to be slower with these variables set and pinned=1 than without them and with pinned=0.

Hope this helps.

With best regards,

Lev.