USER-CUDA segfaulting when used with MPI

I've compiled the Jan 15 2012 version of LAMMPS with the USER-CUDA package using gcc
4.3.4, OpenMPI 1.4.2, CUDA 4.0, and the KISS FFT.

When trying the examples I get the following sort of behaviour
(in.melt_2.4.cuda_ is just the example file with the gpu layout modified)

[[email protected]... lammps]$ mpirun ./lmp < in.melt_2.5.cuda_
LAMMPS (15 Jan 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:396)
# CUDA: Activate GPU
# Using device 0: Tesla T10 Processor
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
# Using device 0: Tesla T10 Processor
  1 by 1 by 2 MPI processor grid
Created 256000 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 130000
atoms...
# CUDA: Using precision: Global: 4 X: 4 V: 4 F: 4 PPPM: 4
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
# CUDA: Total Device Memory useage post setup: 99.239990 MB
Memory usage per processor = 46.2874 Mbytes
Step Temp E_pair E_mol TotEng Press
       0 1.44 -6.7733681 0 -4.6133765 -5.019674
WARNING: # CUDA: You asked for a Verlet integration using Cuda, but selected a
pair force which has not yet been ported to Cuda (verlet_cuda.cpp:542)
WARNING: # CUDA: You asked for a Verlet integration using Cuda, but several
fixes have not yet been ported to Cuda.
This can cause a severe speed penalty due to frequent data synchronization
between host and GPU. (verlet_cuda.cpp:548)
     100 0.75865643 -5.7603269 0 -4.6223467 0.19585431
     200 0.75642951 -5.7572851 0 -4.6226453 0.2264133
     300 0.74927281 -5.7464029 0 -4.6224981 0.29736503
     400 0.7405083 -5.7329551 0 -4.6221969 0.37753202
     500 0.73086599 -5.7181528 0 -4.6218581 0.46923941
     600 0.72413582 -5.7078289 0 -4.6216294 0.52830792
     700 0.71599547 -5.6952684 0 -4.6212794 0.5985125
     800 0.71309998 -5.6906572 0 -4.6210114 0.63173726
     900 0.7068169 -5.6809429 0 -4.6207217 0.67903671
[ang13:15560] *** Process received signal ***
[ang13:15560] Signal: Segmentation fault (11)
[ang13:15560] Signal code: Address not mapped (1)
[ang13:15560] Failing at address: 0x2aabafece4d8
[ang13:15560] [ 0] /lib64/libpthread.so.0 [0x38ae20eb10]
[ang13:15560] [ 1]
./lmp(_ZN9LAMMPS_NS8Neighbor18half_bin_no_newtonEPNS_9NeighListE+0x2e9)
[0x649441]
[ang13:15560] [ 2] ./lmp(_ZN9LAMMPS_NS12NeighborCuda5buildEv+0x4c3) [0x64374b]
[ang13:15560] [ 3] ./lmp(_ZN9LAMMPS_NS10VerletCuda3runEi+0xb99) [0x6d6c41]
[ang13:15560] [ 4] ./lmp(_ZN9LAMMPS_NS3Run7commandEiPPc+0x786) [0x6b5cf2]
[ang13:15560] [ 5] ./lmp(_ZN9LAMMPS_NS5Input15execute_commandEv+0xdf4)
[0x616f24]
[ang13:15560] [ 6] ./lmp(_ZN9LAMMPS_NS5Input4fileEv+0x2ba) [0x61767c]
[ang13:15560] [ 7] ./lmp(main+0x5f) [0x6228d6]
[ang13:15560] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x38ad61d994]
[ang13:15560] [ 9] ./lmp(_ZNSt8ios_base4InitD1Ev+0x41) [0x47ade9]
[ang13:15560] *** End of error message ***

tyson,

two comments:

1) the input you are using requires the use of the additional
command line flag -sf cuda to actually select the /cuda styles
where available. without it you'll still be using the CPU
almost everywhere (see the example after point 2).

2) the USER-CUDA package does not benefit from oversubscribing
the GPU, so only use as many MPI tasks as you have GPUs
configured via the "package gpu" command in the lammps input.
christian and i are working on improving the USER-OMP package
so that you can use OpenMP threading on styles that have not
(yet) been ported to USER-CUDA, or that would run more
efficiently on the CPU, and make good use of otherwise idle
cpu cores this way.
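
to make 1) concrete, a minimal sketch of what the invocation could
look like for your run (binary and input names are taken from your
command line; the task count of 2 is an assumption matching one MPI
task per GPU):

  mpirun -np 2 ./lmp -sf cuda -in in.melt_2.5.cuda_

the -sf cuda switch appends the /cuda suffix to every style that has
a cuda variant, and passing the input with -in rather than shell
redirection is generally safer in parallel.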

cheers,
    axel.

Hi

yeah, as Axel said, you are not actually using the CUDA styles (as the warning at the beginning of your run states). But nevertheless this looks like some kind of incompatibility between verlet/cuda and CPU pair styles - not that anyone would want to use that combination.

Cheers
Christian

-------- Original Message --------

1) the input you are using requires the use of the additional
command line flag -sf cuda to actually select the /cuda styles
where available. without it you'll still be using the CPU
almost everywhere.

Thanks Axel. That did the trick. I'm just getting started trying this
out on our GPU cluster, and for some reason I had it in my mind that
specifying the GPU package also caused the cuda styles to be set.

2) the USER-CUDA package does not benefit from oversubscribing
the GPU, so only use as many MPI tasks as you have GPUs
configured via the "package gpu" command in the lammps input.
christian and i are working on improving the USER-OMP package
so that you can use OpenMP threading on styles that have not
(yet) been ported to USER-CUDA, or that would run more
efficiently on the CPU, and make good use of otherwise idle
cpu cores this way.

I think that should be okay. The job actually spanned two nodes on our
22-node GPU cluster (it has two gpus per node).

https://www.sharcnet.ca/help/index.php/Angel

The scheduler (moab/torque) does per gpu scheduling, and the job start
up scripts change ownership of the /dev/nvidia[01] files to give your
job access to the gpus you were allocated.

The gpus are also set in EXCLUSIVE_THREAD mode to ensure that if a
single user is allocated both gpus on a node, they use both.
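
A quick way to double-check what the prologue scripts have set up,
before launching lammps, is something along these lines (the exact
nvidia-smi flags and output strings vary between driver generations,
so treat this as a sketch):

  ls -l /dev/nvidia*                      # ownership shows which GPUs this job can open
  nvidia-smi -q | grep -i "compute mode"  # should show the exclusive (thread) mode on these nodes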

As far as I can tell, the appropriate lammps setting for this system
seems to be just

package cuda gpu/node 1

as it automatically skips over GPUs it can't access (either because the
job doesn't have access to the /dev/nvidia[01] file or the GPU is
already in use due to the EXCLUSIVE_THREAD setting).

Thanks again! -Tyson

As far as I can tell, the appropriate lammps setting for this system
seems to be just

package cuda gpu/node 1

as it automatically skips over GPUs it can't access (either because the
job doesn't have access to the /dev/nvidia[01] file or the GPU is
already in use due to the EXCLUSIVE_THREAD setting).

Actually it doesn't matter what your gpu/node setting is in this case. If GPUs are in exclusive mode the setting is neither needed nor used (at least not in the USER-CUDA package). And if you allocate too many MPI processes per node it will just exit with an error since it can't get enough GPUs.

Cheers
Christian

I went through and tried all combinations of specifying the cuda style
on the atom, pair, and fix commands in the in.melt_2.5.cuda example
file. It turns out the key is

atom_style atomic/cuda

You get a crash if and only if this style is not set to cuda.
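
For reference, the combination that runs cleanly for me looks roughly
like this (only the relevant lines of the melt input, with the /cuda
variants spelled out explicitly; with -sf cuda the suffixes get added
automatically, and the lattice/box setup lines are omitted):

  package     cuda gpu/node 1
  units       lj
  atom_style  atomic/cuda
  pair_style  lj/cut/cuda 2.5
  pair_coeff  1 1 1.0 1.0 2.5
  fix         1 all nve/cuda

The lj/cut/cuda and nve/cuda names are what I understand the USER-CUDA
variants of the melt example's styles to be called.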

I realize you want everything on the GPU, but presumably not using the
atomic/cuda style exposes a bug: at worst you should get an
"unsupported" error message, not a segmentation fault.

As well, instead of a segfault, sometimes I get the message

ERROR: Lost atoms: original 256000 current 255994 (thermo.cpp:385)

Thanks! -Tyson

PS: The reason I started digging into this is that we have a fix that we
want to port to the GPU. As a first step, I started by just compiling
in the CUDA package and trying to run it.

I presumed everything should still work (albeit not very efficiently), since
the data would be moved back to the CPU for fixes that don't use the GPU.
Instead it kept crashing.

I then proceeded to try the examples (first with our fix compiled in and
then with a completely clean tree), and they crashed too.

Hi

yes, you are right, that should get fixed, in the sense that there should be a graceful exit or something similar. I am going to look into how to solve that in the most efficient way.

Cheers
Christian
