problem with GPU run

I have built LAMMPS (18 Feb 2011 version) for the GPU package using OpenMPI, the GNU compilers, and CUDA 3.2. My test run on 2 nodes, 8 cores each, with 4 GPUs fails with

[tesla2:00561] Signal: Segmentation fault (11)
[tesla2:00561] Signal code: Address not mapped (1)

My GPU setup is:

fix 0 all gpu force/neigh 0 1 -1

and I am using pair style:

lj/cut/coul/long
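
As I read the fix gpu doc for this version, the three numeric arguments are the first and last GPU device IDs on each node and the CPU/GPU split, with -1 asking the library to balance the load between CPU and GPU dynamically:

fix 0 all gpu force/neigh 0 1 -1   # GPUs 0 and 1 on each node, dynamic split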

Below is the output from around the point where things start to go wrong:

#1
fix 1 all nvt temp 600.0 600.0 100.0
velocity all create 600 58447419
run 50000
Ewald initialization ...
  G vector = 0.182103
  vectors: actual 1d max = 11443 17 21437

---------------------------------------------------------------------
      GPU Time Info (average):
---------------------------------------------------------------------
Average split: 0.9995.
Max Mem / Proc: 0.75 MB.
---------------------------------------------------------------------

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut/coul/long:
- with 4 procs per device.
--------------------------------------------------------------------------
GPU 0: Tesla T10 Processor, 240 cores, 3.9/4 GB, 1.4 GHZ (Mixed Precision)
GPU 1: Tesla T10 Processor, 240 cores, 3.9/4 GB, 1.4 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Initializing GPUs 0-1 on core 1...Done.
Initializing GPUs 0-1 on core 2...Done.
Initializing GPUs 0-1 on core 3...Done.

Setting up run ...
Memory usage per processor = 13.2739 Mbytes
Step TotEng PotEng KinEng Temp Press Volume E_vdwl E_coul E_bond E_angle E_dihed
     192 13681.396 7337.6365 6343.7599 600 286.40435 1000000 4235.9953 3538.9173 519.48009 2274.5046 376.66896

...

   16500 19746.546 13416.608 6329.9374 598.69264 -35.471398 1000000 4463.6599 3522.1983 2997.7977 4341.7209 1707.243
   17000 19699.376 13433.31 6266.0661 592.65163 67.47609 1000000 4548.828 3522.4822 2944.6102 4314.4425 1716.4254
[tesla2:00561] *** Process received signal ***
[tesla2:00561] Signal: Segmentation fault (11)
[tesla2:00561] Signal code: Address not mapped (1)
[tesla2:00561] Failing at address: 0xfffffffe034017f0
[tesla2:00561] [ 0] /lib64/libpthread.so.0 [0x341320eb10]
[tesla2:00561] [ 1] ./lmp_openmpi-gpu(_ZN9LAMMPS_NS20PairLJCutCoulLongGPU11cpu_computeEPiiii+0x14b) [0x6d8aeb]
[tesla2:00561] [ 2] ./lmp_openmpi-gpu(_ZN9LAMMPS_NS20PairLJCutCoulLongGPU7computeEii+0x309) [0x6d97c9]
[tesla2:00561] [ 3] ./lmp_openmpi-gpu(_ZN9LAMMPS_NS6Verlet3runEi+0x195) [0x7514e5]
[tesla2:00561] [ 4] ./lmp_openmpi-gpu(_ZN9LAMMPS_NS3Run7commandEiPPc+0x284) [0x728814]
[tesla2:00561] [ 5] ./lmp_openmpi-gpu(_ZN9LAMMPS_NS5Input15execute_commandEv+0xa28) [0x63fa08]
[tesla2:00561] [ 6] ./lmp_openmpi-gpu(_ZN9LAMMPS_NS5Input4fileEv+0x3a0) [0x641300]
[tesla2:00561] [ 7] ./lmp_openmpi-gpu(main+0x4b) [0x64990b]
[tesla2:00561] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x341261d994]
[tesla2:00561] [ 9] ./lmp_openmpi-gpu(__gxx_personality_v0+0x481) [0x487d29]
[tesla2:00561] *** End of error message ***

I can send the complete output file if that will help.

System info:

[nucci@... gpu2]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

[nucci@... gpu2]$ mpicc -v
Using built-in specs.
COLLECT_GCC=/usr/global/gcc/4.5.1/bin/gcc
COLLECT_LTO_WRAPPER=/gpfs/apps/x86_64-rhel5/gcc/4.5.1/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.5.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.5.1/configure --prefix=/usr/global/gcc/4.5.1 --with-mpc=/usr/global/mpc/0.8.1 --with-mpfr=/usr/global/mpfr/2.4.2 --with-gmp=/usr/global/gmp/5.0.1 --with-ppl=/usr/global/ppl/0.10.2 --with-cloog=/usr/global/cloog-ppl/0.15.9 --disable-multilib
Thread model: posix
gcc version 4.5.1 (GCC)

[nucci@... gpu2]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221
[nucci@... gpu2]$

I'd like to know from the GPU experts whether anything obvious jumps out here. As I mentioned, I can attach the entire output file if that will help.

Thanks,

--Jeff

jeff,

it looks like your simulation system is not well behaved.
there is a significant increase in potential energy (which is
positive to boot, indicating some high-energy configuration).

does the same input work with the CPU?

axel.

Yes, this seems to work fine on the CPU. I have not yet compared the CPU and GPU results up to the point of failure to see whether they behave similarly.

--Jeff

Axel Kohlmeyer wrote:

keep in mind that your GPU compile is using single precision to
compute the individual forces, so if your input is marginal it might
work on the CPU but fail on the GPU due to overflows in the forces.

if your input is far from equilibrium, it may be advantageous to
have an all-double binary around, or to run on the CPU until the
system has straightened itself out.

cheers,
    axel.
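
In practice, the precision is a compile-time option of the GPU library. A rough sketch of the choices, assuming the lib/gpu Makefile conventions of this LAMMPS version (the exact variable name may differ in your Makefile):

# in lib/gpu/Makefile.<platform>, pick one of:
CUDA_PRECISION = -D_SINGLE_SINGLE    # all single precision
CUDA_PRECISION = -D_SINGLE_DOUBLE    # single-precision forces, double-precision accumulation
CUDA_PRECISION = -D_DOUBLE_DOUBLE    # all double precision
# then rebuild the library in lib/gpu and re-link LAMMPS in src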

A follow-up to the issue I raised a few days ago.

I recompiled for DOUBLE_DOUBLE to see whether accumulated errors in the forces were causing this. Even with DOUBLE_DOUBLE I still received the error:

[tesla2:00561] *** Process received signal ***
[tesla2:00561] Signal: Segmentation fault (11)
[tesla2:00561] Signal code: Address not mapped (1)
[tesla2:00561] Failing at address: 0xfffffffe034017f0
[tesla2:00561] [ 0] /lib64/libpthread.so.0 [0x341320eb10]
[tesla2:00561] [ 1]

Based on another conversation with Axel (where I forgot to cc: the list):

i don't see any advantage from passing work to the cpu,
and very rarely a speedup from oversubscribing. so using
one mpi task per gpu and balancing set to 1.0 is a safe setting.

p.s.: please keep the list in cc: thanks.

(first, sorry for forgetting to reply-all!)

I cut back from 16 cores across 2 nodes sharing 4 GPUs to just 4 cores across 2 nodes sharing the same 4 GPUs. Now this LAMMPS run does not generate the error message. It seems the second node did not like being oversubscribed. I will run additional tests to see whether adding more cores makes the problem resurface.

--Jeff

W. Michael Brown wrote:

I doubt there is a problem from oversubscribing the GPUs. I would be surprised if you do not see a performance improvement from using at least 2 cores per GPU. Can you try this, but keep the 1.0 setting instead of -1?

Can you also try updating your GPU lib with the code here:

http://users.nccs.gov/~wb8/gpu/download.htm

and see if the -1 option still fails?

Thanks.

- Mike
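
For concreteness, a sketch of that configuration (the fix gpu syntax is as in the original post; in.test is just a placeholder input name, and the host placement options are whatever puts 4 MPI tasks on each node):

fix 0 all gpu force/neigh 0 1 1.0             # GPUs 0 and 1 per node, fixed split of 1.0
mpirun -np 8 ./lmp_openmpi-gpu < in.test      # 8 tasks over 2 nodes / 4 GPUs = 2 cores per GPU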

I tried using the updated GPU code, but the link step failed:

pair_gayberne_gpu.o: In function `LAMMPS_NS::PairGayBerneGPU::compute(int, int)':
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_gayberne_gpu.cpp:114: undefined reference to `gb_gpu_compute(int, int, int, int, double**, int*, int*, int*, int**, bool, bool, bool, bool, int&, double, bool&, double**)'
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_gayberne_gpu.cpp:107: undefined reference to `gb_gpu_compute_n(int, int, int, int, double**, int*, double*, double*, bool, bool, bool, bool, int&, double, bool&, double**)'
pair_lj96_cut_gpu.o: In function `LAMMPS_NS::PairLJ96CutGPU::compute(int, int)':
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj96_cut_gpu.cpp:108: undefined reference to `lj96_gpu_compute(int, int, int, int, double**, int*, int*, int*, int**, bool, bool, bool, bool, int&, double, bool&)'
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj96_cut_gpu.cpp:102: undefined reference to `lj96_gpu_compute_n(int, int, int, int, double**, int*, double*, double*, int*, int**, int**, bool, bool, bool, bool, int&, double, bool&)'
pair_lj_charmm_coul_long_gpu.o: In function `LAMMPS_NS::PairLJCharmmCoulLongGPU::compute(int, int)':
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_charmm_coul_long_gpu.cpp:126: undefined reference to `crml_gpu_compute(int, int, int, int, double**, int*, int*, int*, int**, bool, bool, bool, bool, int&, double, bool&, double*)'
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_charmm_coul_long_gpu.cpp:120: undefined reference to `crml_gpu_compute_n(int, int, int, int, double**, int*, double*, double*, int*, int**, int**, bool, bool, bool, bool, int&, double, bool&, double*)'
pair_lj_cut_coul_cut_gpu.o: In function `LAMMPS_NS::PairLJCutCoulCutGPU::compute(int, int)':
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_cut_coul_cut_gpu.cpp:111: undefined reference to `ljc_gpu_compute(int, int, int, int, double**, int*, int*, int*, int**, bool, bool, bool, bool, int&, double, bool&, double*)'
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_cut_coul_cut_gpu.cpp:105: undefined reference to `ljc_gpu_compute_n(int, int, int, int, double**, int*, double*, double*, int*, int**, int**, bool, bool, bool, bool, int&, double, bool&, double*)'
pair_lj_cut_coul_long_gpu.o: In function `LAMMPS_NS::PairLJCutCoulLongGPU::compute(int, int)':
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_cut_coul_long_gpu.cpp:122: undefined reference to `ljcl_gpu_compute(int, int, int, int, double**, int*, int*, int*, int**, bool, bool, bool, bool, int&, double, bool&, double*)'
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_cut_coul_long_gpu.cpp:116: undefined reference to `ljcl_gpu_compute_n(int, int, int, int, double**, int*, double*, double*, int*, int**, int**, bool, bool, bool, bool, int&, double, bool&, double*)'
pair_lj_cut_gpu.o: In function `LAMMPS_NS::PairLJCutGPU::compute(int, int)':
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_cut_gpu.cpp:108: undefined reference to `ljl_gpu_compute(int, int, int, int, double**, int*, int*, int*, int**, bool, bool, bool, bool, int&, double, bool&)'
/gpfs/home/nucci/scratch/lammps/mylammps2/src/Obj_openmpi-gpu/pair_lj_cut_gpu.cpp:102: undefined reference to `ljl_gpu_compute_n(int, int, int, int, double**, int*, double*, double*, int*, int**, int**, bool, bool, bool, bool, int&, double, bool&)'
collect2: ld returned 1 exit status
make[1]: *** [../lmp_openmpi-gpu] Error 1
make[1]: Leaving directory `/gpfs/scratch/nucci/lammps/mylammps2/src/Obj_openmpi-gpu'
make: *** [openmpi-gpu] Error 2

I am wondering if I missed a step somewhere.

--Jeff

W. Michael Brown wrote:

Thanks for trying this out. Be sure to rebuild in lib/gpu and then run:

make yes-gpu

before building in src. - Mike
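
Spelled out, the rebuild order is roughly as follows (Makefile.linux is a guess for the platform Makefile in lib/gpu; use whichever one matches your machine):

cd lib/gpu
make -f Makefile.linux    # rebuild the GPU library from the updated sources
cd ../../src
make yes-gpu              # re-install the GPU package files into src
make openmpi-gpu          # re-link LAMMPS (target name from the build log above)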

I tried that, and now it gets worse:

mpic++ -O2 -funroll-loops -fstrict-aliasing -Wall -W -Wno-uninitialized -DLAMMPS_GZIP -DFFT_FFTW -I/usr/global/fftw/2.1.5/gnu/include/ -c pair_cg_cmm_coul_msm.cpp
In file included from /usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/mpicxx.h:288:0,
from /usr/global/openmpi/1.4.2/gnu/include/mpi.h:1886,
from pointers.h:24,
from pair.h:17,
from pair_cmm_common.h:22,
from pair_cg_cmm_coul_msm.h:23,
from pair_cg_cmm_coul_msm.cpp:19:
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:644:1: warning: unused parameter ‘oldcomm’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:644:1: warning: unused parameter ‘comm_keyval’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:644:1: warning: unused parameter ‘extra_state’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:644:1: warning: unused parameter ‘attribute_val_in’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:644:1: warning: unused parameter ‘attribute_val_out’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:671:1: warning: unused parameter ‘comm’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:671:1: warning: unused parameter ‘comm_keyval’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:671:1: warning: unused parameter ‘attribute_val’
/usr/global/openmpi/1.4.2/gnu/include/openmpi/ompi/mpi/cxx/comm_inln.h:671:1: warning: unused parameter ‘extra_state’
pair_cg_cmm_coul_msm.cpp: In destructor ‘virtual LAMMPS_NS::PairCGCMMCoulMSM::~PairCGCMMCoulMSM()’:
pair_cg_cmm_coul_msm.cpp:45:13: error: ‘class LAMMPS_NS::Memory’ has no member named ‘destroy’
pair_cg_cmm_coul_msm.cpp:46:13: error: ‘class LAMMPS_NS::Memory’ has no member named ‘destroy’
pair_cg_cmm_coul_msm.cpp:47:13: error: ‘class LAMMPS_NS::Memory’ has no member named ‘destroy’
pair_cg_cmm_coul_msm.cpp:48:13: error: ‘class LAMMPS_NS::Memory’ has no member named ‘destroy’
pair_cg_cmm_coul_msm.cpp: In member function ‘virtual void LAMMPS_NS::PairCGCMMCoulMSM::allocate()’:
pair_cg_cmm_coul_msm.cpp:62:11: error: ‘class LAMMPS_NS::Memory’ has no member named ‘create’
pair_cg_cmm_coul_msm.cpp:63:11: error: ‘class LAMMPS_NS::Memory’ has no member named ‘create’
pair_cg_cmm_coul_msm.cpp:64:11: error: ‘class LAMMPS_NS::Memory’ has no member named ‘create’
pair_cg_cmm_coul_msm.cpp:65:11: error: ‘class LAMMPS_NS::Memory’ has no member named ‘create’
pair_cg_cmm_coul_msm.cpp: In member function ‘virtual void LAMMPS_NS::PairCGCMMCoulMSM::compute(int, int)’:
pair_cg_cmm_coul_msm.cpp:115:33: warning: unused variable ‘jtype’
pair_cg_cmm_coul_msm.cpp:115:39: warning: unused variable ‘itable’
pair_cg_cmm_coul_msm.cpp:116:30: warning: unused variable ‘delx’
pair_cg_cmm_coul_msm.cpp:116:35: warning: unused variable ‘dely’
pair_cg_cmm_coul_msm.cpp:116:40: warning: unused variable ‘delz’
pair_cg_cmm_coul_msm.cpp:117:10: warning: unused variable ‘fraction’
pair_cg_cmm_coul_msm.cpp:117:19: warning: unused variable ‘table’
pair_cg_cmm_coul_msm.cpp:118:10: warning: unused variable ‘r’
pair_cg_cmm_coul_msm.cpp:118:12: warning: unused variable ‘r2inv’
pair_cg_cmm_coul_msm.cpp:118:18: warning: unused variable ‘r6inv’
pair_cg_cmm_coul_msm.cpp:118:24: warning: unused variable ‘forcecoul’
pair_cg_cmm_coul_msm.cpp:118:34: warning: unused variable ‘forcelj’
pair_cg_cmm_coul_msm.cpp:119:10: warning: unused variable ‘grij’
pair_cg_cmm_coul_msm.cpp:119:15: warning: unused variable ‘expm2’
pair_cg_cmm_coul_msm.cpp:119:21: warning: unused variable ‘prefactor’
pair_cg_cmm_coul_msm.cpp:119:31: warning: unused variable ‘t’
pair_cg_cmm_coul_msm.cpp:119:33: warning: unused variable ‘erfc’
pair_cg_cmm_coul_msm.cpp:121:10: warning: unused variable ‘rsq’
make[1]: *** [pair_cg_cmm_coul_msm.o] Error 1
make[1]: Leaving directory `/gpfs/scratch/nucci/lammps/mylammps2/src/Obj_openmpi-gpu'
make: *** [openmpi-gpu] Error 2

Is there a specific LAMMPS source version I should try this with?

--Jeff

W. Michael Brown wrote:

Yes, it needs to be at least the Mar 28th version. For the current version, you would have to re-download the gpu update tarball (updated today).

- Mike

I downloaded the 5-Apr version along with the latest GPU tarball. The build problem is solved.

Setting the split to 1.0 and running over 16 cores across 2 nodes and 4 GPUs, I was able to get the code to run beyond the point where it previously threw the segv. I am now re-running with the split set to -1 to see whether it also gets past that point. After that I'll do a full timing run to see how it performs.

I built the GPU portion with DOUBLE_DOUBLE and will also try SINGLE_DOUBLE.

Thanks for all the suggestions.

--Jeff
