ReaxFF Kokkos CUDA cudaDeviceSynchronize() error (cudaErrorIllegalAddress)

Dear lammps developers,

I encountered this issue while working with the ReaxFF potential. A file with the complete error message is attached, along with the scripts that cause it. The file names are self-explanatory.

All scripts work with lmp_serial and lmp_kokkos_mpi_only.

SCRIPT FILE DESCRIPTION

Randomly creating atoms in the simulation box causes the error. (in.not_working_random_O_test)

Simulating a silicon cluster that is off-center relative to the simulation box causes the same error (in.not_working_Si_cluster); see the sketch after this list.

Simulating a centered silicon cluster works. (in.working_Si_cluster)

A simulation with the entire box occupied by a silicon lattice works. (script file not attached)
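[Editor's note] For context, a minimal sketch of this kind of off-center ReaxFF Si-cluster setup. This is not the attached script: the box bounds, lattice spacing, timestep, temperature, and neighbor settings are taken from the run log further down the thread, while the sphere center and radius, mass line, qeq parameters, thermostat damping, and velocity seed are illustrative assumptions.

units           real
atom_style      charge

lattice         diamond 5.4305
region          box block -100 100 -100 100 -100 100 units box
create_box      1 box

# off-center sphere of Si atoms; center and radius are assumed values
region          cluster sphere 16 17 17 10.0 units box
create_atoms    1 region cluster
mass            1 28.0855

pair_style      reax/c NULL
pair_coeff      * * ffield.reax.SiOH Si
fix             qeq all qeq/reax 1 0.0 10.0 1e-6 reax/c

neigh_modify    every 10 delay 0 check no

velocity        all create 100.0 12345
fix             nvt1 all nvt temp 100.0 100.0 25.0
timestep        0.25
run             1000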

SYSTEM INFO

LAMMPS (16 Mar 2018)

nvcc version: V9.1.85

mpirun (Open MPI) 3.1.0

gcc (Ubuntu 6.4.0-17ubuntu1~16.04) 6.4.0 20180424

nvidia-smi v390.30

kernel release: 4.15.0-29-generic

GPU: Tesla P100 (8 GPUs)

CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz

All the best,
Ionut Nicolae.

in.not_working_Si_cluster (688 Bytes)

in.not_working_Si_cluster.error (17.9 KB)

in.working_Si_cluster_c (560 Bytes)

ffield.reax.SiOH (21.5 KB)

in.not_working_random_O.error (16.6 KB)

in.not_working_random_O (538 Bytes)

Please try your input with the latest patch version (currently 2 August 2018) and let us know if the error persists.
Thanks, Axel.

First, a correction: the Si cluster had to be centered at the origin in order to work, NOT centered in the simulation box as stated in my first email.

After applying the patches, simulations with randomly generated atoms work, and the error generated by in.not_working_Si_cluster has changed to:

"Cuda const random access View using Cuda texture memory requires Kokkos
to allocate the View's memory[ServerS:08920]". Complete error message is
attached.

Also, an update on the conditions triggering the error: if the cluster is centered at (16 16 17) or below, the simulation works; if it is centered at (16 17 17) or above, the error occurs.
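[Editor's note] If the cluster center is set through a sphere region, the only difference between the two cases would be the center coordinates, e.g. (illustrative lines; the radius is a placeholder, not taken from the attached script):

region          cluster sphere 16 16 17 10.0 units box   # works
region          cluster sphere 16 17 17 10.0 units box   # triggers the error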

Best regards,
Ionut Nicolae.

in.not_working_Si_cluster (649 Bytes)

in.not_working_Si_cluster.error (10.7 KB)

Works for me, as shown below. This is with the latest LAMMPS development version, compiled with GCC 4.9 and CUDA 9.1 on a GeForce Titan (Kepler35).

Axel.

$ ./lmp_kokkos_cuda_mpi -in in.not_working_Si_cluster -kokkos on g 1 -sf kk
LAMMPS (2 Aug 2018)
KOKKOS mode is enabled (…/kokkos.cpp:45)
using 1 GPU(s)
Created orthogonal box = (-100 -100 -100) to (100 100 100)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 5.4305 5.4305 5.4305
Created 209 atoms
Time spent = 0.020252 secs
Neighbor list info …
update every 10 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 34 34 34
2 neighbor lists, perpetual/occasional/extra = 2 0 0
(1) pair reax/c/kk, perpetual
attributes: full, newton off, ghost, kokkos_device
pair build: full/bin/ghost/kk/device
stencil: full/ghost/bin/3d
bin: kk/device
(2) fix qeq/reax/kk, perpetual, copy from (1)
attributes: full, newton off, ghost, kokkos_device
pair build: copy/kk/device
stencil: none
bin: none
Setting up Verlet run …
Unit style : real
Current step : 0
Time step : 0.25
WARNING: Fixes cannot yet send data in Kokkos communication, switching to classic communication (…/comm_kokkos.cpp:463)
Per MPI rank memory allocation (min/avg/max) = 17.64 | 17.64 | 17.64 Mbytes
Step Temp E_pair E_mol TotEng Press
0 100 -17916.132 0 -17854.131 -28.648759
1000 95.683711 -18074.194 0 -18014.869 -0.92573939
Loop time of 18.5283 on 1 procs for 1000 steps with 209 atoms

Performance: 1.166 ns/day, 20.587 hours/ns, 53.971 timesteps/s
85.2% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Thanks, the problem is (kind of) solved.

ERROR when:

mpirun -np 8 ./lammps-16Mar18/src/lmp_kokkos_cuda_mpi -in in.not_working_Si_cluster -k on g 8 -sf kk
or
mpirun ./lammps-16Mar18/src/lmp_kokkos_cuda_mpi -in in.not_working_Si_cluster -k on g 8 -sf kk   # launches one process per core = 20 processes

WORKS when:

mpirun -np 1 ./lammps-16Mar18/src/lmp_kokkos_cuda_mpi -in in.not_working_Si_cluster -k on g 8 -sf kk
or
./lammps-16Mar18/src/lmp_kokkos_cuda_mpi -in in.not_working_Si_cluster -k on g 1 -sf kk

All the best,
Ionut Nicolae.

Doing this makes no sense. Your input is tiny. Even running with 1 GPU is slower than running the same on a decent CPU. With more MPI tasks, most of them will get empty volumes under the standard domain decomposition.
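[Editor's note] To illustrate the decomposition point: with the default brick decomposition, 8 MPI ranks cut the 200 x 200 x 200 Angstrom box into a 2 x 2 x 2 grid of sub-domains, so a small off-center cluster leaves most ranks with no atoms. One possible mitigation is recursive-bisection load balancing, sketched below; whether this combination is supported by the Kokkos/GPU build of that LAMMPS version is an assumption that would need to be verified.

comm_style      tiled
balance         1.1 rcb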

Axel.

It's a test-only script, obviously.

With a 45000-atom cluster (60 angstrom radius) and 1000 steps, mpirun -np 8 gives 30 s total wall time, while mpirun -np 1 gives 50 s, almost double. See the attached file.

All the best,
Ionut

mpirun 1 vs 8 test.txt (5.28 KB)