LAMMPS + ReaxFF + Kokkos

Hi,

For some reason, one of my structures throws a floating-point exception when run with Kokkos in the July 2025 LAMMPS release and later (I also tried the latest 11 February version), but not in the Jun 2024 release. The script and the structure are exactly the same.

The simulation is a CHNO structure with the 2018 ReaxFF potential and QEq turned on. I am attaching the structure file and the input files. It would be great if someone could help me with this issue. I was running on an NVIDIA A30 GPU. ReaxFF also runs without issues on a CPU.

structure.data (162.0 KB)

in.reaxQEQ (792 Bytes)

ffield (22.2 KB)

@Rushik_Desai I cannot help you with reproducing and debugging your issue, but you could help people that can (like @stamoor) by also providing your exact compilation configuration settings and ideally also the configuration summary that CMake outputs at the end of the configuration run.

Furthermore, it would be useful if you could copy the exact error message output plus a few lines of context.


Hi Axel,
Thank you for your response, and sorry for the delay. I realized that my 2024 LAMMPS was not compiled with CUDA, so it was different from my 2025 installation. My question then is: why does it fail on the GPU and not on the CPU? Here are the flags I pass to cmake for my builds (the 2025 and 2026 deployments):

cmake ../cmake/
-D CMAKE_C_COMPILER=gcc-11
-D CMAKE_CXX_COMPILER=g++-11
-D NEQUIP_AOT_COMPILE=ON
-D CMAKE_PREFIX_PATH=$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')
-D MKL_INCLUDE_DIR=/usr/include/mkl
-D PKG_KOKKOS=ON
-D Kokkos_ENABLE_CUDA=ON
-D PKG_OPENMP=yes
-D Kokkos_ENABLE_OPENMP=yes
-D Kokkos_ARCH_AMPERE80=ON
-D Kokkos_ARCH_HOSTARCH=ON
-D PKG_MOLECULE=yes
-D PKG_KSPACE=yes
-D PKG_ML-ALLEGRO=yes
-D PKG_PYTHON=yes
-D PKG_REAXFF=yes
-D PYTHON_EXECUTABLE=$(which python)

The error is this:

[gilbreth-fe01:3413779:0:3413779] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:3413779) ====
 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x154020ac6fc4]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x154020acafec]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x252aa) [0x154020acb2aa]
 3  lmp(+0x6f140f) [0x56328336440f]
 4  lmp(+0x6f1731) [0x563283364731]
 5  lmp(+0x701c4b) [0x563283374c4b]
 6  lmp(+0x66c3f8) [0x5632832df3f8]
 7  lmp(+0xab70f5) [0x56328372a0f5]
 8  lmp(+0x3e38b2) [0x5632830568b2]
 9  lmp(+0x2b97cc) [0x563282f2c7cc]
10  lmp(+0x2b9f0e) [0x563282f2cf0e]
11  lmp(+0x208781) [0x563282e7b781]
12  /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1540ad130d90]
13  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1540ad130e40]
14  lmp(+0x2ad2d5) [0x563282f202d5]
=================================
[gilbreth-fe01:3413779] *** Process received signal ***
[gilbreth-fe01:3413779] Signal: Floating point exception (8)
[gilbreth-fe01:3413779] Signal code:  (-6)
[gilbreth-fe01:3413779] Failing at address: 0xea00600341713
[gilbreth-fe01:3413779] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x1540ad149520]
[gilbreth-fe01:3413779] [ 1] lmp(+0x6f140f)[0x56328336440f]
[gilbreth-fe01:3413779] [ 2] lmp(+0x6f1731)[0x563283364731]
[gilbreth-fe01:3413779] [ 3] lmp(+0x701c4b)[0x563283374c4b]
[gilbreth-fe01:3413779] [ 4] lmp(+0x66c3f8)[0x5632832df3f8]
[gilbreth-fe01:3413779] [ 5] lmp(+0xab70f5)[0x56328372a0f5]
[gilbreth-fe01:3413779] [ 6] lmp(+0x3e38b2)[0x5632830568b2]
[gilbreth-fe01:3413779] [ 7] lmp(+0x2b97cc)[0x563282f2c7cc]
[gilbreth-fe01:3413779] [ 8] lmp(+0x2b9f0e)[0x563282f2cf0e]
[gilbreth-fe01:3413779] [ 9] lmp(+0x208781)[0x563282e7b781]
[gilbreth-fe01:3413779] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x1540ad130d90]
[gilbreth-fe01:3413779] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x1540ad130e40]
[gilbreth-fe01:3413779] [12] lmp(+0x2ad2d5)[0x563282f202d5]
[gilbreth-fe01:3413779] *** End of error message ***
Floating point exception (core dumped)

Now I know this might be my structure, but then why is there a discrepancy between the CPU and GPU versions? Is it related to memory?

Sorry if any of the above seems a bit silly, I am relatively new to using Kokkos and REAX.

Regards,
Rushik

The error you have happens in a low-level library that is called from MPI. There are different code paths for communication in KOKKOS when you use the CPU and the GPU.

Can you try to add -pk kokkos gpu/aware off to your LAMMPS command line and report back if that makes a difference?

Still the same error.

(1) Start by adding -D CMAKE_BUILD_TYPE=Debug to get better stack traces.

(2) Try removing -D PKG_OPENMP=yes and -D Kokkos_ENABLE_OPENMP=yes. Someone else ([BUG] Kokkos package breaks with fix langevin · Issue #4912 · lammps/lammps · GitHub) just fixed their problem with another ML package by turning OpenMP off when CUDA is on.

(3) gcc 12 or 13 might work better?

Thank you for this suggestion, this is the exact error now:

Kokkos contract violation:
Asserted condition block_size > 0 evaluated false.
Error at "/opt/lammps_source/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp":98

Backtrace:
Kokkos contract violation:
Asserted condition block_size > 0 evaluated false.
Error at "/opt/lammps_source/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp":98

Backtrace:
[0x55bd872e8059]
[0x55bd872a457e]
[0x55bd845ed427]
[0x55bd84ac8414]
[0x55bd84abde34]
[0x55bd84aaceb9]
[0x55bd84aa6d31]
[0x55bd84a73c63]
[0x55bd849c1ce7]
[0x55bd849bb1e2]
[0x55bd85073fc8]
[0x55bd845bc118]
[0x55bd84399cde]
[0x55bd84395f8b]
[0x55bd84391d4e]
[0x149c11130d90]
[0x149c11130e40] __libc_start_main
[0x55bd84391bb5]
[gilbreth-b003:1367056] *** Process received signal ***
[gilbreth-b003:1367056] Signal: Aborted (6)
[gilbreth-b003:1367056] Signal code:  (-6)
[gilbreth-b003:1367056] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x149c11149520]
[gilbreth-b003:1367056] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x149c1119d9fc]
[gilbreth-b003:1367056] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x149c11149476]
[gilbreth-b003:1367056] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x149c1112f7f3]
[gilbreth-b003:1367056] [ 4] lmp(+0x3088592)[0x55bd872a4592]
[gilbreth-b003:1367056] [ 5] lmp(+0x3d1427)[0x55bd845ed427]
[gilbreth-b003:1367056] [ 6] lmp(+0x8ac414)[0x55bd84ac8414]
[gilbreth-b003:1367056] [ 7] lmp(+0x8a1e34)[0x55bd84abde34]
[gilbreth-b003:1367056] [ 8] lmp(+0x890eb9)[0x55bd84aaceb9]
[gilbreth-b003:1367056] [ 9] lmp(+0x88ad31)[0x55bd84aa6d31]
[gilbreth-b003:1367056] [10] lmp(+0x857c63)[0x55bd84a73c63]
[gilbreth-b003:1367056] [11] lmp(+0x7a5ce7)[0x55bd849c1ce7]
[gilbreth-b003:1367056] [12] lmp(+0x79f1e2)[0x55bd849bb1e2]
[gilbreth-b003:1367056] [13] lmp(+0xe57fc8)[0x55bd85073fc8]
[gilbreth-b003:1367056] [14] lmp(+0x3a0118)[0x55bd845bc118]
[gilbreth-b003:1367056] [15] lmp(+0x17dcde)[0x55bd84399cde]
[gilbreth-b003:1367056] [16] lmp(+0x179f8b)[0x55bd84395f8b]
[gilbreth-b003:1367056] [17] lmp(+0x175d4e)[0x55bd84391d4e]
[gilbreth-b003:1367056] [18] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x149c11130d90]
[gilbreth-b003:1367056] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x149c11130e40]
[gilbreth-b003:1367056] [20] lmp(+0x175bb5)[0x55bd84391bb5]
[gilbreth-b003:1367056] *** End of error message ***
[0x55c426116059]
[0x55c4260d257e]
[0x55c42341b427]
[0x55c4238f6414]
[0x55c4238ebe34]
[0x55c4238daeb9]
[0x55c4238d4d31]
[0x55c4238a1c63]
[0x55c4237efce7]
[0x55c4237e91e2]
[0x55c423ea1fc8]
[0x55c4233ea118]
[0x55c4231c7cde]
[0x55c4231c3f8b]
[0x55c4231bfd4e]
[0x14a3c901dd90]
[0x14a3c901de40] __libc_start_main
[0x55c4231bfbb5]
[gilbreth-b003:1367055] *** Process received signal ***
[gilbreth-b003:1367055] Signal: Aborted (6)
[gilbreth-b003:1367055] Signal code:  (-6)
[gilbreth-b003:1367055] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x14a3c9036520]
[gilbreth-b003:1367055] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x14a3c908a9fc]
[gilbreth-b003:1367055] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x14a3c9036476]
[gilbreth-b003:1367055] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x14a3c901c7f3]
[gilbreth-b003:1367055] [ 4] lmp(+0x3088592)[0x55c4260d2592]
[gilbreth-b003:1367055] [ 5] lmp(+0x3d1427)[0x55c42341b427]
[gilbreth-b003:1367055] [ 6] lmp(+0x8ac414)[0x55c4238f6414]
[gilbreth-b003:1367055] [ 7] lmp(+0x8a1e34)[0x55c4238ebe34]
[gilbreth-b003:1367055] [ 8] lmp(+0x890eb9)[0x55c4238daeb9]
[gilbreth-b003:1367055] [ 9] lmp(+0x88ad31)[0x55c4238d4d31]
[gilbreth-b003:1367055] [10] lmp(+0x857c63)[0x55c4238a1c63]
[gilbreth-b003:1367055] [11] lmp(+0x7a5ce7)[0x55c4237efce7]
[gilbreth-b003:1367055] [12] lmp(+0x79f1e2)[0x55c4237e91e2]
[gilbreth-b003:1367055] [13] lmp(+0xe57fc8)[0x55c423ea1fc8]
[gilbreth-b003:1367055] [14] lmp(+0x3a0118)[0x55c4233ea118]
[gilbreth-b003:1367055] [15] lmp(+0x17dcde)[0x55c4231c7cde]
[gilbreth-b003:1367055] [16] lmp(+0x179f8b)[0x55c4231c3f8b]
[gilbreth-b003:1367055] [17] lmp(+0x175d4e)[0x55c4231bfd4e]
[gilbreth-b003:1367055] [18] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x14a3c901dd90]
[gilbreth-b003:1367055] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x14a3c901de40]
[gilbreth-b003:1367055] [20] lmp(+0x175bb5)[0x55c4231bfbb5]
[gilbreth-b003:1367055] *** End of error message ***

That doesn’t look like a stack trace from a build with -D CMAKE_BUILD_TYPE=Debug. Make sure that wherever you’re running lmp, it can see the debug symbols from the executable.

I did, but maybe I was not running it correctly. Here is the correct version with more info:

Kokkos contract violation:

    Asserted condition `block_size > 0` evaluated false.

Error at "/scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp":94 



Backtrace:

Kokkos contract violation:

    Asserted condition `block_size > 0` evaluated false.

Error at "/scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp":94 



Backtrace:

[0x245c680] 

[0x24202b0] 

[0x5fcb01] 

[0x948aa4] 

[0x93efdc] 

[0x934114] 

[0x92f087] 

[0x8fcf43] [0x245c680] 

[0x24202b0] 

[0x5fcb01] 

[0x948aa4] 

[0x93efdc] 

[0x934114] 

[0x92f087] 

[0x8fcf43] 

[0x8739da] 

[0x86d700] 

[0xda61cb] 

[0x5d5af5] 

[0x41fde0] 

[0x41c3b3] 

[0x419c6f] 

[0x15554ecd9610] 

[0x15554ecd96c0] __libc_start_main

[0x419b35] 



[0x8739da] 

[0x86d700] 

[0xda61cb] 

[0x5d5af5] 

[0x41fde0] 

[0x41c3b3] 

[0x419c6f] 

[0x15554ecd9610] 

[0x15554ecd96c0] __libc_start_main

[0x419b35] 



Thread 1 "lmp" received signal SIGABRT, Aborted.



Thread 1 "lmp" received signal SIGABRT, Aborted.

0x000015554ed3c02c in __pthread_kill_implementation () from /lib64/libc.so.6

0x000015554ed3c02c in __pthread_kill_implementation () from /lib64/libc.so.6

#0  0x000015554ed3c02c in __pthread_kill_implementation () from /lib64/libc.so.6

#0  0x000015554ed3c02c in __pthread_kill_implementation () from /lib64/libc.so.6

#1  0x000015554eceeb86 in raise () from /lib64/libc.so.6

#1  0x000015554eceeb86 in raise () from /lib64/libc.so.6

#2  0x000015554ecd8873 in abort () from /lib64/libc.so.6

#2  0x000015554ecd8873 in abort () from /lib64/libc.so.6

#3  0x00000000024202c4 in Kokkos::Impl::host_abort (message=0x2505c60 "Kokkos contract violation:\n    Asserted condition `block_size > 0` evaluated false.\nError at \"/scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp\":94 \n") at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/impl/Kokkos_Abort.cpp:40

#4  0x00000000005fcb01 in Kokkos::abort (message=0x2505c60 "Kokkos contract violation:\n    Asserted condition `block_size > 0` evaluated false.\nError at \"/scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp\":94 \n") at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Abort.hpp:97

#3  0x00000000024202c4 in Kokkos::Impl::host_abort (message=0x2505c60 "Kokkos contract violation:\n    Asserted condition `block_size > 0` evaluated false.\nError at \"/scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp\":94 \n") at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/impl/Kokkos_Abort.cpp:40

#4  0x00000000005fcb01 in Kokkos::abort (message=0x2505c60 "Kokkos contract violation:\n    Asserted condition `block_size > 0` evaluated false.\nError at \"/scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp\":94 \n") at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Abort.hpp:97

#5  0x0000000000948aa4 in Kokkos::Impl::ParallelFor<LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1>, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::Cuda>::execute (this=0x7fffffff6f20) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp:94

#5  0x0000000000948aa4 in Kokkos::Impl::ParallelFor<LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1>, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::Cuda>::execute (this=0x7fffffff6f20) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp:94

#6  0x000000000093efdc in Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Cuda>, LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1>, void> (str="", policy=..., functor=...) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Parallel.hpp:144

#6  0x000000000093efdc in Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Cuda>, LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1>, void> (str="", policy=..., functor=...) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Parallel.hpp:144

#7  0x0000000000934114 in Kokkos::parallel_for<LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1> > (str="", work_count=10694, functor=...) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Parallel.hpp:165

#7  0x0000000000934114 in Kokkos::parallel_for<LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1> > (str="", work_count=10656, functor=...) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Parallel.hpp:165

#8  0x000000000092f087 in Kokkos::parallel_for<LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1> > (work_count=10656, functor=...) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Parallel.hpp:170

#8  0x000000000092f087 in Kokkos::parallel_for<LAMMPS_NS::NPairKokkosBuildFunctorGhost<Kokkos::Cuda, 1> > (work_count=10694, functor=...) at /scratch/gilbreth/desai224/lammps_27Jun2024/lib/kokkos/core/src/Kokkos_Parallel.hpp:170

#9  0x00000000008fcf43 in LAMMPS_NS::NPairKokkos<Kokkos::Cuda, 1, 0, 1, 0, 0>::build (this=0xfdcd540, list_=0xfdcc1c0) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/npair_kokkos.cpp:264

#9  0x00000000008fcf43 in LAMMPS_NS::NPairKokkos<Kokkos::Cuda, 1, 0, 1, 0, 0>::build (this=0xfdae0c0, list_=0xfdacd40) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/npair_kokkos.cpp:264

#10 0x00000000008739da in LAMMPS_NS::NeighborKokkos::build_kokkos<Kokkos::Cuda> (this=0xdbfb950, topoflag=1) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/neighbor_kokkos.cpp:312

#10 0x00000000008739da in LAMMPS_NS::NeighborKokkos::build_kokkos<Kokkos::Cuda> (this=0xdbfd6c0, topoflag=1) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/neighbor_kokkos.cpp:312

#11 0x000000000086d700 in LAMMPS_NS::NeighborKokkos::build (this=0xdbfb950, topoflag=1) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/neighbor_kokkos.cpp:232

#11 0x000000000086d700 in LAMMPS_NS::NeighborKokkos::build (this=0xdbfd6c0, topoflag=1) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/neighbor_kokkos.cpp:232

#12 0x0000000000da61cb in LAMMPS_NS::VerletKokkos::setup (this=0xde58fa0, flag=1) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/verlet_kokkos.cpp:110

#12 0x0000000000da61cb in LAMMPS_NS::VerletKokkos::setup (this=0xde57510, flag=1) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/KOKKOS/verlet_kokkos.cpp:110

#13 0x00000000005d5af5 in LAMMPS_NS::Run::command (this=0xde9f7c0, narg=1, arg=0xdeae9b0) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/run.cpp:171

#13 0x00000000005d5af5 in LAMMPS_NS::Run::command (this=0xde95370, narg=1, arg=0xde94a90) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/run.cpp:171

#14 0x000000000041fde0 in LAMMPS_NS::Input::execute_command (this=0xc828190) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/input.cpp:868

#14 0x000000000041fde0 in LAMMPS_NS::Input::execute_command (this=0xc8281d0) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/input.cpp:868

#15 0x000000000041c3b3 in LAMMPS_NS::Input::file (this=0xc828190) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/input.cpp:313

#15 0x000000000041c3b3 in LAMMPS_NS::Input::file (this=0xc8281d0) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/input.cpp:313

#16 0x0000000000419c6f in main (argc=9, argv=0x7fffffff9f68) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/main.cpp:77

#16 0x0000000000419c6f in main (argc=9, argv=0x7fffffff9f68) at /scratch/gilbreth/desai224/lammps_27Jun2024/src/main.cpp:77

Does this help understand the error?

Yes, this will help @stamoor, now that you know your problem is in NPairKokkosBuildFunctorGhost<Kokkos::Cuda,1>.

Thanks, I will wait for him to review this!

He might be away now, and I’m not sure until when, so I had a quick look while waiting for something else.

I see this in the code:

    if (GHOST) {
      // assumes newton off

      NPairKokkosBuildFunctorGhost<DeviceType,HALF> f(data,atoms_per_bin * 5 * sizeof(double) * factor);

// temporarily disable team policy for ghost due to known bug

//#ifdef LMP_KOKKOS_GPU
//      if (ExecutionSpaceFromDevice<DeviceType>::space == Device) {
//        int team_size = atoms_per_bin*factor;
//        int team_size_max = Kokkos::TeamPolicy<DeviceType>(team_size,Kokkos::AUTO).team_size_max(f,Kokkos::ParallelForTag());
//        if (team_size <= team_size_max) {
//          Kokkos::TeamPolicy<DeviceType> config((mbins+factor-1)/factor,team_size);
//          Kokkos::parallel_for(config, f);
//        } else { // fall back to flat method
//          f.sharedsize = 0;
//          Kokkos::parallel_for(nall, f);
//        }
//      } else
//        Kokkos::parallel_for(nall, f);
//#else
      Kokkos::parallel_for(nall, f);
//#endif

The “temporarily disable” comment is from 3 years ago, so if you’re using the latest stable, release, or develop version you’re OK.

The other thing I see is “assumes newton off”, so make sure you have this in your script:

newton off
package kokkos newton off

Also try package kokkos newton off neigh full, so that you use a different code path, NPairKokkosBuildFunctorGhost<Kokkos::Cuda,0> instead of NPairKokkosBuildFunctorGhost<Kokkos::Cuda,1>, which might help.

Also try package kokkos newton off neigh/thread off and package kokkos newton off neigh full neigh/thread off.

Try all possibilities and report back.

ReaxFF requires newton to be on (or so LAMMPS is telling me), so I think none of these commands will work with my script. I can check whether there is a way to run ReaxFF without newton on.

Are you using -k on g 1 -sf kk when launching LAMMPS? If not, it could be using the legacy pair_style reaxff instead of pair_style reaxff/kk.

Also, do you really need -D NEQUIP_AOT_COMPILE=ON and -D PKG_ML-ALLEGRO=yes (not part of LAMMPS) if you’re just running ReaxFF?

I am using it, and I need NequIP and Allegro for that version. I also compiled another version without any of that (only the REAXFF, KOKKOS, and MOLECULE packages), and it still has the same issue.

@Rushik_Desai you may try the following change to src/KOKKOS/npair_kokkos.cpp

diff --git a/src/KOKKOS/npair_kokkos.cpp b/src/KOKKOS/npair_kokkos.cpp
index 07fb8d1d77..f1bc4a9b90 100644
--- a/src/KOKKOS/npair_kokkos.cpp
+++ b/src/KOKKOS/npair_kokkos.cpp
@@ -269,6 +269,7 @@ void NPairKokkos<DeviceType,HALF,NEWTON,GHOST,TRI,SIZE>::build(NeighList *list_)
 //      } else
 //        Kokkos::parallel_for(nall, f);
 //#else
+      f.sharedsize = 0;
       Kokkos::parallel_for(nall, f);
 //#endif
     } else {

This initializes a valid sharedsize for the functor. With this change, the run with -k on g 1 goes through on my end and gives results consistent with the CPU-only version. In fact, uncommenting the whole section makes it work as well, which is what suggested setting f.sharedsize to me. Without this setting, the backtrace goes deep into the Kokkos library source code, ending at invalid kernel launch parameters (block_size being zero), as seen in your earlier post.
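
To make this failure mode more concrete, here is a minimal, self-contained C++ sketch of how an oversized shared-memory request can leave a launcher with no valid block size. It is a toy model under stated assumptions, not the Kokkos implementation: `LaunchLimits`, `ghost_shmem_request`, `pick_block_size`, and `blocks_for_range` are hypothetical names, and the limits used in the usage note (1024 threads and 48 KB of shared memory per block) are illustrative values, not properties read from the A30. Only the request formula `atoms_per_bin * 5 * sizeof(double) * factor` is taken from the constructor call quoted earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only: hypothetical names and limits, not Kokkos internals.
struct LaunchLimits {
  int max_threads_per_block;        // illustrative, e.g. 1024
  std::size_t max_shmem_per_block;  // illustrative, e.g. 48 * 1024 bytes
};

// Shared-memory request in bytes, using the formula from the quoted
// NPairKokkosBuildFunctorGhost constructor call.
std::size_t ghost_shmem_request(int atoms_per_bin, int factor) {
  assert(atoms_per_bin >= 0 && factor >= 0);
  return static_cast<std::size_t>(atoms_per_bin) * 5 * sizeof(double) *
         static_cast<std::size_t>(factor);
}

// Return a block size whose shared-memory request fits, or 0 if none does.
// In this toy model the request does not depend on the block size, so the
// answer is either the maximum block size or 0.
int pick_block_size(const LaunchLimits& lim, std::size_t shmem_request) {
  assert(lim.max_threads_per_block > 0);
  if (shmem_request > lim.max_shmem_per_block) return 0;
  return lim.max_threads_per_block;
}

// Number of blocks needed to cover work_count items.
// Returns -1 for a non-positive block size instead of dividing by zero;
// the Debug build in the thread aborts on a `block_size > 0` check at
// the corresponding point.
int blocks_for_range(int work_count, int block_size) {
  if (block_size <= 0) return -1;
  return (work_count + block_size - 1) / block_size;
}
```

With `LaunchLimits lim{1024, 48 * 1024}`, calling `pick_block_size(lim, ghost_shmem_request(2000, 1))` returns 0 in this model because the request exceeds the per-block limit, whereas `pick_block_size(lim, 0)` returns 1024. That mirrors the effect of setting `f.sharedsize = 0` before the flat `parallel_for`, which does not need the shared-memory scratch space. Whether the earlier "integer divide by zero" in the non-Debug build came from the same zero block size is plausible but not confirmed in the thread.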


Thank you so much. This does resolve the issue for me. I will run some tests to see if it breaks anything else, and if it doesn’t, I will close this!

It works for the application I needed it for as well. I’m not sure how you close topics here, but I would consider this solved!

Happy to hear that. I will submit a pull request with the change.