Malloc: tiny_free_list_remove_ptr: Internal invariant broken (next ptr of prev)

this wins the award for probably the most cryptic error message i’ve ever seen in my coding career. (the funniest one was “length of sex exceeds maximum value of 1”)

this crash happens in the context of my current project, FitSNAP-ReaxFF. i’m calling LAMMPS repeatedly through its python interface to calculate the energies and forces of many configurations of DFT training data, for different values of the force field parameters being optimized. it doesn’t make sense to reset lmp for every config (as fitsnap does by default), because i’m going through thousands of configurations with 4-60 atoms on every mpi rank at each step of the optimization algorithm. the overhead of initializing lmp, doing run 0, and shutting lmp down over and over again accounted for >50% of the running time.
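a rough sketch of the "keep one lmp alive" pattern described above. the per-config command sequence (delete_atoms group all, then read_data ... add append, then run 0) is an assumption for illustration, not the actual FitSNAP-ReaxFF code, and evaluate_configs / FakeLammps are hypothetical names. the driver only needs an object with command() and get_thermo(), which the lammps python module provides, so the loop can be exercised without LAMMPS installed:

```python
from typing import Iterable, List, Protocol


class LammpsLike(Protocol):
    """Minimal interface the driver relies on (matches lammps.lammps)."""

    def command(self, cmd: str) -> None: ...
    def get_thermo(self, name: str) -> float: ...


def evaluate_configs(lmp: LammpsLike, data_files: Iterable[str]) -> List[float]:
    """Evaluate every configuration on ONE long-lived LAMMPS instance.

    One-time setup (units, box, pair_style, fixes) is assumed to have
    been issued already, before this function is called.
    """
    energies = []
    for path in data_files:
        # ASSUMED swap sequence: drop the old atoms, then load the new ones
        lmp.command("delete_atoms group all")
        lmp.command(f"read_data {path} add append")
        lmp.command("run 0")  # single-point energy and forces
        energies.append(lmp.get_thermo("pe"))
    return energies


class FakeLammps:
    """Stand-in that only records commands, so the loop runs anywhere."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def command(self, cmd: str) -> None:
        self.log.append(cmd)

    def get_thermo(self, name: str) -> float:
        return 0.0  # placeholder value, not a real energy


fake = FakeLammps()
energies = evaluate_configs(
    fake,
    ["trnka2018-H3O.json.data", "trnka2018-GlyProSer-17-16.json.data"],
)
```

with the real module, the fake would be replaced by something like lammps(cmdargs=["-k", "on", "t", "1", "-sf", "kk"]) from the lammps package, created once and closed only at the very end.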

Python(22038,0x16e4bb000) malloc: tiny_free_list_remove_ptr: Internal invariant broken (next ptr of prev): ptr=0x13ad37640, prev_next=0x1ba
Python(22038,0x16e4bb000) malloc: *** set a breakpoint in malloc_error_break to debug
[macmini:22038] *** Process received signal ***
[macmini:22038] Signal: Abort trap: 6 (6)
[macmini:22038] Signal code:  (0)
[macmini:22038] [ 0] 0   libsystem_platform.dylib            0x00000001886a3584 _sigtramp + 56
[macmini:22038] [ 1] 0   libsystem_pthread.dylib             0x0000000188672c20 pthread_kill + 288
[macmini:22038] [ 2] 0   libsystem_c.dylib                   0x000000018857fa20 abort + 180
[macmini:22038] [ 3] 0   libsystem_malloc.dylib              0x000000018848faa8 malloc_vreport + 896
[macmini:22038] [ 4] 0   libsystem_malloc.dylib              0x00000001884b3ea8 malloc_zone_error + 104
[macmini:22038] [ 5] 0   libsystem_malloc.dylib              0x0000000188488238 tiny_free_list_remove_ptr + 500
[macmini:22038] [ 6] 0   libsystem_malloc.dylib              0x000000018848797c tiny_free_no_lock + 1060
[macmini:22038] [ 7] 0   libsystem_malloc.dylib              0x00000001884873d4 free_tiny + 496
[macmini:22038] [ 8] 0   liblammps.0.dylib                   0x0000000136afde00 _ZNK6Kokkos9HostSpace15impl_deallocateEPKcPvmm28Kokkos_Profiling_SpaceHandle + 212
[macmini:22038] [ 9] 0   liblammps.0.dylib                   0x0000000136afdf74 _ZNK6Kokkos9HostSpace10deallocateEPKcPvmm + 244
[macmini:22038] [10] 0   liblammps.0.dylib                   0x0000000136afe124 _ZN6Kokkos4Impl28SharedAllocationRecordCommonINS_9HostSpaceEED2Ev + 100
[macmini:22038] [11] 0   liblammps.0.dylib                   0x000000013671989c _ZN6Kokkos4Impl22SharedAllocationRecordINS_9HostSpaceENS0_16ViewValueFunctorINS_6DeviceINS_6OpenMPES2_EEdLb1EEEED0Ev + 124
[macmini:22038] [12] 0   liblammps.0.dylib                   0x0000000136b07b40 _ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_ + 92
[macmini:22038] [13] 0   liblammps.0.dylib                   0x0000000136650cb8 _ZN6Kokkos4Impl23SharedAllocationTracker13assign_directERKS1_ + 56
[macmini:22038] [14] 0   liblammps.0.dylib                   0x0000000136818ca8 _ZN9LAMMPS_NS20FixACKS2ReaxFFKokkosIN6Kokkos6OpenMPEE19sparse_matvec_acks2ERNS1_4ViewIPdJNS1_11LayoutRightES2_vEEES8_ + 680
[macmini:22038] [15] 0   liblammps.0.dylib                   0x000000013681951c _ZN9LAMMPS_NS20FixACKS2ReaxFFKokkosIN6Kokkos6OpenMPEE14bicgstab_solveEv + 44
[macmini:22038] [16] 0   liblammps.0.dylib                   0x000000013681d03c _ZN9LAMMPS_NS20FixACKS2ReaxFFKokkosIN6Kokkos6OpenMPEE9pre_forceEi + 3516
[macmini:22038] [17] 0   liblammps.0.dylib                   0x00000001366caef0 _ZN9LAMMPS_NS12ModifyKokkos15setup_pre_forceEi + 496
[macmini:22038] [18] 0   liblammps.0.dylib                   0x000000013684a254 _ZN9LAMMPS_NS12VerletKokkos5setupEi + 532
[macmini:22038] [19] 0   liblammps.0.dylib                   0x0000000136534ba4 _ZN9LAMMPS_NS3Run7commandEiPPc + 3812
[macmini:22038] [20] 0   liblammps.0.dylib                   0x00000001363d33d4 _ZN9LAMMPS_NS5Input15execute_commandEv + 1844
[macmini:22038] [21] 0   liblammps.0.dylib                   0x00000001363d4638 _ZN9LAMMPS_NS5Input3oneERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE + 120
[macmini:22038] [22] 0   liblammps.0.dylib                   0x0000000136411da0 lammps_command + 124
[macmini:22038] [23] 0   libffi.dylib                        0x0000000199628050 ffi_call_SYSV + 80
[macmini:22038] [24] 0   libffi.dylib                        0x0000000199630ae0 ffi_call_int + 1212
[macmini:22038] [25] 0   _ctypes.cpython-313-darwin.so       0x00000001042f3838 _ctypes_callproc + 940
[macmini:22038] [26] 0   _ctypes.cpython-313-darwin.so       0x00000001042e9480 PyCFuncPtr_call + 256
[macmini:22038] [27] 0   Python                              0x0000000103e57da0 _PyEval_EvalFrameDefault + 71332
[macmini:22038] [28] 0   Python                              0x0000000103cb2024 method_vectorcall + 396
[macmini:22038] [29] 0   Python                              0x0000000103f9ef3c thread_run + 160

after creating a minimal script and data

trnka2018-segfault.in (335 Bytes)
reaxff-trnka2018.ff (11.5 KB)
trnka2018-H3O.json.data (399 Bytes)
trnka2018-GlyProSer-17-16.json.data (1.8 KB)

to replicate what the python interface is feeding lammps, and recompiling with:

-D CMAKE_BUILD_TYPE=DEBUG
-D Kokkos_ENABLE_DEBUG=on
-D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=on
-D Kokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK=on

this is what i got:

Kokkos::View ERROR: out of bounds access label=("acks2/kk:jlist") with indices [6] but extents [6]
[...]
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x000000018863aa60 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000188672c20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000018857fa20 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000104981324 liblammps.0.dylib`Kokkos::Impl::host_abort(message="Kokkos::View ERROR: out of bounds access label=(\"acks2/kk:jlist\") with indices [6] but extents [6]") at Kokkos_Abort.cpp:40:10
    frame #4: 0x00000001040c2af0 liblammps.0.dylib`Kokkos::abort(message="Kokkos::View ERROR: out of bounds access label=(\"acks2/kk:jlist\") with indices [6] but extents [6]") at Kokkos_Abort.hpp:97:3
    frame #5: 0x00000001044a6f10 liblammps.0.dylib`void Kokkos::Impl::view_verify_operator_bounds<Kokkos::HostSpace, Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::OpenMP, void>, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::LayoutRight, Kokkos::OpenMP, void>, void>, long long>(tracker=0x000000016fdfc970, map=0x000000016fdfc978, (null)=6) at Kokkos_ViewMapping.hpp:3453:18
    frame #6: 0x00000001044a15e0 liblammps.0.dylib`std::enable_if<Kokkos::Impl::always_true<long long>::value && 1 == Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::OpenMP, void>::rank && Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::OpenMP, void>::is_default_map && !Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::OpenMP, void>::is_layout_stride, int&>::type Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::OpenMP, void>::operator()<long long>(this=0x000000016fdfc970, i0=6) const at Kokkos_View.hpp:905:5
    frame #7: 0x00000001044cdd14 liblammps.0.dylib`void LAMMPS_NS::FixACKS2ReaxFFKokkos<Kokkos::OpenMP>::compute_h_item<4>(this=0x000000016fdfc1c0, ii=0, m_fill=0x000000014a01ea88, final=0x000000016fdf8fdf) const at fix_acks2_reaxff_kokkos.cpp:616:16
    frame #8: 0x00000001044c6c88 liblammps.0.dylib`LAMMPS_NS::FixACKS2ReaxFFKokkosComputeHFunctor<Kokkos::OpenMP, 4>::operator()(this=0x000000016fdfc1b8, ii=0, m_fill=0x000000014a01ea88, final=0x000000016fdf8fdf) const at fix_acks2_reaxff_kokkos.h:298:41
    frame #9: 0x00000001044c1734 liblammps.0.dylib`_ZN6Kokkos4Impl12ParallelScanIN9LAMMPS_NS35FixACKS2ReaxFFKokkosComputeHFunctorINS_6OpenMPELi4EEENS_11RangePolicyIJS4_EEES4_E10exec_rangeIvEENSt9enable_ifIXsrSt7is_voidIT_E5valueEvE4typeERKS5_mmRxb(functor=0x000000016fdfc1b8, ibeg=0, iend=35, update=0x000000014a01ea88, final=true) at Kokkos_OpenMP_Parallel_Scan.hpp:54:14
    frame #10: 0x00000001044d81c4 liblammps.0.dylib`_ZNK6Kokkos4Impl12ParallelScanIN9LAMMPS_NS35FixACKS2ReaxFFKokkosComputeHFunctorINS_6OpenMPELi4EEENS_11RangePolicyIJS4_EEES4_E7executeEv._omp_fn.0((null)=0x000000016fdfb128) at Kokkos_OpenMP_Parallel_Scan.hpp:137:49
    frame #11: 0x00000001004c6f3c libgomp.1.dylib`GOMP_parallel + 84
    frame #12: 0x00000001044b902c liblammps.0.dylib`Kokkos::Impl::ParallelScan<LAMMPS_NS::FixACKS2ReaxFFKokkosComputeHFunctor<Kokkos::OpenMP, 4>, Kokkos::RangePolicy<Kokkos::OpenMP>, Kokkos::OpenMP>::execute(this=0x000000016fdfc1b0) const at Kokkos_OpenMP_Parallel_Scan.hpp:95:9
    frame #13: 0x00000001044ad5ac liblammps.0.dylib`void Kokkos::parallel_scan<Kokkos::RangePolicy<Kokkos::OpenMP>, LAMMPS_NS::FixACKS2ReaxFFKokkosComputeHFunctor<Kokkos::OpenMP, 4>, void>(str="", policy=0x000000016fdfd270, functor=0x000000016fdfd380) at Kokkos_Parallel.hpp:360:18
    frame #14: 0x00000001044a5a60 liblammps.0.dylib`void Kokkos::parallel_scan<LAMMPS_NS::FixACKS2ReaxFFKokkosComputeHFunctor<Kokkos::OpenMP, 4>>(str="", work_count=35, functor=0x000000016fdfd380) at Kokkos_Parallel.hpp:382:16
    frame #15: 0x000000010449fd0c liblammps.0.dylib`void Kokkos::parallel_scan<LAMMPS_NS::FixACKS2ReaxFFKokkosComputeHFunctor<Kokkos::OpenMP, 4>>(work_count=35, functor=0x000000016fdfd380) at Kokkos_Parallel.hpp:387:26
    frame #16: 0x00000001044968ec liblammps.0.dylib`LAMMPS_NS::FixACKS2ReaxFFKokkos<Kokkos::OpenMP>::pre_force(this=0x000000014a01b000, (null)=2) at fix_acks2_reaxff_kokkos.cpp:294:28
    frame #17: 0x0000000104496144 liblammps.0.dylib`LAMMPS_NS::FixACKS2ReaxFFKokkos<Kokkos::OpenMP>::setup_pre_force(this=0x000000014a01b000, vflag=2) at fix_acks2_reaxff_kokkos.cpp:187:12
    frame #18: 0x0000000104315028 liblammps.0.dylib`LAMMPS_NS::ModifyKokkos::setup_pre_force(this=0x0000000148f103a0, vflag=2) at modify_kokkos.cpp:184:46
    frame #19: 0x0000000104518dfc liblammps.0.dylib`LAMMPS_NS::VerletKokkos::setup(this=0x0000000148f11590, flag=1) at verlet_kokkos.cpp:119:26
    frame #20: 0x00000001040a8cb0 liblammps.0.dylib`LAMMPS_NS::Run::command(this=0x0000600003aa8240, narg=1, arg=0x0000600002bb4480) at run.cpp:171:31
    frame #21: 0x0000000103e5e410 liblammps.0.dylib`LAMMPS_NS::Input::execute_command(this=0x0000000148f07dd0) at input.cpp:868:17
    frame #22: 0x0000000103e5b0c4 liblammps.0.dylib`LAMMPS_NS::Input::file(this=0x0000000148f07dd0) at input.cpp:313:24
    frame #23: 0x0000000100003a60 lmp`main(argc=9, argv=0x000000016fdff508) at main.cpp:78:24
    frame #24: 0x00000001882ea0e0 dyld`start + 2360

this only happens with fix acks2/kk, not with fix qeq/kk. running this minimal test with the serial fix acks2 gives the following error message:

ERROR: Too many ghost atoms (src/REAXFF/pair_reaxff.cpp:589)

but i gave up on classic reaxff a long time ago, and i’ve been using lmp -k on t 1 -sf kk -in ... for a while now. H3O has 4 atoms but GlyProSer has 35 atoms, so this appears to be a bug in fix acks2/kk when the number of atoms increases after the fix has been initialized. this bug likely also shows up in other situations where the number of atoms grows during the simulation, e.g. create_atoms, fix pour, fix deposit, …
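a toy python sketch of the suspected failure mode: a per-atom work array sized when the fix is set up, then indexed with a larger atom count later. this is not the LAMMPS/Kokkos code and not the actual fix; PerAtomScratch and its methods are hypothetical names, and the grow-before-use check is just the generic remedy for this class of bug:

```python
class PerAtomScratch:
    """Toy stand-in for a fix that keeps per-atom work arrays."""

    def __init__(self, natoms: int) -> None:
        # arrays are sized once, for the atom count seen at setup
        self.capacity = natoms
        self.work = [0.0] * natoms

    def compute_stale(self, natoms: int) -> None:
        # bug pattern: trusts the setup-time size even if natoms has grown
        for i in range(natoms):
            self.work[i] = float(i)  # IndexError once i reaches capacity

    def compute_checked(self, natoms: int) -> None:
        # remedy pattern: grow the arrays before use when natoms exceeds them
        if natoms > self.capacity:
            self.work = [0.0] * natoms
            self.capacity = natoms
        for i in range(natoms):
            self.work[i] = float(i)


# set up with 4 atoms (like H3O), then evaluate 35 atoms (like GlyProSer)
stale = PerAtomScratch(4)
try:
    stale.compute_stale(35)
    overflowed = False
except IndexError:
    overflowed = True  # python bounds-checks, like the Kokkos debug build

checked = PerAtomScratch(4)
checked.compute_checked(35)
```

in a non-debug C++ build there is no bounds check, so the same pattern silently writes past the end of the allocation, which fits the heap corruption that malloc complained about in the first trace.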

any suggestions on how to fix this, @stamoor? anyone else is welcome to comment too…

Agree this looks like a bug in fix acks2/kk, I will take a look.

Using "post no" in the run command could help; see the run command page in the LAMMPS documentation. It is probably still not as fast as calling LAMMPS from Python, though.
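For reference, a tiny Python helper showing the syntax of that suggestion. The helper name run_zero_command is hypothetical; "post" is a documented keyword of the run command, and setting it to "no" skips the full timing summary printed after the run. How much time this saves for systems this small is not measured here.

```python
def run_zero_command(post: bool = False) -> str:
    """Build a LAMMPS 'run 0' command string.

    With post=False the command carries 'post no', which asks LAMMPS
    to skip the full timing summary printed at the end of the run.
    """
    return "run 0" if post else "run 0 post no"


cmd_fast = run_zero_command()
cmd_default = run_zero_command(post=True)
```

The resulting string can be used as a line in an input script or passed to the command() method of the Python interface.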

bugfix in commit a8a93dc of draft PR 4441.

sometimes just talking to myself on matsci is enough to fix a bug :crazy_face:


I do this a lot by explaining my problems to colleagues, even if they have no clue what I am talking about. I call it “verbal debugging”. Telling somebody else about a problem forces you to sort your thoughts and get clear about what the issue is, and in doing so you often discover the thing you missed, or come across an idea that you had not applied or had discarded initially.
