Why does GPU RAM usage for KOKKOS pair_style hybrid/overlay scale worse with the number of atoms than for the pair_styles used separately?

mkanski · February 4, 2026, 2:53pm

I use GPU acceleration through KOKKOS and have noticed that when I use pair_style hybrid/overlay with the reaxff and table pair_styles, GPU RAM usage grows faster with the number of atoms than when the pair_styles are used separately (for sufficiently large systems). For 5M atoms, hybrid/overlay uses 15% more RAM than the sum of two simulations with the potentials used separately. I wonder whether this overhead can be reduced, as GPU RAM is a precious resource.

LAMMPS version: 4 Feb 2025 - Development - patch_4Feb2025-446-g6f0a59372c-modified

The input script I used for this test is attached at the bottom; it is based on examples/reaxff/ci-reaxFF/ and should work on Linux. I also ran the script starting from large systems and the results did not change, so memory leaks are not the cause.

The GPU RAM usage was measured with nvidia-smi. The graphs below show the GPU memory usage (left) and the overhead (right). I also wonder why the overhead saturates at ~10M atoms.
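
To make the overhead explicit, here is a minimal Python sketch of how it can be computed from the three nvidia-smi readings taken for each system size. It assumes the overhead is the hybrid/overlay usage relative to the sum of the two separate runs, as described above; the function name and the example numbers are placeholders for illustration, not measured values.

```python
def overlay_overhead_percent(hybrid_mib, reaxff_mib, table_mib):
    """Extra GPU memory used by hybrid/overlay, in percent of the
    summed usage of the separate reaxff and table runs.

    All arguments are memory readings in MiB, as printed by nvidia-smi.
    """
    separate_total = reaxff_mib + table_mib
    if separate_total <= 0:
        raise ValueError("summed memory of the separate runs must be positive")
    return 100.0 * (hybrid_mib - separate_total) / separate_total


# Placeholder readings (NOT measurements), chosen only to show the arithmetic.
print(overlay_overhead_percent(hybrid_mib=23000, reaxff_mib=18000, table_mib=2000))
```

A positive value means hybrid/overlay needs more memory than the two separate runs combined; zero means the usage is simply additive.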

Input script:

#ci-reax potential for CH systems with tabulated ZBL correction

label loop
variable        n loop 34

atom_style      charge
units           real

read_data       CH4.dat

replicate       $n $n $n

pair_style      hybrid/overlay reaxff control checkqeq no table linear 11000
pair_coeff      * * reaxff ffield.ci-reax.CH C H
pair_coeff      1 1 table ci-reaxFF_ZBL.dat CC_cireaxFF 
pair_coeff      1 2 table ci-reaxFF_ZBL.dat CH_cireaxFF 
pair_coeff      2 2 table ci-reaxFF_ZBL.dat HH_cireaxFF

run             0 post no

shell "nvidia-smi | grep kokkos | awk '{print $(NF-1)}'"

clear

atom_style      charge
units           real

read_data       CH4.dat

replicate       $n $n $n

pair_style      reaxff control checkqeq no
pair_coeff      * * ffield.ci-reax.CH C H

run             0 post no
shell "nvidia-smi | grep kokkos | awk '{print $(NF-1)}'"

clear

atom_style      charge
units           real

read_data       CH4.dat

replicate       $n $n $n

pair_style      table linear 11000
pair_coeff      1 1 ci-reaxFF_ZBL.dat CC_cireaxFF 
pair_coeff      1 2 ci-reaxFF_ZBL.dat CH_cireaxFF 
pair_coeff      2 2 ci-reaxFF_ZBL.dat HH_cireaxFF

run             0 post no
shell "nvidia-smi | grep kokkos | awk '{print $(NF-1)}'"

next            n
clear
jump            SELF loop

stamoor · February 4, 2026, 6:21pm

@mkanski can you post the full minimal working example, i.e. including CH4.dat, ffield.ci-reax.CH, and anything else needed to run? Thanks.

mkanski · February 4, 2026, 6:24pm

Hi, the files are in the LAMMPS directory I mentioned (examples/reaxff/ci-reaxFF), but I’m also attaching them below. Thank you for looking into this.

ffield.ci-reax.CH (6.4 KB)

control (26 Bytes)

ci-reaxFF_ZBL.dat (1.3 MB)

CH4.dat (55.4 KB)

in.ci-reax.CH_mod (1.3 KB)

stamoor · February 4, 2026, 6:27pm

Ah, I missed that it was already in the examples. Thanks.