LAMMPS combined GPU: ERROR on proc 1: Neighbor list problem on the GPU

Hi, I want to use the GPU accelerator package, but I get the following error:

ERROR on proc 1: Neighbor list problem on the GPU. Try increasing the value of 'neigh_modify one' or the GPU neighbor list 'binsize'. (../fix_gpu.cpp:340)

Here is my input file:

units lj
dimension 3
boundary p p p
atom_style full
special_bonds fene
package gpu 1 neigh yes

read_data lammps.data extra/bond/per/atom 10 extra/special/per/atom 100

group colloid type 1
group polymer type 2 3 4
group reaction type 1 2

#kspace_style pppm 1e-6
pair_style lj/expand 0.448 #lj/expand/coul/long 0.448 5.0
pair_modify shift yes
pair_coeff 1 1 0.0 0.2 0.0 0.224
pair_coeff 1 2 1.0 0.4 -0.1 2.5
pair_coeff 1 3 0.1 0.4 -0.1 0.448
pair_coeff 1 4 0.1 0.4 -0.1 0.448
pair_coeff 2 2 0.1 0.2 0.0 0.224
pair_coeff 2 3 0.1 0.2 0.0 0.224
pair_coeff 2 4 0.1 0.2 0.0 0.224
pair_coeff 3 3 0.1 0.2 0.0 0.224
pair_coeff 3 4 0.1 0.2 0.0 0.224
pair_coeff 4 4 0.1 0.2 0.0 0.224

bond_style fene/expand
bond_coeff 1 300.0 1.0 0.1 0.2 0.0
bond_coeff 2 300.0 1.0 0.1 0.4 -0.1

neighbor 1.0 bin
neigh_modify delay 0 every 1 check yes
thermo 1000
thermo_style custom step temp ke pe etotal

#dump output all custom 1000 output.lammpstrj id type x y z ix iy iz
#dump_modify output sort id

#fix step1 colloid rigid/nve single langevin 1.0 1.0 0.1 85723610
fix fix_colloid colloid setforce 0 0 0
fix step2 polymer nve
fix step3 polymer langevin 1.0 1.0 1.0 93627450
fix reaction_step reaction bond/create 5000 1 2 0.348 2 iparam 1 1 jparam 1 2 prob 0.5 85784221
timestep 0.001
run 1000000

write_data final.data nocoeff

#unfix step1
unfix fix_colloid
unfix step2
unfix step3
unfix reaction_step
clear
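
For reference, the two remedies the error message suggests would look roughly like this in the input file (a sketch only; the `one 5000` and `binsize 3.4` values are guesses I have not verified, chosen relative to the neighbor cutoff of 3.4 reported in the log):

```
# Hypothetical adjustments suggested by the error message (values are guesses):
neigh_modify delay 0 every 1 check yes one 5000   # raise the max neighbors per atom
package gpu 1 neigh yes binsize 3.4               # bin at the neighbor-list cutoff
```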

I run it with the command: mpirun -np 2 lmp_mpi -sf gpu -in in.lammps
Here is my GPU information:

Found 1 platform(s).
CUDA Driver Version:                           12.80

Device 0: "NVIDIA GeForce RTX 5080"
  Type of device:                                GPU
  Compute capability:                            12
  Double precision support:                      Yes
  Total amount of global memory:                 15.4791 GB
  Number of compute units/multiprocessors:       84
  Number of cores:                               16128
  Total amount of constant memory:               65536 bytes
  Total amount of local/shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum group size (# of threads per block)    1024 x 1024 x 64
  Maximum item sizes (# threads for each dim)    2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    2.655 GHz
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No

I also tested the LAMMPS examples/shear case with mpirun -np 2 lmp_mpi -sf gpu -in in.shear, and it runs successfully:

LAMMPS (29 Aug 2024 - Update 1)
Lattice spacing in x,y,z = 3.52 3.52 3.52
Created orthogonal box = (0 0 0) to (56.32 35.2 9.956063)
  2 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 3.52 4.9780317 4.9780317
Created 1912 atoms
  using lattice units in orthogonal box = (-0.005632 -0.00352 0) to (56.325632 35.20352 9.956063)
  create_atoms CPU = 0.000 seconds
Reading eam potential file Ni_u3.eam with DATE: 2007-06-11
264 atoms in group lower
264 atoms in group upper
528 atoms in group boundary
1384 atoms in group mobile
Setting atom values ...
  264 settings made for type
Setting atom values ...
  264 settings made for type
WARNING: Temperature for thermo pressure is not for group all (../thermo.cpp:533)

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE


--------------------------------------------------------------------------
- Using acceleration for eam:
-  with 1 proc(s) per device.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 5080, 84 CUs, 15/15 GB, 2.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 5080, 84 CUs, 2.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001
Per MPI rank memory allocation (min/avg/max) = 2.574 | 2.574 | 2.575 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press          Volume    
         0   300           -8317.4367      0             -8263.8066     -7100.6278      19547.02     
        25   218.96729     -8271.6701      0             -8232.526       5018.9555      19547.02     
        50   300           -8238.2244      0             -8184.5944      12937.814      19693.626    
        75   295.22062     -8232.3247      0             -8179.549       13096.236      19751.41     
       100   300           -8248.5066      0             -8194.8765      7631.67        19813.143    
Loop time of 0.594331 on 2 procs for 100 steps with 1912 atoms

Performance: 14.537 ns/day, 1.651 hours/ns, 168.256 timesteps/s, 321.706 katom-step/s
100.0% CPU use with 2 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.30009    | 0.44656    | 0.59303    |  21.9 | 75.14
Neigh   | 4.187e-06  | 4.929e-06  | 5.671e-06  |   0.0 |  0.00
Comm    | 0.00071203 | 0.026911   | 0.053111   |  16.0 |  4.53
Output  | 4.5775e-05 | 0.0021482  | 0.0042506  |   4.5 |  0.36
Modify  | 0.00031167 | 0.014323   | 0.028334   |  11.7 |  2.41
Other   |            | 0.1044     |            |       | 17.56

Nlocal:            956 ave         976 max         936 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:           1364 ave        1379 max        1349 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:              0 ave           0 max           0 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 4
Dangerous builds = 0
WARNING: Temperature for thermo pressure is not for group all (../thermo.cpp:533)


---------------------------------------------------------------------
      Device Time Info (average): 
---------------------------------------------------------------------
Data Transfer:   0.0071 s.
Neighbor copy:   0.0010 s.
Neighbor build:  0.0062 s.
Force calc:      0.2824 s.
Device Overhead: 0.1267 s.
CPU Neighbor:    0.0001 s.
CPU Cast/Pack:   0.0004 s.
CPU Driver_Time: 0.0016 s.
CPU Idle_Time:   0.1488 s.
Average split:   1.0000.
Max Mem / Proc:  0.70 MB.
Prefetch mode:   None.
Vector width:    32.
Lanes / atom:    4.
Pair block:      256.
Neigh block:     128.
Neigh mode:      Hybrid (binning on host) with subgroup support
---------------------------------------------------------------------


--------------------------------------------------------------------------
- Using acceleration for eam:
-  with 1 proc(s) per device.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 5080, 84 CUs, 15/15 GB, 2.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 5080, 84 CUs, 2.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001
Per MPI rank memory allocation (min/avg/max) = 2.574 | 2.574 | 2.575 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press          Volume    
         0   301.61467     -8248.5066      0             -8212.5608      6671.213       19834.251    
       100   305.24299     -8258.5336      0             -8222.1554      671.24999      19820.953    
       200   294.91959     -8256.7221      0             -8221.5742     -100.78904      19951.889    
       300   308.14232     -8252.9564      0             -8216.2327     -1188.1455      20066.072    
       400   299.14864     -8246.529       0             -8210.8771     -566.39181      20098.934    
       500   291.26312     -8238.3115      0             -8203.5994      1088.1969      20167.288    
       600   290.23348     -8230.4777      0             -8195.8883      3509.3183      20273.78     
       700   300           -8221.1504      0             -8185.3971      5951.52        20398.367    
       800   294.03107     -8206.273       0             -8171.231       11114.423      20508.297    
       900   294.38115     -8195.0448      0             -8159.9611      13565.239      20635.327    
      1000   300.21982     -8183.5656      0             -8147.786       16578.22       20766.022    
      1100   306.88214     -8167.459       0             -8130.8854      20551.628      20877.174    
      1200   297.88108     -8156.1944      0             -8120.6936      22500.643      21007.868    
      1300   300           -8149.5453      0             -8113.792       20848.779      21121.462    
      1400   300           -8141.6935      0             -8105.9401      20037.619      21249.714    
      1500   300           -8134.2882      0             -8098.5348      18731.246      21374.301    
      1600   300           -8133.3805      0             -8097.6272      15445.526      21494.003    
      1700   300           -8146.3578      0             -8110.6044      9680.3758      21613.705    
      1800   307.69417     -8151.5348      0             -8114.8645      9399.931       21732.185    
      1900   300           -8159.8865      0             -8124.1331      8478.6512      21861.658    
      2000   300           -8160.1243      0             -8124.371       7536.8943      21980.138    
      2100   307.27369     -8163.9448      0             -8127.3246      3854.3162      22107.168    
      2200   307.26749     -8163.2971      0             -8126.6776      2361.9376      22224.427    
      2300   300           -8168.2346      0             -8132.4813      1945.4271      22347.793    
      2400   300           -8179.5165      0             -8143.7631     -1979.8264      22468.716    
      2500   302.78588     -8179.0012      0             -8142.9158     -4748.8636      22594.524    
      2600   304.81386     -8174.8013      0             -8138.4742     -4885.6959      22709.34     
      2700   300.78581     -8176.429       0             -8140.582      -6164.9424      22840.035    
      2800   300           -8174.6046      0             -8138.8513     -7735.8932      22967.065    
      2900   300           -8171.2686      0             -8135.5153     -9678.5758      23080.659    
      3000   295.35661     -8168.4433      0             -8133.2433     -9148.2385      23204.025    
Loop time of 18.0334 on 2 procs for 3000 steps with 1912 atoms

Performance: 14.373 ns/day, 1.670 hours/ns, 166.358 timesteps/s, 318.076 katom-step/s
100.0% CPU use with 2 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 9.357      | 13.68      | 18.004     | 116.9 | 75.86
Neigh   | 0.00014138 | 0.00014883 | 0.00015627 |   0.0 |  0.00
Comm    | 0.016654   | 1.1698     | 2.3229     | 106.6 |  6.49
Output  | 0.00031152 | 0.00037786 | 0.00044421 |   0.0 |  0.00
Modify  | 0.0079357  | 0.43592    | 0.86391    |  64.8 |  2.42
Other   |            | 2.747      |            |       | 15.23

Nlocal:            956 ave         962 max         950 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:         1373.5 ave        1375 max        1372 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:              0 ave           0 max           0 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 226
Dangerous builds = 0


---------------------------------------------------------------------
      Device Time Info (average): 
---------------------------------------------------------------------
Data Transfer:   0.2189 s.
Neighbor copy:   0.0043 s.
Neighbor build:  0.2977 s.
Force calc:      8.4098 s.
Device Overhead: 2.6529 s.
CPU Neighbor:    0.0040 s.
CPU Cast/Pack:   0.0106 s.
CPU Driver_Time: 0.0340 s.
CPU Idle_Time:   4.3663 s.
Average split:   1.0000.
Max Mem / Proc:  0.62 MB.
Prefetch mode:   None.
Vector width:    32.
Lanes / atom:    4.
Pair block:      256.
Neigh block:     128.
Neigh mode:      Hybrid (binning on host) with subgroup support
---------------------------------------------------------------------

Total wall time: 0:00:19

How can I solve this error?

With some debugging. For example:

  • can you run the same input on the CPU?
  • have you checked that your data file is valid?
  • what is the output of your crashing run up to the crash?
  • have you applied any of the suggested fixes?

Yes, when I run this input on the CPU only, it completes successfully:

LAMMPS (29 Aug 2024 - Update 1)
Reading data file ...
  orthogonal box = (0 0 0) to (50 50 50)
  1 by 1 by 2 MPI processor grid
  reading atoms ...
  4000 atoms
  scanning bonds ...
  11 = max bonds/atom
  orthogonal box = (0 0 0) to (50 50 50)
  1 by 1 by 2 MPI processor grid
  reading bonds ...
  2000 bonds
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        1        1       
  special bond factors coul:  0        1        1       
     2 = max # of 1-2 neighbors
   102 = max # of special neighbors
  special bonds CPU = 0.000 seconds
  read_data CPU = 0.006 seconds
1000 atoms in group colloid
3000 atoms in group polymer
2000 atoms in group reaction

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- Type Label Framework: https://doi.org/10.1021/acs.jpcb.3c08419
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 6 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 3.4
  ghost atom cutoff = 3.4
  binsize = 1.7, bins = 30 30 30
  2 neighbor lists, perpetual/occasional/extra = 1 1 0
  (1) pair lj/expand, perpetual
      attributes: half, newton on
      pair build: half/bin/newton
      stencil: half/bin/3d
      bin: standard
  (2) fix bond/create, occasional, copy from (1)
      attributes: half, newton on
      pair build: copy
      stencil: none
      bin: none
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.001
Per MPI rank memory allocation (min/avg/max) = 16 | 16.01 | 16.01 Mbytes
   Step          Temp          KinEng         PotEng         TotEng    
         0   0              0              33.471532      33.471532    
      1000   4.3479616      6.5203118      8.1060436      14.626355    
      2000   2.0499511      3.0741579      4.9680097      8.0421676    
      3000   1.2148436      1.8218099      3.8484939      5.6703037    
      4000   0.92266848     1.3836567      3.4157738      4.7994306    
      5000   0.80382747     1.2054398      3.2554827      4.4609225    
Loop time of 0.491042 on 2 procs for 5000 steps with 4000 atoms

Performance: 879761.180 tau/day, 10182.421 timesteps/s, 40.730 Matom-step/s
99.8% CPU use with 2 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.062791   | 0.065567   | 0.068343   |   1.1 | 13.35
Bond    | 0.04252    | 0.046239   | 0.049958   |   1.7 |  9.42
Neigh   | 0.03113    | 0.031161   | 0.031193   |   0.0 |  6.35
Comm    | 0.057499   | 0.0616     | 0.065702   |   1.7 | 12.54
Output  | 6.3579e-05 | 9.6937e-05 | 0.00013029 |   0.0 |  0.02
Modify  | 0.23036    | 0.25028    | 0.2702     |   4.0 | 50.97
Other   |            | 0.0361     |            |       |  7.35

Nlocal:           2000 ave        2059 max        1941 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:         1362.5 ave        1443 max        1282 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:        11077.5 ave       11371 max       10784 min
Histogram: 1 0 0 0 0 0 0 0 0 1

Total # of neighbors = 22155
Ave neighs/atom = 5.53875
Ave special neighs/atom = 1
Neighbor list builds = 51
Dangerous builds = 0
System init for write_data ...
Generated 0 of 6 mixed pair_coeff terms from geometric mixing rule
Total wall time: 0:00:00

I also tried increasing the binsize with mpirun -np 2 lmp_mpi -sf gpu -pk gpu 1 binsize 12 -in NH2-in.multi-lammps, but I get the same error:

LAMMPS (29 Aug 2024 - Update 1)
Reading data file ...
  orthogonal box = (0 0 0) to (50 50 50)
  1 by 1 by 2 MPI processor grid
  reading atoms ...
  4000 atoms
  scanning bonds ...
  11 = max bonds/atom
  orthogonal box = (0 0 0) to (50 50 50)
  1 by 1 by 2 MPI processor grid
  reading bonds ...
  2000 bonds
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        1        1       
  special bond factors coul:  0        1        1       
     2 = max # of 1-2 neighbors
   102 = max # of special neighbors
  special bonds CPU = 0.000 seconds
  read_data CPU = 0.008 seconds
1000 atoms in group colloid
3000 atoms in group polymer
2000 atoms in group reaction

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
- Type Label Framework: https://doi.org/10.1021/acs.jpcb.3c08419
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE


--------------------------------------------------------------------------
- Using acceleration for lj/expand:
-  with 2 proc(s) per device.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 5080, 84 CUs, 15/15 GB, 2.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.

Generated 0 of 6 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 3.4
  ghost atom cutoff = 3.4
  binsize = 1.7, bins = 30 30 30
  1 neighbor lists, perpetual/occasional/extra = 0 1 0
  (1) fix bond/create, occasional
      attributes: half, newton on
      pair build: half/bin/newton
      stencil: half/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.001
ERROR on proc 0: Neighbor list problem on the GPU. Try increasing the value of 'neigh_modify one' or the GPU neighbor list 'binsize'. (../fix_gpu.cpp:340)
Last command: run 5000
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
ERROR on proc 1: Neighbor list problem on the GPU. Try increasing the value of 'neigh_modify one' or the GPU neighbor list 'binsize'. (../fix_gpu.cpp:340)
Last command: run 5000

Dear akohlmey, I performed the relevant tests above.

Not really. You did the one obvious thing and ran on the CPU, but then you made only one more change, with no system or scheme to learn something from it.

The only thing that I have learned is that it is pretty pointless to even try GPUs for this, not even one, let alone two. The job takes half a second. How much faster do you need it?
You have spent more time posting here than you would ever save, if anything, from GPU acceleration.

I’m aware of this; I used a very short run here just to illustrate that it won’t run on the GPU. The actual system needs to run for quite a long time.