Hi, I want to use the GPU accelerator package, but my run stops with the following error:
ERROR on proc 1: Neighbor list problem on the GPU. Try increasing the value of 'neigh_modify one' or the GPU neighbor list 'binsize'. (../fix_gpu.cpp:340)
Here is my input file:
units lj
dimension 3
boundary p p p
atom_style full
special_bonds fene
package gpu 1 neigh yes
read_data lammps.data extra/bond/per/atom 10 extra/special/per/atom 100
group colloid type 1
group polymer type 2 3 4
group reaction type 1 2
#kspace_style pppm 1e-6
pair_style lj/expand 0.448 #lj/expand/coul/long 0.448 5.0
pair_modify shift yes
pair_coeff 1 1 0.0 0.2 0.0 0.224
pair_coeff 1 2 1.0 0.4 -0.1 2.5
pair_coeff 1 3 0.1 0.4 -0.1 0.448
pair_coeff 1 4 0.1 0.4 -0.1 0.448
pair_coeff 2 2 0.1 0.2 0.0 0.224
pair_coeff 2 3 0.1 0.2 0.0 0.224
pair_coeff 2 4 0.1 0.2 0.0 0.224
pair_coeff 3 3 0.1 0.2 0.0 0.224
pair_coeff 3 4 0.1 0.2 0.0 0.224
pair_coeff 4 4 0.1 0.2 0.0 0.224
bond_style fene/expand
bond_coeff 1 300.0 1.0 0.1 0.2 0.0
bond_coeff 2 300.0 1.0 0.1 0.4 -0.1
neighbor 1.0 bin
neigh_modify delay 0 every 1 check yes
thermo 1000
thermo_style custom step temp ke pe etotal
#dump output all custom 1000 output.lammpstrj id type x y z ix iy iz
#dump_modify output sort id
#fix step1 colloid rigid/nve single langevin 1.0 1.0 0.1 85723610
fix fix_colloid colloid setforce 0 0 0
fix step2 polymer nve
fix step3 polymer langevin 1.0 1.0 1.0 93627450
fix reaction_step reaction bond/create 5000 1 2 0.348 2 iparam 1 1 jparam 1 2 prob 0.5 85784221
timestep 0.001
run 1000000
write_data final.data nocoeff
#unfix step1
unfix fix_colloid
unfix step2
unfix step3
unfix reaction_step
clear
I run it with the command: mpirun -np 2 lmp_mpi -sf gpu -in in.lammps
Here is my GPU information:
Found 1 platform(s).
CUDA Driver Version: 12.80
Device 0: "NVIDIA GeForce RTX 5080"
Type of device: GPU
Compute capability: 12
Double precision support: Yes
Total amount of global memory: 15.4791 GB
Number of compute units/multiprocessors: 84
Number of cores: 16128
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 2.655 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: Yes
Device has ECC support enabled: No
In addition, I tested the LAMMPS shear example with mpirun -np 2 lmp_mpi -sf gpu -in in.shear, and it runs successfully:
LAMMPS (29 Aug 2024 - Update 1)
Lattice spacing in x,y,z = 3.52 3.52 3.52
Created orthogonal box = (0 0 0) to (56.32 35.2 9.956063)
2 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 3.52 4.9780317 4.9780317
Created 1912 atoms
using lattice units in orthogonal box = (-0.005632 -0.00352 0) to (56.325632 35.20352 9.956063)
create_atoms CPU = 0.000 seconds
Reading eam potential file Ni_u3.eam with DATE: 2007-06-11
264 atoms in group lower
264 atoms in group upper
528 atoms in group boundary
1384 atoms in group mobile
Setting atom values ...
264 settings made for type
Setting atom values ...
264 settings made for type
WARNING: Temperature for thermo pressure is not for group all (../thermo.cpp:533)
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
- GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
--------------------------------------------------------------------------
- Using acceleration for eam:
- with 1 proc(s) per device.
- Horizontal vector operations: ENABLED
- Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 5080, 84 CUs, 15/15 GB, 2.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 5080, 84 CUs, 2.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------
Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 2.574 | 2.574 | 2.575 Mbytes
Step Temp E_pair E_mol TotEng Press Volume
0 300 -8317.4367 0 -8263.8066 -7100.6278 19547.02
25 218.96729 -8271.6701 0 -8232.526 5018.9555 19547.02
50 300 -8238.2244 0 -8184.5944 12937.814 19693.626
75 295.22062 -8232.3247 0 -8179.549 13096.236 19751.41
100 300 -8248.5066 0 -8194.8765 7631.67 19813.143
Loop time of 0.594331 on 2 procs for 100 steps with 1912 atoms
Performance: 14.537 ns/day, 1.651 hours/ns, 168.256 timesteps/s, 321.706 katom-step/s
100.0% CPU use with 2 MPI tasks x no OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.30009 | 0.44656 | 0.59303 | 21.9 | 75.14
Neigh | 4.187e-06 | 4.929e-06 | 5.671e-06 | 0.0 | 0.00
Comm | 0.00071203 | 0.026911 | 0.053111 | 16.0 | 4.53
Output | 4.5775e-05 | 0.0021482 | 0.0042506 | 4.5 | 0.36
Modify | 0.00031167 | 0.014323 | 0.028334 | 11.7 | 2.41
Other | | 0.1044 | | | 17.56
Nlocal: 956 ave 976 max 936 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 1364 ave 1379 max 1349 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 4
Dangerous builds = 0
WARNING: Temperature for thermo pressure is not for group all (../thermo.cpp:533)
---------------------------------------------------------------------
Device Time Info (average):
---------------------------------------------------------------------
Data Transfer: 0.0071 s.
Neighbor copy: 0.0010 s.
Neighbor build: 0.0062 s.
Force calc: 0.2824 s.
Device Overhead: 0.1267 s.
CPU Neighbor: 0.0001 s.
CPU Cast/Pack: 0.0004 s.
CPU Driver_Time: 0.0016 s.
CPU Idle_Time: 0.1488 s.
Average split: 1.0000.
Max Mem / Proc: 0.70 MB.
Prefetch mode: None.
Vector width: 32.
Lanes / atom: 4.
Pair block: 256.
Neigh block: 128.
Neigh mode: Hybrid (binning on host) with subgroup support
---------------------------------------------------------------------
--------------------------------------------------------------------------
- Using acceleration for eam:
- with 1 proc(s) per device.
- Horizontal vector operations: ENABLED
- Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 5080, 84 CUs, 15/15 GB, 2.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 5080, 84 CUs, 2.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------
Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 2.574 | 2.574 | 2.575 Mbytes
Step Temp E_pair E_mol TotEng Press Volume
0 301.61467 -8248.5066 0 -8212.5608 6671.213 19834.251
100 305.24299 -8258.5336 0 -8222.1554 671.24999 19820.953
200 294.91959 -8256.7221 0 -8221.5742 -100.78904 19951.889
300 308.14232 -8252.9564 0 -8216.2327 -1188.1455 20066.072
400 299.14864 -8246.529 0 -8210.8771 -566.39181 20098.934
500 291.26312 -8238.3115 0 -8203.5994 1088.1969 20167.288
600 290.23348 -8230.4777 0 -8195.8883 3509.3183 20273.78
700 300 -8221.1504 0 -8185.3971 5951.52 20398.367
800 294.03107 -8206.273 0 -8171.231 11114.423 20508.297
900 294.38115 -8195.0448 0 -8159.9611 13565.239 20635.327
1000 300.21982 -8183.5656 0 -8147.786 16578.22 20766.022
1100 306.88214 -8167.459 0 -8130.8854 20551.628 20877.174
1200 297.88108 -8156.1944 0 -8120.6936 22500.643 21007.868
1300 300 -8149.5453 0 -8113.792 20848.779 21121.462
1400 300 -8141.6935 0 -8105.9401 20037.619 21249.714
1500 300 -8134.2882 0 -8098.5348 18731.246 21374.301
1600 300 -8133.3805 0 -8097.6272 15445.526 21494.003
1700 300 -8146.3578 0 -8110.6044 9680.3758 21613.705
1800 307.69417 -8151.5348 0 -8114.8645 9399.931 21732.185
1900 300 -8159.8865 0 -8124.1331 8478.6512 21861.658
2000 300 -8160.1243 0 -8124.371 7536.8943 21980.138
2100 307.27369 -8163.9448 0 -8127.3246 3854.3162 22107.168
2200 307.26749 -8163.2971 0 -8126.6776 2361.9376 22224.427
2300 300 -8168.2346 0 -8132.4813 1945.4271 22347.793
2400 300 -8179.5165 0 -8143.7631 -1979.8264 22468.716
2500 302.78588 -8179.0012 0 -8142.9158 -4748.8636 22594.524
2600 304.81386 -8174.8013 0 -8138.4742 -4885.6959 22709.34
2700 300.78581 -8176.429 0 -8140.582 -6164.9424 22840.035
2800 300 -8174.6046 0 -8138.8513 -7735.8932 22967.065
2900 300 -8171.2686 0 -8135.5153 -9678.5758 23080.659
3000 295.35661 -8168.4433 0 -8133.2433 -9148.2385 23204.025
Loop time of 18.0334 on 2 procs for 3000 steps with 1912 atoms
Performance: 14.373 ns/day, 1.670 hours/ns, 166.358 timesteps/s, 318.076 katom-step/s
100.0% CPU use with 2 MPI tasks x no OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 9.357 | 13.68 | 18.004 | 116.9 | 75.86
Neigh | 0.00014138 | 0.00014883 | 0.00015627 | 0.0 | 0.00
Comm | 0.016654 | 1.1698 | 2.3229 | 106.6 | 6.49
Output | 0.00031152 | 0.00037786 | 0.00044421 | 0.0 | 0.00
Modify | 0.0079357 | 0.43592 | 0.86391 | 64.8 | 2.42
Other | | 2.747 | | | 15.23
Nlocal: 956 ave 962 max 950 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 1373.5 ave 1375 max 1372 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 226
Dangerous builds = 0
---------------------------------------------------------------------
Device Time Info (average):
---------------------------------------------------------------------
Data Transfer: 0.2189 s.
Neighbor copy: 0.0043 s.
Neighbor build: 0.2977 s.
Force calc: 8.4098 s.
Device Overhead: 2.6529 s.
CPU Neighbor: 0.0040 s.
CPU Cast/Pack: 0.0106 s.
CPU Driver_Time: 0.0340 s.
CPU Idle_Time: 4.3663 s.
Average split: 1.0000.
Max Mem / Proc: 0.62 MB.
Prefetch mode: None.
Vector width: 32.
Lanes / atom: 4.
Pair block: 256.
Neigh block: 128.
Neigh mode: Hybrid (binning on host) with subgroup support
---------------------------------------------------------------------
Total wall time: 0:00:19
How can I solve this error?