Substantially different number of neighbours when simulating EAM on GPU

Hi,

I have run the EAM benchmark (https://www.lammps.org/bench.html#eam) on my local workstation, both on GPU and CPU.
The number of neighbours on GPU is always zero when setting neigh yes:

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 140
Dangerous builds = 1
Total wall time: 0:00:01

But even when running with neigh no, I get a substantial difference between GPU and CPU. On GPU:

Total # of neighbors = 2416250
Ave neighs/atom = 75.507812
Neighbor list builds = 140
Dangerous builds = 1
Total wall time: 0:00:09

on CPU (single core):

Total # of neighbors = 1208113
Ave neighs/atom = 37.753531
Neighbor list builds = 140
Dangerous builds = 1
Total wall time: 0:00:25

on CPU (56 cores):

Total # of neighbors = 1208113
Ave neighs/atom = 37.753531
Neighbor list builds = 140
Dangerous builds = 1
Total wall time: 0:00:00

This is my input script:

# bulk Cu lattice

# Remove when running on CPU
package 	gpu 1 neigh no

variable        x index 1
variable        y index 1
variable        z index 1

variable        xx equal 20*$x
variable        yy equal 20*$y
variable        zz equal 20*$z

units           metal
atom_style      atomic

lattice         fcc 3.615
region          box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box      1 box
create_atoms    1 box

# Switch to 'pair_style eam' on CPU
pair_style      eam/gpu
pair_coeff      1 1 Cu_u3.eam

velocity        all create 1600.0 376847 loop geom

neighbor        1.0 bin
neigh_modify    every 1 delay 5 check yes

fix             1 all nve

timestep        0.005
thermo          50

run             1000

Is this a bug or am I missing some crucial point?

Michele

Yes. This is expected. That information is not collected when building the neighbor lists on the GPU.
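
As the neigh no runs above already show, the statistics come back when the host builds the neighbor lists, i.e. with

package gpu 1 neigh no
# or equivalently on the command line: -pk gpu 1 neigh no

instead of neigh yes.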

Yes. That is also expected. When running on the GPU, a full neighbor list is created, i.e. every pair is listed twice (as i-j and j-i), while the CPU version of the pair style uses a half neighbor list, i.e. every pair is listed only once. The reason is that with multi-threading (and GPU acceleration is effectively multi-threading at an extreme scale), updating the forces from a half neighbor list has a race condition, and avoiding it adds overhead that grows with the number of threads. With a full neighbor list that race condition does not occur, so even though the full list creates twice the amount of work, it can be parallelized with far less overhead; for a large number of threads this is much faster and shows much better strong-scaling behavior.
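
The factor of two is visible in the numbers above: 2 × 1208113 = 2416226, essentially the 2416250 reported by the GPU run, and 2 × 37.753531 ≈ 75.51 neighbors per atom versus the reported 75.507812; the small difference presumably comes from details of how the full list for the GPU pair style is built and counted.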

There is no bug. This is just how the different pair style variants work with the different settings.

Ah, I see. So with neigh yes it is simply an output ‘issue’; otherwise it is essentially an algorithmic optimisation.
I was wondering whether it was a bug or whether there was some difference in behaviour, since I am getting inconsistent results between GPU and CPU on another run (still EAM, but a slightly different system), and the neighbour list was the main suspect (based on my experience with MD and what I have found in the documentation).
Thank you very much for the clarification.

What do you mean by “inconsistent”?
Please note that you cannot expect identical trajectories, even when compiling the GPU package in full double precision, due to floating-point math limitations (the exact forces depend on the order in which the force contributions are summed up). This would also happen if you used a different number of processors on the CPU. With mixed-precision GPU kernels, this exponential divergence happens even faster.
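
A minimal illustration of that order dependence: in IEEE double precision, (0.1 + 0.2) + 0.3 evaluates to 0.6000000000000001 while 0.1 + (0.2 + 0.3) evaluates to 0.6; differences of that size in the summed forces are then amplified exponentially by the MD integration.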

Yes, indeed. Apparently I was just a bit unlucky, as I was unknowingly very close to the phase transition temperature, and a slight underestimate of the potential energy in mixed precision led to a huge change in the RDF. Far from the critical point I see very similar results.

Mixed precision forces carry more noise that has to be removed by the thermostat.

Quick update: it turns out the cause of the mismatch was not the floating-point precision, but rather the GPU library.
I had built with OpenCL instead of CUDA by mistake. After building with CUDA and running again, I don’t see any substantial difference in energies anymore.

This should not happen. Can you put together a minimal input that reproduces this difference and post it here? @ndtrung may be able to help track this down.

Hi, I can try. The only problem is that for some reason I cannot run with OpenCL anymore, as I run into this error: OpenCL error when running Lammps with gpu
I’ll re-build with OpenCL (hoping that will make the issue disappear) and run the EAM benchmark again.

OpenCL build:
cmake -D PKG_GPU=on -D GPU_API=opencl -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_KSPACE=on -D PKG_RIGID=on -D PKG_REAXFF=on -D PKG_EXTRA-DUMP=on -D PKG_EXTRA-FIX=on -D CMAKE_INSTALL_PREFIX=${pwd} ../cmake

CUDA build:
cmake -D PKG_GPU=on -D GPU_API=cuda -D PKG_GPU_ARCH=sm_86 -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_KSPACE=on -D PKG_RIGID=on -D PKG_REAXFF=on -D PKG_EXTRA-DUMP=on -D PKG_EXTRA-FIX=on -D CMAKE_INSTALL_PREFIX=${pwd} ../cmake

Energies from the metal benchmark (LAMMPS Benchmarks), OpenCL:

Step          Temp          E_pair         E_mol          TotEng         Press    
         0   1600          -113279.99      0             -106662.08      18703.582    
        50   781.69068     -109873.35      0             -106640.12      52273.14     
       100   801.83208     -109957.3       0             -106640.76      51322.875    
       150   794.6766      -109927.53      0             -106640.59      51669.368    
       200   795.75195     -109932.04      0             -106640.66      51629.021    
       250   797.5841      -109939.57      0             -106640.6       51564.266    
       300   793.23402     -109921.5       0             -106640.53      51741.371    
       350   799.1939      -109946.3       0             -106640.68      51462.036    
       400   797.16475     -109937.82      0             -106640.59      51585.94     
       450   794.86159     -109928.1       0             -106640.4       51651.626    
       500   796.77909     -109936.18      0             -106640.54      51565.449    
       550   796.68017     -109935.84      0             -106640.62      51581.171    
       600   799.55194     -109947.79      0             -106640.68      51434.124    
       650   800.8467      -109953.12      0             -106640.67      51407.925    
       700   791.74486     -109915.14      0             -106640.33      51729.695    
       750   801.43183     -109955.61      0             -106640.73      51340.853    
       800   797.91137     -109940.73      0             -106640.41      51527.923    
       850   803.69975     -109964.91      0             -106640.65      51321.868    
       900   805.41109     -109972.05      0             -106640.72      51274.688    
       950   799.24017     -109946.27      0             -106640.46      51489.492    
      1000   796.32238     -109933.96      0             -106640.21      51607.385

CUDA:

Step          Temp          E_pair         E_mol          TotEng         Press     
         0   1600          -113280         0             -106662.08      18703.562    
        50   781.6906      -109873.35      0             -106640.13      52273.122    
       100   801.83198     -109957.3       0             -106640.77      51322.827    
       150   794.67679     -109927.53      0             -106640.6       51669.331    
       200   795.7518      -109932.04      0             -106640.66      51628.926    
       250   797.58393     -109939.57      0             -106640.61      51564.244    
       300   793.23406     -109921.5       0             -106640.53      51741.353    
       350   799.19412     -109946.31      0             -106640.68      51461.911    
       400   797.16517     -109937.83      0             -106640.6       51585.684    
       450   794.86185     -109928.11      0             -106640.4       51651.436    
       500   796.77919     -109936.18      0             -106640.55      51565.508    
       550   796.68223     -109935.86      0             -106640.62      51580.94     
       600   799.55139     -109947.79      0             -106640.69      51434.143    
       650   800.85086     -109953.14      0             -106640.67      51407.514    
       700   791.7455      -109915.15      0             -106640.34      51729.425    
       750   801.43421     -109955.62      0             -106640.73      51340.493    
       800   797.91057     -109940.73      0             -106640.42      51527.563    
       850   803.70225     -109964.93      0             -106640.66      51322.097    
       900   805.40996     -109972.06      0             -106640.73      51274.791    
       950   799.22872     -109946.24      0             -106640.47      51489.012    
      1000   796.32717     -109933.99      0             -106640.23      51606.29

So this doesn’t seem so different. However, when I try to compute the surface tension of some molten Al film, I get quite different results.


I was in a hurry, so I forgot to add labels: the y-axis is the surface tension [bar*Å], the x-axis is the frame (multiply by 0.001 to get the time in ps), red is OpenCL and black is CUDA.

Input for surface tension:
slab_Al.data (591.2 KB)
surftens.in (1.0 KB)
Potential: Interatomic Potentials Repository
Running with: -pk gpu 1 -sf gpu

P.S. it’s worth mentioning that I also did a run on CPU, and it is very consistent with CUDA results.
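
Sketch for context, since surftens.in is attached rather than quoted: a common way to obtain the plotted quantity, assuming a slab with two free surfaces normal to z and metal units (so the result is in bar*Å), would be something like

# hypothetical sketch, not the attached surftens.in; the variable name is arbitrary
variable gamma equal 0.5*lz*(pzz-0.5*(pxx+pyy))   # Lz/2 * (Pzz - (Pxx+Pyy)/2)
thermo_style custom step temp press pxx pyy pzz v_gamma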

When I ran your first input script with the CUDA and OpenCL builds in double precision, the thermo numbers all matched the CPU-only runs (to within the printed precision) through the 1000 steps. Could you also build CUDA and OpenCL with double precision (-DGPU_PREC=double) and compare the surface tension with the CPU-only run again?
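
Concretely, that would just mean re-running the two cmake commands posted above with the precision flag added, e.g.

cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_PREC=double ... ../cmake
cmake -D PKG_GPU=on -D GPU_API=opencl -D GPU_PREC=double ... ../cmake

where "..." stands for the same package and install options as before, followed by a rebuild.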

Hi,
Thanks for the feedback.
I guess I can, but wouldn’t it defeat the purpose of running on a GPU?

The double precision builds would be helpful for debugging purposes. I am wondering whether the difference in the surface tension comes from the precision of the builds or from something else, and why the pressures jump to different values here. Could you also add pxx, pyy and pzz to the thermo output? Also, how substantial the performance difference between the mixed and double precision builds is would be informative, too. Thanks!
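
For example, something along the lines of

thermo_style custom step temp pe etotal press pxx pyy pzz

or the equivalent additions to whatever thermo_style the runs already use.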

Performance:

  • CPU (56 MPI proc.): 301.867 ns/day
  • CUDA sp: 347.456 ns/day
  • CUDA dp: 112.519 ns/day
  • OpenCL sp: 289.648 ns/day
  • OpenCL dp: 97.336 ns/day

The pressure components are very noisy, and I cannot tell from the thermo output alone which one is causing the mismatch. But in any case it seems to be an OpenCL problem, not a mixed vs. double precision problem.
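
If it turns out to be worth chasing, one way to smooth the components enough for a direct comparison between the builds would be a time average, sketched here using the default pressure compute thermo_press (the fix ID and file name are placeholders):

# time-average pxx, pyy, pzz to reduce the noise
fix pavg all ave/time 10 100 1000 c_thermo_press[1] c_thermo_press[2] c_thermo_press[3] file press_avg.dat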

I can go on and analyse the individual components of the pressure tensor, but frankly I am not planning to run with OpenCL in the future anyway, so…

Thanks for the tests and the performance numbers. I will look into the issue with the OpenCL build.