Numerical difference in output from LAMMPS CPU vs GPU

Why are the energy and pressure slightly off between the CPU and GPU runs, and what would be a possible way to make them equal? Any suggestions/comments?

clear

package gpu 1 neigh no binsize 5.0 newton off


units metal
boundary        p p p
atom_style      atomic
variable lp equal "2.86"
lattice bcc ${lp} orient x 1 0 0 orient y 0 1 0 orient z  0 0 1
region whole block 0 1 0 1 0 1 units lattice
create_box 1 whole
create_atoms 1 region whole
replicate 30 30 30
mass 1 1
# ---------- Define Interatomic Potential --------------------- 
pair_style lj/cut 2.5
pair_coeff * * 1 1

# ---------- Run Minimization --------------------- 
reset_timestep 0 
run 0

log.cpu

units metal
boundary        p p p
atom_style      atomic
variable lp equal "2.86"
lattice bcc ${lp} orient x 1 0 0 orient y 0 1 0 orient z  0 0 1
lattice bcc 2.86 orient x 1 0 0 orient y 0 1 0 orient z  0 0 1
Lattice spacing in x,y,z = 2.86 2.86 2.86
region whole block 0 1 0 1 0 1 units lattice
create_box 1 whole
Created orthogonal box = (0 0 0) to (2.86 2.86 2.86)
  1 by 1 by 2 MPI processor grid
create_atoms 1 region whole
Created 2 atoms
  using lattice units in orthogonal box = (0 0 0) to (2.86 2.86 2.86)
  create_atoms CPU = 0.000 seconds
replicate 30 30 30
Replication is creating a 30x30x30 = 27000 times larger system...
  orthogonal box = (0 0 0) to (85.8 85.8 85.8)
  1 by 1 by 2 MPI processor grid
  54000 atoms
  replicate CPU = 0.001 seconds
mass 1 1
# ---------- Define Interatomic Potential ---------------------
pair_style lj/cut 2.5
pair_coeff * * 1 1

# ---------- Run Minimization ---------------------
reset_timestep 0
run 0
WARNING: No fixes with time integration, atoms won't move (../verlet.cpp:60)
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 4.5
  ghost atom cutoff = 4.5
  binsize = 2.25, bins = 39 39 39
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut, perpetual
      attributes: half, newton on
      pair build: half/bin/atomonly/newton
      stencil: half/bin/3d
      bin: standard
Per MPI rank memory allocation (min/avg/max) = 10.22 | 10.22 | 10.22 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   0             -3726.0493      0             -3726.0493     -18820.611    
Loop time of 9.615e-07 on 2 procs for 0 steps with 54000 atoms

0.0% CPU use with 2 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0          | 0          | 0          |   0.0 |  0.00
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0          | 0          | 0          |   0.0 |  0.00
Output  | 0          | 0          | 0          |   0.0 |  0.00
Modify  | 0          | 0          | 0          |   0.0 |  0.00
Other   |            | 9.615e-07  |            |       |100.00

Nlocal:          27000 ave       27000 max       27000 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Nghost:          14566 ave       14566 max       14566 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Neighs:         351000 ave      351000 max      351000 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 702000
Ave neighs/atom = 13
Neighbor list builds = 0
Dangerous builds = 0



log.gpu

LAMMPS (2 Aug 2023)
Lattice spacing in x,y,z = 2.86 2.86 2.86
Created orthogonal box = (0 0 0) to (2.86 2.86 2.86)
  1 by 1 by 2 MPI processor grid
Created 2 atoms
  using lattice units in orthogonal box = (0 0 0) to (2.86 2.86 2.86)
  create_atoms CPU = 0.000 seconds
Replication is creating a 30x30x30 = 27000 times larger system...
  orthogonal box = (0 0 0) to (85.8 85.8 85.8)
  1 by 1 by 2 MPI processor grid
  54000 atoms
  replicate CPU = 0.002 seconds

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

WARNING: No fixes with time integration, atoms won't move (../verlet.cpp:60)

--------------------------------------------------------------------------
- Using acceleration for lj/cut:
-  with 2 proc(s) per device.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 3090, 82 CUs, 17/24 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.

Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 4.5
  ghost atom cutoff = 4.5
  binsize = 2.25, bins = 39 39 39
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/gpu, perpetual
      attributes: full, newton off
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001
Per MPI rank memory allocation (min/avg/max) = 13.68 | 13.68 | 13.68 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   0             -3726.0488      0             -3726.0488     -18820.609    
Loop time of 1.1825e-06 on 2 procs for 0 steps with 54000 atoms

0.0% CPU use with 2 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0          | 0          | 0          |   0.0 |  0.00
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0          | 0          | 0          |   0.0 |  0.00
Output  | 0          | 0          | 0          |   0.0 |  0.00
Modify  | 0          | 0          | 0          |   0.0 |  0.00
Other   |            | 1.183e-06  |            |       |100.00

Nlocal:          27000 ave       27000 max       27000 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Nghost:          14566 ave       14566 max       14566 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:       702000 ave      702000 max      702000 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 1404000
Ave neighs/atom = 26
Neighbor list builds = 0
Dangerous builds = 0


---------------------------------------------------------------------
      Device Time Info (average): 
---------------------------------------------------------------------
Average split:   1.0000.
Lanes / atom:    4.
Vector width:    32.
Prefetch mode:   None.
Max Mem / Proc:  7.71 MB.
CPU Cast/Pack:   0.0002 s.
CPU Driver_Time: 0.0000 s.
CPU Idle_Time:   0.0001 s.
---------------------------------------------------------------------

Total wall time: 0:00:01

output of ./lmp_gpu -h

Large-scale Atomic/Molecular Massively Parallel Simulator - 2 Aug 2023

Usage example: ./lmp_gpu -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:

-echo none/screen/log/both  : echoing of input script (-e)
-help                       : print this help message (-h)
-in none/filename           : read input from file or stdin (default) (-i)
-kokkos on/off ...          : turn KOKKOS mode on or off (-k)
-log none/filename          : where to send log output (-l)
-mdi '<mdi flags>'          : pass flags to the MolSSI Driver Interface
-mpicolor color             : which exe in a multi-exe mpirun cmd (-m)
-cite                       : select citation reminder style (-c)
-nocite                     : disable citation reminder (-nc)
-nonbuf                     : disable screen/logfile buffering (-nb)
-package style ...          : invoke package command (-pk)
-partition size1 size2 ...  : assign partition sizes (-p)
-plog basename              : basename for partition logs (-pl)
-pscreen basename           : basename for partition screens (-ps)
-restart2data rfile dfile ... : convert restart to data file (-r2data)
-restart2dump rfile dgroup dstyle dfile ... 
                            : convert restart to dump file (-r2dump)
-reorder topology-specs     : processor reordering (-r)
-screen none/filename       : where to send screen output (-sc)
-skiprun                    : skip loops in run and minimize (-sr)
-suffix gpu/intel/opt/omp   : style suffix to apply (-sf)
-var varname value          : set index style variable (-v)

OS: Linux "Ubuntu 22.04.2 LTS" 5.15.0-86-generic x86_64

Compiler: GNU C++ 11.4.0 with OpenMP not enabled
C++ standard: C++11
MPI v3.1: Intel(R) MPI Library 2021.6 for Linux* OS


Accelerator configuration:

GPU package API:
GPU package precision: mixed

Compatible GPU present: yes

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Available compression formats:

Extension: .gz     Command: gzip
Extension: .bz2    Command: bzip2
Extension: .zst    Command: zstd
Extension: .xz     Command: xz
Extension: .lzma   Command: xz
Extension: .lz4    Command: lz4


Installed packages:

GPU MANYBODY MEAM REPLICA 

My money is on the mixed precision of the GPU package. Try to recompile with double precision and see if the difference persists.
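
For a CMake build of this LAMMPS version, that would look roughly like the following (a sketch, not tested here; GPU_ARCH=sm_86 assumes the RTX 3090 from your log, and the other packages you use would need to be re-enabled as well):

# rebuild the GPU library in full double precision instead of the default mixed
cd lammps
mkdir build-gpu-double && cd build-gpu-double
cmake ../cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_86 -D GPU_PREC=double
cmake --build . -j

With that build, ./lmp_gpu -h should report "GPU package precision: double" instead of "mixed".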


As was already noted, you are comparing a result computed in full double precision on the CPU with a computation where the individual forces are computed in single precision and only accumulated in double precision on the GPU.

It will never be exactly equal. That is because you have different implementations of the force kernels, use different compiles, have different hardware, a different number of parallel work units and use floating point math, which does not commute, i.e. the exact result of a sum is dependent on the order in which values are summed. Thus also you will get diverging trajectories also when you are using a different number of MPI ranks. For a computation in full double precision, however, this difference is very small (order of 1.0e-15) and thus the divergence of trajectories can take several thousands of MD steps to become visible. With (partial) single precision computation, the difference is of the order of 1.0-e7 which is exactly what you are seeing.