Hi all! I am benchmarking the following LJ melt input script, which creates 256 million atoms, using the KOKKOS package on Perlmutter at NERSC.
# 3d Lennard-Jones melt
# defines a variable x with value 1
variable x index 1
variable y index 1
variable z index 1
# defines the box dimensions
variable xx equal 400*$x
variable yy equal 400*$y
variable zz equal 400*$z
# sets the unit system to Lennard-Jones (reduced) units
# atomic specifies that atoms do not have bonds, angles, or other molecular features
units lj
atom_style atomic
# fcc lattice at reduced density 0.8442
lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
run 1000
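(For reference, the box is 400 x 400 x 400 fcc unit cells with 4 atoms per cell, i.e. 4 * 400^3 = 256,000,000 atoms, which matches the "Created 256000000 atoms" line in the output below.)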
This was run on a GPU compute node with the following hardware setup:
- 1× AMD EPYC 7763
- 4× NVIDIA A100 (40 GB)
- 4× HPE Slingshot 11
Only one GPU is used. I ran the NERSC Docker image of LAMMPS; the LAMMPS version is 2 Aug 2023:
image docker:nersc/lammps_all:23.08
The issue I have encountered is a memory allocation failure for the 256 million atoms, and the error output is the following:
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /opt/lammps/install/lib/liblammps.so.0)
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /lib/x86_64-linux-gnu/libhdf5_serial.so.103)
LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
will use up to 1 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (671.83848 671.83848 671.83848)
1 by 1 by 1 MPI processor grid
Created 256000000 atoms
using lattice units in orthogonal box = (0 0 0) to (671.83848 671.83848 671.83848)
create_atoms CPU = 15.160 seconds
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 240 240 240
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: half, newton on, kokkos_device
pair build: half/bin/newton/kk/device
stencil: half/bin/3d
bin: kk/device
Setting up Verlet run ...
Unit style : lj
Current step : 0
Time step : 0.005
Exception: Kokkos failed to allocate memory for label "atom:f". Allocation using MemorySpace named "Cuda" failed with the following error: Allocation of size 5.779 G failed, likely due to insufficient memory. (The allocation mechanism was cudaMalloc(). The Cuda allocation returned the error code ""cudaErrorMemoryAllocation".)
MPICH ERROR [Rank 0] [job id 24317905.0] [Fri Apr 12 19:27:05 2024] [nid003633] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: error: nid003633: task 0: Exited with exit code 1
srun: Terminating StepId=24317905.0
The error says an allocation of size 5.779 G failed. What does the unit "G" stand for? Gigabytes? If so, the error doesn't make sense to me: the A100 GPU I used on Perlmutter has 40 GB of memory, so what is the real memory issue? Does the total allocation exceed the 40 GB of memory? Is it possible for me to estimate the amount of memory required for 256 million atoms? The problem disappears when I change my input script to simulate a smaller number of atoms, e.g. 13.5 million atoms.
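To try to answer my own last question, here is the rough back-of-envelope estimate I came up with (Python). The per-atom integer overhead and the 4-byte neighbor-index size are my own guesses rather than numbers taken from the LAMMPS/KOKKOS source, so please correct me if this is the wrong way to count:

# Rough back-of-envelope estimate (my own assumptions, not official LAMMPS numbers):
# double-precision per-atom arrays (x, v, f = 3*8 bytes each), some integer per-atom
# data, and a half neighbor list sized from the cutoff+skin sphere at this density.
import math

natoms = 256_000_000
rho    = 0.8442        # reduced density from "lattice fcc 0.8442"
rcut   = 2.5 + 0.3     # pair cutoff + neighbor skin

bytes_x_v_f = 3 * 3 * 8 * natoms                         # positions, velocities, forces
bytes_int   = 4 * 8 * natoms                             # tag/type/mask/image (order-of-magnitude guess)
neigh_per_atom = 0.5 * (4/3) * math.pi * rcut**3 * rho   # half-list pairs per atom
bytes_neigh = neigh_per_atom * 4 * natoms                # assuming 4-byte neighbor indices

total = bytes_x_v_f + bytes_int + bytes_neigh
print(f"force array alone : {3*8*natoms/2**30:.2f} GiB")   # comparable to the ~5.8 G in the error
print(f"estimated total   : {total/2**30:.1f} GiB")

If that estimate is anywhere near right, the positions/velocities/forces plus the neighbor list alone would already need well over 40 GB for 256 million atoms on a single GPU, but I would appreciate confirmation that this is the right way to think about it.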