LAMMPS Benchmarking memory allocation issue

Hi all! I am benchmarking the following LJ melt input script, which creates 256 million atoms, using the KOKKOS package on Perlmutter at NERSC.

# 3d Lennard-Jones melt

# define index variables x, y, z with default value 1 (overridable via -var)
variable        x index 1
variable        y index 1
variable        z index 1

# define the box dimensions (in lattice cells)
variable        xx equal 400*$x
variable        yy equal 400*$y
variable        zz equal 400*$z

# set the unit system to reduced Lennard-Jones units;
# atom_style atomic means atoms have no bonds, angles, or other molecular features
units           lj
atom_style      atomic

# fcc lattice at reduced density 0.8442
lattice         fcc 0.8442
region          box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box      1 box
create_atoms    1 box
mass            1 1.0

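# initialize velocities at reduced temperature 1.44 (random seed 87287)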
velocity        all create 1.44 87287 loop geom

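# Lennard-Jones potential with a 2.5 sigma cutoff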
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

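# 0.3 skin distance, binned neighbor lists, rebuilt every 20 steps without distance checks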
neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

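# constant-NVE time integration for all atoms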
fix             1 all nve

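# run for 1000 timesteps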
run             1000

This was run on a GPU compute node with the following hardware setup:

  • 1× AMD EPYC 7763
  • 4× NVIDIA A100
  • 4× HPE Slingshot 11

Only one GPU is used. The NERSC Docker image of LAMMPS was used; the version is from 2 August 2023:

image docker:nersc/lammps_all:23.08

The issue I have encountered is a memory allocation failure for 256 million atoms; the error output is the following:

lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /opt/lammps/install/lib/liblammps.so.0)
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /lib/x86_64-linux-gnu/libhdf5_serial.so.103)
LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
  will use up to 1 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (671.83848 671.83848 671.83848)
  1 by 1 by 1 MPI processor grid
Created 256000000 atoms
  using lattice units in orthogonal box = (0 0 0) to (671.83848 671.83848 671.83848)
  create_atoms CPU = 15.160 seconds
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 20 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 2.8, bins = 240 240 240
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/kk, perpetual
      attributes: half, newton on, kokkos_device
      pair build: half/bin/newton/kk/device
      stencil: half/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
Exception: Kokkos failed to allocate memory for label "atom:f".  Allocation using MemorySpace named "Cuda" failed with the following error:  Allocation of size 5.779 G failed, likely due to insufficient memory.  (The allocation mechanism was cudaMalloc().  The Cuda allocation returned the error code ""cudaErrorMemoryAllocation".)

MPICH ERROR [Rank 0] [job id 24317905.0] [Fri Apr 12 19:27:05 2024] [nid003633] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

srun: error: nid003633: task 0: Exited with exit code 1
srun: Terminating StepId=24317905.0

When it says "Allocation of size 5.779 G failed", what does the unit G stand for? Gigabytes? If so, this error doesn't make sense to me: the A100 GPU that I used on Perlmutter has 40 GB of memory, so what really is the memory issue? Does the allocation size exceed 40 GB of memory? Is it possible for me to estimate the amount of memory required for 256 million atoms? The problem disappears when I change my input script to simulate a smaller number of atoms, such as 13.5 million.

The error does make sense. It indicates that the allocation of 5 gigabytes more memory is failing.
Memory allocation is incremental in C++ code, and from the incremental allocation requests you only know how much more memory is requested, not how much was already in use. Please also note that this is about requesting address space and not necessarily RAM. But I don't want to get into a detailed discussion of memory management.
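A rough back-of-the-envelope estimate is possible, though, assuming the usual LAMMPS layout of one double per force component: the failed allocation "atom:f" is the per-atom force array, so it needs about 256,000,000 atoms × 3 components × 8 bytes ≈ 6.14 GB ≈ 5.72 GiB. That is close to the reported 5.779 G, which suggests "G" means binary gigabytes (GiB), with the small remainder presumably due to ghost atoms. Positions and velocities each take roughly the same again, and the half neighbor list shown in your log is far larger: at reduced density 0.8442 with a 2.8 neighbor cutoff there are roughly 40 stored neighbors per atom, i.e. about 256,000,000 × 40 × 4 bytes ≈ 41 GB of neighbor indices alone. So 256 million atoms plausibly need well over the 40 GB of a single A100, and the run aborts at whichever allocation first pushes past that limit.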

You can try to run on the CPU. LAMMPS will output a memory use number, but that is a lower bound.
You can get more information about the largest memory use from the "info" command when it is issued after a run.
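For example, appending something like this after the run command should print that summary at the end (a sketch; "memory" is one of the categories the info command accepts):

# print LAMMPS' own accounting of memory usage after the run
info            memory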

How much memory is used by KOKKOS on the GPU is still somewhat uncertain, because not all data allocated on the CPU needs to be transferred to the GPU, the memory use for neighbor lists may be different, and other data may be stored in different ways to allow faster access.


These are really helpful! Thank you very much.

You can also use Kokkos Tools to profile memory; my favorite tool is space-time-stack: kokkos-tools/profiling/space-time-stack at develop · kokkos/kokkos-tools · GitHub. You just need to build the tool using the Makefile, then export KOKKOS_TOOLS_LIBS=/path/to/library.so.
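For example, something like this in the job script (a sketch: the library path, input file name, and launch line are placeholders to adapt to your setup; the space-time-stack Makefile typically produces kp_space_time_stack.so):

export KOKKOS_TOOLS_LIBS=/path/to/kp_space_time_stack.so
srun -n 1 lmp -k on g 1 -sf kk -in in.lj

Kokkos loads the tool at initialization, and its report, including memory high-water marks, is printed when the run finishes.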

But Axel is correct about “the allocation of 5 gigabytes more memory is failing”.