Failure to utilize GPUs at NERSC

Hi all, I am currently doing some testing on NERSC's GPU nodes. I decided to use a really simple LAMMPS script that models the movement of a million argon atoms under a short-range LJ potential. Everything ran well on a NERSC CPU node. However, when I tried to use the GPUs at NERSC, I ran my LAMMPS script on the GPU nodes with both the GPU package and the KOKKOS package, and both ran into issues.

For the GPU package, I ran into the following error:
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information avail>
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information avail>
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:9>
using 1 OpenMP thread(s) per MPI task
package gpu 1
ERROR: Invalid OpenCL platform ID. (src/GPU/gpu_extra.h:77)
Last command: package gpu 1
srun: error: nid001357: task 0: Exited with exit code 1
srun: Terminating StepId=24183628.0
NERSC has three docker images for LAMMPS:
perlmutter docker READY 78c9bbb876 2023-10-11T15:35:48 nersc/lammps_all:23.08
perlmutter docker READY 1265e04cff 2023-12-05T12:20:10 nersc/lammps_allegro:23.08
perlmutter docker READY a546b186a4 2023-09-19T12:14:13 nersc/lammps_lite:23.08
I was wondering: is it because these Docker images don't contain the GPU package? Should I build my own LAMMPS from source on NERSC to include the GPU package?

For the KOKKOS package, I ran into the following warning:
WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)
WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:178)
I have read the previous posts on this forum regarding the same issue, and I understand that this is not an error, just a warning that the GPUs are not being used efficiently. However, when I looked at my GPU usage after running this, none of my GPU compute time had been used. So it seems all the work has been pushed back onto the CPU of the GPU node. How can I address this issue?

My LAMMPS input file is the following:
#  define units
units       lj

#  specify periodic boundary conditions
boundary p p p

#  define atom_style
#  full covers everything
atom_style  full 

#  define simulation volume 
#  If I want N = 1,000,000 atoms 
#  and I want a density of rho = 0.5 atoms/lj-sigma^3
#  Then I can determine the size of a cube by 
#  size = (N/rho)^(1/3)
variable side      equal 125
region      boxid block 0.0 ${side} 0.0 ${side} 0.0 ${side}
create_box  1 boxid

#  specify initial positions of atoms
#  sc = simple cubic
#  0.5 = density in lj units
lattice     sc 0.50

#  place atoms of type 1 in boxid
create_atoms    1 box

#   define mass of atom type 1
mass        1 1.0

#  specify initial velocity of atoms
#  group = all
#  reduced temperature is T = 1.0 = lj-eps/kb 
#  seed for random number generator
#  distribution is gaussian (e.g. Maxwell-Boltzmann)
velocity    all create 1.0 87287 dist gaussian

#  specify interaction potential
#  pairwise interaction via the Lennard-Jones potential with a cut-off at 2.5 lj-sigma
pair_style  lj/cut 2.5

#  specify parameters between atoms of type 1 with an atom of type 1
#  epsilon = 1.0, sigma = 1.0, cutoff = 2.5
pair_coeff  1 1 1.0 1.0 2.5

# add long-range tail correction
pair_modify tail yes

#  specify parameters for neighbor list 
#  rnbr = rcut + 0.3
neighbor    0.3 bin

#  specify thermodynamic properties to be output
#  pe = potential energy
#  ke = kinetic energy
#  etotal = pe + ke
#  temp = temperature
#  press = pressure
#  density = number density
#  output every 100 steps
#  norm = normalize by # of atoms (yes or no)
#  step means the current simulation step 
thermo_style custom step pe ke etotal temp press density

# report instantaneous thermo values every 100 steps
thermo 100

# normalize thermo properties by number of atoms (yes or no)
thermo_modify norm no

# minimize the system energy using conjugate gradient
# the first two values are the energy and force tolerances; the last two are the
# maximum number of iterations (100) and of force/energy evaluations (1000)
min_style cg
minimize 1e-4 1e-6 100 1000

#  specify ensemble
#  fixid = 1
#  atoms = all
#  ensemble = nve or nvt
#  simulate under constant number of particles, volume, and energy
#  1 is an identifier for this particular fix
fix     1 all nve

#  define time step (in reduced LJ time units here, since units = lj)
timestep 0.005

# run 1000 steps in the NVE ensemble
# (this equilibrates positions) without temperature control 
run 1000

#  stop fix with given fixid
#  fixid = 1
unfix 1

#  specify ensemble
#  fixid = 2
#  atoms = all
#  ensemble = nvt
#  temp = temperature
#  initial temperature = 1.0
#  final temperature = 1.0
#  thermostat controller gain = 0.1 (units of time, bigger is less tight control)
fix     2 all nvt temp 1.0 1.0 0.1 

# run 1000 steps in the NVT ensemble
# (this equilibrates thermostat)
run     1000

#   save configurations
#   dumpid = 1
#   all atoms
#   atomic symbol is Ar
#   save positions every 100 steps
#   filename = output.xyz
#   this is when the actual data acquisition begins
dump    1       all xyz 100 outputcpu.xyz
dump_modify 1 element Ar

# run 2000 more steps in the NVT ensemble
# (this is data production, from which configurations are saved) 
run     2000

There are multiple problems with your post:

  • it is nearly unreadable because you are not quoting correctly. Please see the "guidelines and suggestions" post to learn how to do it properly
  • we don’t know which machine at NERSC you are using and what the hardware specs are. If you want specific help you need to provide this information
  • you don’t specify what command line you are using
  • we can only infer a few things about how LAMMPS was compiled; it would be extremely helpful if you could provide us with the output of "lmp -h" (see the sketch below for one way to get it).
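If you are not sure how to get that from the Shifter image, something along these lines should work from a login or compute node (an untested sketch; substitute whichever image you actually used):

# print the LAMMPS version, accelerator configuration, and installed packages from inside the container
shifter --image=docker:nersc/lammps_all:23.08 lmp -h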

Since you are using pre-compiled docker images, you may need to contact NERSC staff about any specifics there.

Also, rather than rolling your own test input, you should begin with the example inputs in the LAMMPS “bench” folder, starting with in.lj, then in.eam and in.rhodo.
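Those benchmark inputs are small by default, but they can be scaled up from the command line. For example, something along these lines (a sketch; the -var multipliers assume the stock bench/in.lj, which defines x/y/z index variables to replicate its 32,000-atom box, so 3x3x3 gives about 864,000 atoms):

# run the stock LJ benchmark replicated 3x3x3 with KOKKOS on one GPU
lmp -k on g 1 -sf kk -pk kokkos newton on neigh half -var x 3 -var y 3 -var z 3 -in in.lj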

I have read the previous posts on this forum regarding the same issue, and I understand that this is not an error, just a warning that the GPUs are not being used efficiently. However, when I looked at my GPU usage after running this, none of my GPU compute time had been used. So it seems all the work has been pushed back onto the CPU of the GPU node. How can I address this issue?

By NERSC do you mean Perlmutter? Can you post your job submission script for KOKKOS?

Sorry about the bad formatting!
The machine that I am using is Perlmutter. I am only using one GPU compute node, which contains the following:

  • 1*AMD EPYC 7763
  • 4*NVIDIA A100
  • 4*HPE Slingshot 11

The job submission script that includes the command line is the following:


#!/bin/bash -l
#SBATCH --image docker:nersc/lammps_all:23.08
#SBATCH -C gpu
#SBATCH -t 01:00:00
#SBATCH -J LAMMPSLJUTK_GPU
#SBATCH -o LAMMPSLJUTK_GPU.o%j
#SBATCH -A m1338
#SBATCH -N 1
#SBATCH -c 32
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH -q regular

exe=lmp
input= "-k on g 1 -sf kk -pk kokkos newton on neigh half -in in.ljfromutkgpuv"

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

command="srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp $input"


I will provide the output of “lmp -h” shortly.

The NERSC staff just said that they didn't include the GPU package in the Docker image. I guess if I want to use the GPU package, I will have to build LAMMPS from source on NERSC myself to include it.

Thank you for your suggestion. I will try those inputs from the LAMMPS "bench" folder for testing the GPUs with the KOKKOS package, which is included in the Docker image.

Yes.
The job submission script for the KOKKOS package is copied in my response to Axel's reply.
I am still lost on how to make sure the GPUs are actually utilized when using the KOKKOS package, given the warning

Fix with atom-based arrays not compatible with sending data in Kokkos communication.

Thank you!

When I attempted to run "lmp -h", a "command not found" error was returned. Maybe the 'lmp' executable is named differently in the Docker image on Perlmutter. I have followed up with the NERSC staff about this.

WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)

This only happens on reneighboring steps (say every 10 or so steps) and shouldn't hurt performance that much.

WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:178)

This only happens every 1k steps and should have virtually no effect on performance.

So I think you are fine to ignore these warnings for now.

This would happen when the GPU package was compiled for OpenCL, but no OpenCL loader is available on the compute node. You need to discuss this with the NERSC admins.

It does, or else you would get a different error. It just isn't compiled for use with CUDA.

If you build your own executable, you know exactly what is included and how LAMMPS is configured. To make it work correctly for GPUs (be it with the GPU package or the KOKKOS package) requires more effort and careful attention to the details of the documentation than compiling it for CPUs only.
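For orientation only (not a Perlmutter-specific recipe; the Build_extras section of the manual has the authoritative details), a KOKKOS/CUDA build for A100 hardware is typically configured with CMake along these lines, assuming a CUDA toolkit and MPI compilers are already in your environment:

# out-of-source build with the KOKKOS package on the CUDA backend (A100 = sm_80)
mkdir build && cd build
cmake ../cmake -D PKG_KOKKOS=on -D Kokkos_ENABLE_CUDA=on -D Kokkos_ARCH_AMPERE80=on -D BUILD_MPI=on
cmake --build . -j 16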

Don't just be sorry; please edit your post and correct it. If there is useful information in your input deck, we can only see it once the formatting is fixed.

No. Your job submission script has been using exactly the “lmp” command. If it works for a run, it should work for the help flag.

I have talked to the NERSC admins, and they don't have the GPU package in their Docker image.

With regard to reformatting my post, I was having trouble editing it. There is a little edit button at the top, but once I click it, it brings me to a view of my post that I cannot actually edit.

With regard to the 'lmp' executable: I figured out what I did wrong. I was missing a line in my batch script for it to work. I have just run it, and I will update you with the result.

Thank you very much!

There also is an edit button at the bottom. The one on the top is for changing the subject and the category or the tags.

There is a little introductory Discourse tutorial that you should follow to learn the basics (and get a badge). It has been so long that I don't remember where the link is.

Taking Axel's advice, I used the benchmark LJ input file for the testing. It did not run into any of the fix-related trouble when using the KOKKOS package. However, I don't see any information on GPU utilization in the standard output file for the SLURM job, only information on CPU usage. Here is the standard output file for the SLURM job:

srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp -k on g 1 -sf kk -pk kokkos newton on neigh half -in in.lj
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /opt/lammps/install/lib/liblammps.so.0)
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /lib/x86_64-linux-gnu/libhdf5_serial.so.103)
LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
  will use up to 1 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
  1 by 1 by 1 MPI processor grid
Created 32000 atoms
  using lattice units in orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
  create_atoms CPU = 0.004 seconds
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 20 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 2.8, bins = 12 12 12
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/kk, perpetual
      attributes: half, newton on, kokkos_device
      pair build: half/bin/newton/kk/device
      stencil: half/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 7.225 | 7.225 | 7.225 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   1.44          -6.7733681      0             -4.6134356     -5.0197073    
       100   0.7574531     -5.7585055      0             -4.6223613      0.20726105   
Loop time of 0.0198547 on 1 procs for 100 steps with 32000 atoms

Performance: 2175807.130 tau/day, 5036.591 timesteps/s, 161.171 Matom-step/s
97.4% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.0024814  | 0.0024814  | 0.0024814  |   0.0 | 12.50
Neigh   | 0.0014331  | 0.0014331  | 0.0014331  |   0.0 |  7.22
Comm    | 0.014106   | 0.014106   | 0.014106   |   0.0 | 71.04
Output  | 3.2914e-05 | 3.2914e-05 | 3.2914e-05 |   0.0 |  0.17
Modify  | 0.00084393 | 0.00084393 | 0.00084393 |   0.0 |  4.25
Other   |            | 0.0009577  |            |       |  4.82

Nlocal:          32000 ave       32000 max       32000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:          19657 ave       19657 max       19657 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:    1.20283e+06 ave 1.20283e+06 max 1.20283e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 1202833
Ave neighs/atom = 37.588531
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:03

This was run on a GPU node that also has a CPU on it, so everything could have been run entirely on the CPU. Is there a way for me to know for sure that this run actually utilized the GPUs?

The KOKKOS package does not print out GPU utilization stats. As far as I can tell, you did run on the GPU, though. You can double-check using nvidia-smi on the compute node.
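If you want a record from the batch job itself, a rough (untested) sketch is to start an nvidia-smi logger in the background before the LAMMPS srun and stop it afterwards:

# log GPU utilization every 5 s on this node while LAMMPS runs
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > gpu_util.$SLURM_JOB_ID.log &
monitor_pid=$!

srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp $input

kill $monitor_pid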

Thanks! This is what I see at the bottom of my post; there is no edit button available there, though.

Thank you very much! I will go double check on that.

I still couldn't run the "lmp -h" command. The line I thought I was missing was

exe = lmp

But I don't think that affects my sbatch script at all, so the sbatch script that I used was the following:

#!/bin/bash -l
#SBATCH --image docker:nersc/lammps_all:23.08
#SBATCH -C gpu
#SBATCH -t 01:00:00
#SBATCH -J LAMMPSLJUTK_GPU_Info
#SBATCH -o LAMMPSLJUTK_GPU_Info.o%j
#SBATCH -A m1338
#SBATCH -N 1
#SBATCH -c 32
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH -q regular

exe=lmp
command="lmp -h"

# Run the command
echo $command

$command

The output I got out was the following:

lmp -h
/var/spool/slurmd/job24316428/slurm_script: line 21: lmp: command not found

Like you mentioned, I have used 'lmp' to run all my other files just fine, so I don't know what is going on here. Thanks!

Maybe you need to add “./” ?

./lmp -h


Your invocation isn't the same. Here you do:

lmp -h

But in the KOKKOS script you did:

exe=lmp
input= "-k on g 1 -sf kk -pk kokkos newton on neigh half -in in.ljfromutkgpuv"

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

command="srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp $input"

So if you use the same script but then replace “$input” with “-h”, you should be fine.
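In other words, keeping your srun/shifter wrapper and only swapping the arguments, something like this should print the help text:

command="srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp -h"
echo $command
$command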


Thank you Axel and Stan! I was able to run the command and get the output of "lmp -h". Here is the relevant output:

Large-scale Atomic/Molecular Massively Parallel Simulator - 2 Aug 2023
Git info (HEAD / stable_2Aug2023)

Usage example: lmp -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:

-echo none/screen/log/both  : echoing of input script (-e)
-help                       : print this help message (-h)
-in none/filename           : read input from file or stdin (default) (-i)
-kokkos on/off ...          : turn KOKKOS mode on or off (-k)
-log none/filename          : where to send log output (-l)
-mdi '<mdi flags>'          : pass flags to the MolSSI Driver Interface
-mpicolor color             : which exe in a multi-exe mpirun cmd (-m)
-cite                       : select citation reminder style (-c)
-nocite                     : disable citation reminder (-nc)
-nonbuf                     : disable screen/logfile buffering (-nb)
-package style ...          : invoke package command (-pk)
-partition size1 size2 ...  : assign partition sizes (-p)
-plog basename              : basename for partition logs (-pl)
-pscreen basename           : basename for partition screens (-ps)
-restart2data rfile dfile ... : convert restart to data file (-r2data)
-restart2dump rfile dgroup dstyle dfile ... 
                            : convert restart to dump file (-r2dump)
-reorder topology-specs     : processor reordering (-r)
-screen none/filename       : where to send screen output (-sc)
-skiprun                    : skip loops in run and minimize (-sr)
-suffix gpu/intel/opt/omp   : style suffix to apply (-sf)
-var varname value          : set index style variable (-v)

OS: Linux "Ubuntu 22.04.3 LTS" 5.14.21-150400.24.81_12.0.87-cray_shasta_c x86_64

Compiler: GNU C++ 11.4.0 with OpenMP 4.5
C++ standard: C++14
MPI v3.1: MPI VERSION    : CRAY MPICH version 8.1.22.12 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 09 12:31 2022 (git hash cfc6f82)

Accelerator configuration:

GPU package API: OpenCL
GPU package precision: mixed
KOKKOS package API: CUDA Serial
KOKKOS package precision: double
OPENMP package API: OpenMP
OPENMP package precision: double
INTEL package API: OpenMP
INTEL package precision: single mixed double

Compatible GPU present: no

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_JPEG
-DLAMMPS_EXCEPTIONS
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Available compression formats:

Extension: .gz     Command: gzip
Extension: .bz2    Command: bzip2
Extension: .zst    Command: zstd
Extension: .xz     Command: xz
Extension: .lzma   Command: xz
Extension: .lz4    Command: lz4


Installed packages:

AMOEBA ASPHERE AWPMD BOCS BODY BPM BROWNIAN CG-DNA CG-SPICA CLASS2 COLLOID 
COLVARS COMPRESS CORESHELL DIELECTRIC DIFFRACTION DIPOLE DPD-BASIC DPD-MESO 
DPD-REACT DPD-SMOOTH DRUDE EFF ELECTRODE EXTRA-COMPUTE EXTRA-DUMP EXTRA-FIX 
EXTRA-MOLECULE EXTRA-PAIR FEP GPU GRANULAR H5MD INTEL INTERLAYER KIM KOKKOS 
KSPACE LEPTON MACHDYN MANIFOLD MANYBODY MC MEAM MESONT MGPT MISC ML-IAP ML-POD 
ML-QUIP ML-SNAP MOFFF MOLECULE MOLFILE MPIIO MSCG OPENMP OPT ORIENT PERI 
PHONON PLUGIN POEMS PTM PYTHON QEQ QMMM QTB REACTION REAXFF REPLICA RIGID 
SHOCK SMTBQ SPH SPIN SRD TALLY UEF VORONOI YAFF 

So the GPU package is actually installed. Perhaps I need to change my command line to make use of it?
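(For what it is worth, my understanding is that the GPU package would be requested with flags like the ones below, though given the OpenCL platform error earlier in this thread it may still fail in this image:)

# tentative GPU-package run: apply the gpu suffix styles and use 1 GPU per node
srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp -sf gpu -pk gpu 1 -in in.lj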

This tells you two things:

  • the user support person you talked to doesn't know much about how LAMMPS was compiled and has been giving you incorrect information. Perhaps you need to talk to someone else.
  • the GPU support is built with OpenCL and not CUDA (that is a compile-time choice), and we have already established earlier in this thread that OpenCL support is not properly configured on the compute nodes you are using. So either that gets fixed, or you need to compile LAMMPS yourself and configure the GPU package to use CUDA instead of OpenCL (see the configuration sketch below).
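If you go the self-compile route for the GPU package, the relevant CMake settings would be along the lines below (a sketch only; load a CUDA module first and check the Build_extras section of the manual for the full list of options):

# configure the GPU package against CUDA instead of OpenCL (A100 = sm_80, mixed precision)
cmake ../cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_80 -D GPU_PREC=mixed
cmake --build . -j 16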

Thank you very much!