Hi all, I am currently doing some testing on NERSC’s GPU nodes. I decided to use a really simple LAMMPS script that models the motion of a million argon atoms under a short-range LJ potential. Everything ran well on a NERSC CPU node. However, in an attempt to use the GPUs at NERSC, I ran my LAMMPS script on the GPU nodes with both the GPU package and the KOKKOS package, and both ran into issues.
For the GPU package, I ran into the following error:
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information avail>
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information avail>
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:9>
using 1 OpenMP thread(s) per MPI task
package gpu 1
ERROR: Invalid OpenCL platform ID. (src/GPU/gpu_extra.h:77)
Last command: package gpu 1
srun: error: nid001357: task 0: Exited with exit code 1
srun: Terminating StepId=24183628.0
NERSC has three docker images for LAMMPS:
perlmutter docker READY 78c9bbb876 2023-10-11T15:35:48 nersc/lammps_all:23.08
perlmutter docker READY 1265e04cff 2023-12-05T12:20:10 nersc/lammps_allegro:23.08
perlmutter docker READY a546b186a4 2023-09-19T12:14:13 nersc/lammps_lite:23.08
I was wondering: is it because these Docker images don’t contain the GPU package? Should I build my own LAMMPS from source on NERSC to include the GPU package?
For the KOKKOS package, I ran into the following warning:
WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)
WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:178)
I have read the previous posts on this forum regarding the same issue, and I understand that this is not an error, just a sign that the GPUs are not being used efficiently. However, after I looked at my GPU time after running this, I found that none of my GPU compute time had been used. So it seems all the work was pushed back onto the CPU on the GPU node. How can I address this issue?
My LAMMPS input file is the following:
# define units
units lj
# specify periodic boundary conditions
boundary p p p
# define atom_style
# full covers everything
atom_style full
# define simulation volume
# If I want N = 1,000,000 atoms
# and I want a density of rho = 0.5 atoms/lj-sigma^3
# Then I can determine the size of a cube by
# size = (N/rho)^(1/3)
variable side equal 125
region boxid block 0.0 ${side} 0.0 ${side} 0.0 ${side}
create_box 1 boxid
# specify initial positions of atoms
# sc = simple cubic
# 0.5 = density in lj units
lattice sc 0.50
# place atoms of type 1 in boxid
create_atoms 1 box
# define mass of atom type 1
mass 1 1.0
# specify initial velocity of atoms
# group = all
# reduced temperature is T = 1.0 = lj-eps/kb
# seed for random number generator
# distribution is gaussian (e.g. Maxwell-Boltzmann)
velocity all create 1.0 87287 dist gaussian
# specify interaction potential
# pairwise interaction via the Lennard-Jones potential with a cut-off at 2.5 lj-sigma
pair_style lj/cut 2.5
# specify parameters between atoms of type 1 with an atom of type 1
# epsilon = 1.0, sigma = 1.0, cutoff = 2.5
pair_coeff 1 1 1.0 1.0 2.5
# add long-range tail correction
pair_modify tail yes
# specify parameters for neighbor list
# rnbr = rcut + 0.3
neighbor 0.3 bin
# specify thermodynamic properties to be output
# pe = potential energy
# ke = kinetic energy
# etotal = pe + ke
# temp = temperature
# press = pressure
# density = number density
# output every 100 steps
# norm = normalize by # of atoms (yes or no)
# step means the current simulation step
thermo_style custom step pe ke etotal temp press density
# report instantaneous thermo values every 100 steps
thermo 100
# normalize thermo properties by number of atoms (yes or no)
thermo_modify norm no
# minimize the system energy using conjugate gradient
# the first two are energy and force tolerances; the other two are the maximum
# numbers of iterations and force evaluations (it will not exceed 1000 evaluations)
min_style cg
minimize 1e-4 1e-6 100 1000
# specify ensemble
# fixid = 1
# atoms = all
# ensemble = nve or nvt
# simulate under a constant number of particles, volume, and energy;
# 1 is an identifier for this particular fix
fix 1 all nve
# define the time step (in LJ reduced time units here, not ps)
timestep 0.005
# run 1000 steps in the NVE ensemble
# (this equilibrates positions) without temperature control
run 1000
# stop fix with given fixid
# fixid = 1
unfix 1
# specify ensemble
# fixid = 2
# atoms = all
# ensemble = nvt
# temp = temperature
# initial temperature = 1.0
# final temperature = 1.0
# thermostat controller gain = 0.1 (units of time, bigger is less tight control)
fix 2 all nvt temp 1.0 1.0 0.1
# run 1000 steps in the NVT ensemble
# (this equilibrates thermostat)
run 1000
# save configurations
# dumpid = 1
# all atoms
# atomic symbol is Ar
# save positions every 100 steps
# filename = output.xyz
# this is when the actual data acquisition begins
dump 1 all xyz 100 outputcpu.xyz
dump_modify 1 element Ar
# run 1000 more steps in the NVT ensemble
# (this is data production, from which configurations are saved)
run 2000
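As a side note on the input deck above, the box-side arithmetic in its comments can be verified with a quick calculation (a minimal sketch; N = 1,000,000 and rho = 0.5 are the values stated in the script's comments, and the script rounds the result down to 125):

```python
# Check the box-side arithmetic from the script's comments:
# side = (N / rho)^(1/3) for N atoms at number density rho.
N = 1_000_000      # target number of atoms
rho = 0.5          # number density in LJ units (atoms / sigma^3)

side = (N / rho) ** (1.0 / 3.0)
print(side)        # ~125.99, rounded down to 125 in the script
```

Note that with `lattice sc 0.50` the actual atom count is set by the lattice spacing within the box, so the final count is close to, but not exactly, one million.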
It is nearly unreadable because you are not quoting correctly. Please see the “guidelines and suggestions” post to learn how to do it properly.
We don’t know which machine at NERSC you are using or what the hardware specs are. If you want specific help, you need to provide this information.
You don’t specify what command line you are using.
We can only infer a few things about how LAMMPS was compiled; it would be extremely helpful if you could provide us with the output of “lmp -h”.
Since you are using pre-compiled docker images, you may need to contact NERSC staff about any specifics there.
Also, rather than rolling your own test input, you should begin with the example inputs in the LAMMPS “bench” folder, starting with in.lj, then in.eam and in.rhodo.
By NERSC do you mean Perlmutter? Can you post your job submission script for KOKKOS?
The NERSC staff just said that they didn’t include the GPU package in the Docker image. I guess if I want to use the GPU package, I have to build LAMMPS from source on NERSC myself to include it.
Thank you for your suggestion. I will try those inputs from the LAMMPS “bench” folder to test the GPUs using the KOKKOS package, which is included in the Docker image.
Yes.
The job submission script for the KOKKOS package is copied in the response to Axel’s reply.
I am still lost on how to make sure the GPUs are utilized when using the KOKKOS package, by addressing the problem of
Fix with atom-based arrays not compatible with sending data in Kokkos communication.
I attempted to run the command line “lmp -h”, but the error “command not found” was returned. Maybe the ‘lmp’ executable is named differently in the Docker image at Perlmutter. I have followed up with their staff about this.
WARNING: Fix with atom-based arrays not compatible with sending data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:666)
This only happens every reneighbor (say every 10 or so steps) and shouldn’t hurt performance that much.
WARNING: Fix with atom-based arrays not compatible with Kokkos sorting on device, switching to classic host sorting (src/KOKKOS/atom_kokkos.cpp:178)
This only happens every 1k steps and should have virtually no effect on performance.
So I think you are fine to ignore these warnings for now.
This would happen when the GPU package was compiled for OpenCL but no OpenCL loader is available on the compute node. You need to discuss this with the NERSC admins.
It does, or else you would get a different error. It just isn’t compiled for use with CUDA.
If you build your own executable, you know exactly what is included and how LAMMPS is configured. To make it work correctly for GPUs (be it with the GPU package or the KOKKOS package) requires more effort and careful attention to the details of the documentation than compiling it for CPUs only.
Don’t just be sorry; please edit your post and correct it. If there is useful information in your input deck, we can see it after you correct it.
No. Your job submission script has been using exactly the “lmp” command. If it works for a run, it should work for the help flag.
I have talked to the NERSC admins, and they don’t have the GPU package in their Docker image.
With regard to reformatting my post, I was having trouble editing it. There is a little edit button at the top; once I click it, it brings me to a view of my post, but I cannot edit it at all.
With regard to the ‘lmp’ executable: I figured out what I had done wrong; I was missing a line in my batch script for it to work. I have just rerun it and will update you with the result.
There also is an edit button at the bottom. The one on the top is for changing the subject and the category or the tags.
There is a little introductory Discourse tutorial that you should follow to learn the basics (and get a badge). It has been so long that I don’t remember where the link is.
Taking Axel’s advice, I used the benchmark LJ input file for the testing. It did not run into any trouble with the fix command using the KOKKOS package. However, I don’t see any information on GPU utilization in the standard output file for the SLURM job, only information on CPU usage. Here is the standard output file for the SLURM job:
srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter lmp -k on g 1 -sf kk -pk kokkos newton on neigh half -in in.lj
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /opt/lammps/install/lib/liblammps.so.0)
lmp: /opt/udiImage/modules/mpich/dep/libcurl.so.4: no version information available (required by /lib/x86_64-linux-gnu/libhdf5_serial.so.103)
LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
will use up to 1 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
1 by 1 by 1 MPI processor grid
Created 32000 atoms
using lattice units in orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
create_atoms CPU = 0.004 seconds
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 12 12 12
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: half, newton on, kokkos_device
pair build: half/bin/newton/kk/device
stencil: half/bin/3d
bin: kk/device
Setting up Verlet run ...
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 7.225 | 7.225 | 7.225 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6134356 -5.0197073
100 0.7574531 -5.7585055 0 -4.6223613 0.20726105
Loop time of 0.0198547 on 1 procs for 100 steps with 32000 atoms
Performance: 2175807.130 tau/day, 5036.591 timesteps/s, 161.171 Matom-step/s
97.4% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.0024814 | 0.0024814 | 0.0024814 | 0.0 | 12.50
Neigh | 0.0014331 | 0.0014331 | 0.0014331 | 0.0 | 7.22
Comm | 0.014106 | 0.014106 | 0.014106 | 0.0 | 71.04
Output | 3.2914e-05 | 3.2914e-05 | 3.2914e-05 | 0.0 | 0.17
Modify | 0.00084393 | 0.00084393 | 0.00084393 | 0.0 | 4.25
Other | | 0.0009577 | | | 4.82
Nlocal: 32000 ave 32000 max 32000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 19657 ave 19657 max 19657 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 1.20283e+06 ave 1.20283e+06 max 1.20283e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 1202833
Ave neighs/atom = 37.588531
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:03
This was run on a GPU node that also has a CPU, so everything could have been run on the CPU. Is there a way for me to know that this process has for sure utilized the GPUs?
The KOKKOS package does not print out GPU utilization stats. As far as I can tell, you did run on the GPU, though. You can double-check using nvidia-smi on the compute node.
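For example, one way to do this check is the following (a sketch only; how you get a shell on the running job's node is site-specific, and `--overlap`/`--jobid` are one Slurm-based option, not a NERSC-confirmed recipe):

```shell
# While the LAMMPS job is running, open a second shell on the same
# compute node (e.g. srun --overlap --jobid=<your-jobid> --pty bash on
# Slurm systems) and watch per-GPU utilization; a non-zero "sm" column
# means the GPU is actually doing compute work.
nvidia-smi dmon -s u

# Or take a one-shot snapshot of GPU utilization and memory in use:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
```

If the Kokkos run is really on the GPU, you should also see the lmp process listed in the plain `nvidia-smi` process table with some device memory allocated.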
The user support person you talked to doesn’t know much about how LAMMPS was compiled and has been giving you incorrect information; perhaps you need to talk to someone else.
The GPU support is using OpenCL, not CUDA (that is a compile-time choice), and we have already established earlier in this thread that OpenCL support is not properly configured on the compute nodes you are using. Alternatively, you can compile LAMMPS yourself and configure the GPU package to use CUDA instead of OpenCL.
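For reference, a CUDA build of the GPU package from source might be configured roughly like this (a configuration sketch, not a complete Perlmutter recipe: compiler modules, MPI wrappers, and install paths are site-specific, and `sm_80` assumes NVIDIA A100 GPUs; adjust for your hardware):

```shell
# From the top of an unpacked LAMMPS source tree:
mkdir build && cd build

# Configure the GPU package to use CUDA instead of OpenCL.
cmake ../cmake \
  -D PKG_GPU=on \
  -D GPU_API=cuda \
  -D GPU_ARCH=sm_80 \
  -D BUILD_MPI=on

make -j 8
```

After building, `lmp -h` will list the installed packages, so you can confirm GPU support is actually compiled in before submitting a job.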