Dear developers and users,
I’m trying to run parallel tempering with a DeePMD potential on multiple GPUs.
Before compilation, I loaded the following modules on the Bridges2 cluster:
module load openmpi/4.0.5-nvhpc22.9 nvhpc/22.9 cuda/11.7.1 gcc/10.2.0
The compilation process for LAMMPS (2 Aug 2023) was as follows. I went to the directory lammps-stable_2Aug2023_update2/lib/gpu and changed two lines of Makefile.linux:
CUDA_HOME = /opt/packages/cuda/v11.7.1
CUDA_ARCH = -arch=sm_70
Then I built the GPU library and moved to the source directory:
make -f Makefile.linux
cd ../../src
I enabled the following packages (the enabling commands are sketched after this list):
- deepmd
- replica
- GPU
- extra-fix
- kspace
- openmp
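For reference, a minimal sketch of how these packages can be enabled from the src/ directory. REPLICA, GPU, EXTRA-FIX, KSPACE and OPENMP ship with LAMMPS; the deepmd package comes from my DeePMD-kit installation, so the directory name below (USER-DEEPMD) is an assumption that depends on how DeePMD-kit patched the LAMMPS source tree:
# inside lammps-stable_2Aug2023_update2/src
make yes-replica      # provides the temper command for parallel tempering
make yes-gpu          # GPU package, uses the library built in lib/gpu above
make yes-extra-fix
make yes-kspace
make yes-openmp
# deepmd is supplied by DeePMD-kit; assuming the source-patch route, the
# USER-DEEPMD directory is copied into src/ and then enabled with:
make yes-user-deepmd
make package-status   # list which packages are currently enabled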
Then I compiled with the command:
make mpi -j4 CUDA_ARCH="-arch=sm_70"
Launch: my launch command is:
mpirun -n 20 /ocean/projects/dmr200038p/szhange/gpu/lammps-stable_2Aug2023_update2/src/lmp_mpi -partition 20x1 -sf gpu -pk gpu 4 -in in.remd_Pt110
I’m trying to use 4 GPUs (the GPU-shared partition gives 5 CPU cores per GPU) to run a parallel tempering simulation with 20 replicas, which requires 20 MPI ranks. My assumption about how the ranks share the GPUs is sketched below, followed by my input file in.remd_Pt110.
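To make that assumption explicit, here is a hypothetical Open MPI wrapper (gpu_bind.sh is my own name, not something in my current job) that would pin each rank to one of the 4 GPUs round-robin; with -pk gpu 4 I am instead relying on the GPU package to make an equivalent assignment internally:
#!/bin/bash
# gpu_bind.sh -- hypothetical wrapper: one visible GPU per MPI rank, round-robin over 4 GPUs
# assumes Open MPI, which exports OMPI_COMM_WORLD_LOCAL_RANK for every launched rank
export CUDA_VISIBLE_DEVICES=$(( OMPI_COMM_WORLD_LOCAL_RANK % 4 ))
exec "$@"
If each rank were restricted to a single visible GPU this way, I believe the launch line would become:
mpirun -n 20 ./gpu_bind.sh /ocean/projects/dmr200038p/szhange/gpu/lammps-stable_2Aug2023_update2/src/lmp_mpi -partition 20x1 -sf gpu -pk gpu 1 -in in.remd_Pt110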
# **********************************Initialization****************************
units metal
boundary p p p
atom_style atomic
neigh_modify delay 5 every 5 check yes
variable Q world 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
pair_style deepmd compress_Pt110_160.pb
# ***********************************read_data********************************
read_data Pt110.data
variable t world 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0 600.0 620.0 640.0 660.0 680.0
pair_coeff * *
# *******************************Define pair styles***************************
thermo 0
thermo_style custom step temp etotal pe ke vol press lx ly lz
thermo_modify flush yes
timestep 0.002
velocity all create 300.0 4928459
variable STEP equal step
variable TEMP equal temp
variable ETOTAL equal etotal
variable PE equal pe
variable KE equal ke
variable VOL equal vol
variable PRESS equal press
variable LX equal lx
variable LY equal ly
variable LZ equal lz
variable PXX equal pxx
variable PYY equal pyy
variable PZZ equal pzz
fix thermo_output all print 2000 "${STEP} ${TEMP} ${ETOTAL} ${PE} ${KE} ${VOL} ${PRESS} ${LX} ${LY} ${LZ} ${PXX} ${PYY} ${PZZ}" file thermo.$Q.lammps title "#step temp etotal pe ke vol press lx ly lz pxx pyy pzz"
fix COM all momentum 1 linear 1 1 1 angular
fix myfix all nvt temp $t $t 0.1
temper 500000 1000 $t myfix 36312 12122
write_data Pt_110.$Q.data
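As a side note on the input: the temper line runs 500000 timesteps and attempts a replica swap every 1000 steps at the per-replica temperature $t, using fix myfix as the thermostat. The 20 values in the variable t world line are simply 300 K to 680 K in 20 K steps; a throwaway shell one-liner (not part of the LAMMPS run) that reproduces the list:
seq 300 20 680 | awk '{printf "%.1f ", $1}'; echo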
My script file is:
#!/bin/bash
#SBATCH -o Pt110_remd.o%j
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH --gpus=v100-32:4
#SBATCH -J Pt110_remd
#SBATCH --time=0:30:00
module load gcc/10.2.0
module load nvhpc/22.9
module load openmpi/4.0.5-nvhpc22.9
module load cuda/11.7.1
module load python/3.8.6
module load mkl/2020.4.304
echo "SLURM_NTASKS: " $SLURM_NTASKS
ulimit -n 2048
#export OMP_NUM_THREADS=1
nvidia-smi -l 10 -f nvidia-smi-output-$SLURM_JOB_ID.txt &
mpirun -n 20 /ocean/projects/dmr200038p/szhange/gpu/lammps-stable_2Aug2023_update2/src/lmp_mpi -partition 20x1 -sf gpu -pk gpu 4 -in in.remd_Pt110
kill %1
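For debugging the GPU assignment, a quick diagnostic I can run in the same allocation (assuming Open MPI, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_LOCAL_RANK) to see what each rank is given:
# diagnostic only, not part of the production script above
mpirun -n 20 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK} (local ${OMPI_COMM_WORLD_LOCAL_RANK}): CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'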
The log.lammps, log.lammps.0, …, files generated under the working directory are all empty, but the job kept running until it reached the time limit. Some lines from the output file Pt110_remd.o24760639 are:
2024-07-25 15:21:48.889075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 28675 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0
2024-07-25 15:21:48.890365: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
...
>>> Info of model(s):
using 1 model(s): compress_Pt110_160.pb
rcut in model: 7
ntypes in model: 1
2024-07-25 15:21:50.592250: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 4.2KiB (4352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:50.594755: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
>>> Info of model(s):
using 1 model(s): compress_Pt110_160.pb
rcut in model: 7
ntypes in model: 1
2024-07-25 15:21:50.596132: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 8.5KiB (8704 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
...
2024-07-25 15:21:56.103233: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:56.104471: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:56.110743: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:56.111981: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
The output file Pt110_remd.o24760639 kept printing the following line until the job ended due to the time limit:
I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I used 4 Tesla V100-SXM2 GPUs with 32 GB of memory each, with 5 CPU cores per GPU.
The output from nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:15:00.0 Off | 0 |
| N/A 28C P0 53W / 300W | 32499MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2-32GB On | 00000000:8A:00.0 Off | 0 |
| N/A 31C P0 54W / 300W | 2775MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2-32GB On | 00000000:B2:00.0 Off | 0 |
| N/A 27C P0 53W / 300W | 2775MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2-32GB On | 00000000:B3:00.0 Off | 0 |
| N/A 32C P0 55W / 300W | 2775MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 8987 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 0 N/A N/A 8988 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 0 N/A N/A 8989 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 0 N/A N/A 8990 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 0 N/A N/A 8991 C ...stable_2Aug2023_update2/src/lmp_mpi 350MiB |
| 0 N/A N/A 8992 C ...stable_2Aug2023_update2/src/lmp_mpi 404MiB |
| 0 N/A N/A 8993 C ...stable_2Aug2023_update2/src/lmp_mpi 29022MiB |
| 0 N/A N/A 8994 C ...stable_2Aug2023_update2/src/lmp_mpi 1332MiB |
| 1 N/A N/A 8987 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8988 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8989 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8990 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8991 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8992 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8993 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 1 N/A N/A 8994 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8987 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8988 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8989 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8990 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8991 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8992 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8993 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 2 N/A N/A 8994 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8987 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8988 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8989 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8990 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8991 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8992 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8993 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
| 3 N/A N/A 8994 C ...stable_2Aug2023_update2/src/lmp_mpi 346MiB |
+---------------------------------------------------------------------------------------+
I thought this might be a GPU memory issue, so I tried using 8 GPUs, but the problem is the same.
The GPUs are not actually being used for the simulation; the job just keeps running without stopping, and nothing is written to the log.lammps files.
I don’t know how to solve this issue. Could someone please offer suggestions or guidance? Any suggestions or responses will be appreciated! Thanks in advance!
Kevin