Running Parallel Tempering MD Simulation with DeepMD potential on Multiple GPUs

Dear developers and users,

I'm trying to run parallel tempering with a DeepMD potential on multiple GPUs.

Before compilation, I loaded the following modules on the Bridges2 cluster:

module load openmpi/4.0.5-nvhpc22.9    nvhpc/22.9   cuda/11.7.1    gcc/10.2.0

The compilation process of LAMMPS (2 Aug 2023):

I went to the directory lammps-stable_2Aug2023_update2/lib/gpu and changed two lines of Makefile.linux:

CUDA_HOME = /opt/packages/cuda/v11.7.1
CUDA_ARCH = -arch=sm_70

Then

make -f Makefile.linux
cd ../../src

I enabled the following packages:

  • deepmd
  • replica
  • GPU
  • extra-fix
  • kspace
  • openmp

Then I compiled with the command:

make mpi -j4 CUDA_ARCH = -arch = sm_70
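
(I am not sure whether the spaces around the equals signs matter here; the more conventional form of the same command would be the line below.)

make mpi -j4 CUDA_ARCH="-arch=sm_70"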

Launch:
My launch command is:

mpirun -n 20 /ocean/projects/dmr200038p/szhange/gpu/lammps-stable_2Aug2023_update2/src/lmp_mpi -partition 20x1  -sf gpu -pk  gpu 4  -in in.remd_Pt110

I'm trying to use 4 GPUs (with 5 CPU cores allocated per GPU) to run a parallel tempering simulation with 20 replicas, which requires 20 cores in total (5 MPI ranks per GPU). My input file in.remd_Pt110 is:


# **********************************Initialization**************************** 

units                   metal
boundary                p p p 
atom_style              atomic

neigh_modify    delay 5 every 5 check yes
variable Q world 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
pair_style  deepmd  compress_Pt110_160.pb

# ***********************************read_data********************************
read_data  Pt110.data

variable t world 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0 600.0 620.0 640.0 660.0 680.0
pair_coeff              * *

# *******************************Define pair styles***************************
thermo                  0
thermo_style    custom step temp etotal pe ke vol press lx ly lz  
thermo_modify   flush yes
timestep        0.002

velocity all create 300.0 4928459
variable STEP equal step
variable TEMP equal temp
variable ETOTAL equal etotal
variable PE equal pe
variable KE equal ke
variable VOL equal vol
variable PRESS equal press
variable LX equal lx
variable LY equal ly
variable LZ equal lz
variable PXX equal pxx
variable PYY equal pyy
variable PZZ equal pzz

fix thermo_output all print 2000 "${STEP} ${TEMP} ${ETOTAL} ${PE} ${KE} ${VOL} ${PRESS} ${LX} ${LY} ${LZ} ${PXX} ${PYY} ${PZZ}" file thermo.$Q.lammps title "#step   temp   etotal   pe   ke   vol   press   lx   ly   lz   pxx   pyy   pzz"

fix COM all momentum 1 linear 1 1 1 angular

fix myfix all nvt temp $t $t 0.1
temper 500000 1000 $t myfix 36312 12122
                                                        
write_data  Pt_110.$Q.data

My script file is:

#!/bin/bash
#SBATCH -o Pt110_remd.o%j
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH --gpus=v100-32:4
#SBATCH -J Pt110_remd
#SBATCH --time=0:30:00
module load gcc/10.2.0
module load nvhpc/22.9 
module load openmpi/4.0.5-nvhpc22.9
module load cuda/11.7.1
module load python/3.8.6
module load mkl/2020.4.304

echo "SLURM_NTASKS: " $SLURM_NTASKS

ulimit -n 2048
#export OMP_NUM_THREADS=1
nvidia-smi -l 10 -f nvidia-smi-output-$SLURM_JOB_ID.txt &

mpirun -n 20 /ocean/projects/dmr200038p/szhange/gpu/lammps-stable_2Aug2023_update2/src/lmp_mpi -partition 20x1  -sf gpu -pk  gpu 4  -in in.remd_Pt110

kill %1

The log.lammps, log.lammps.0, … files generated in the working directory are all empty, but the job kept running until it reached the time limit. Some lines from the output file Pt110_remd.o24760639 are:

2024-07-25 15:21:48.889075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 28675 MB memory:  -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0
2024-07-25 15:21:48.890365: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
...
>>> Info of model(s):
  using   1 model(s): compress_Pt110_160.pb
  rcut in model:      7
  ntypes in model:    1
2024-07-25 15:21:50.592250: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 4.2KiB (4352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:50.594755: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory 
>>> Info of model(s):
  using   1 model(s): compress_Pt110_160.pb
  rcut in model:      7
  ntypes in model:    1
2024-07-25 15:21:50.596132: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 8.5KiB (8704 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
...
2024-07-25 15:21:56.103233: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:56.104471: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:56.110743: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-25 15:21:56.111981: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

The output file Pt110_remd.o24760639 kept printing the following line until job ended due to time limit:

I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.2KiB (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

I used 4 Tesla V100-SXM2 GPUs with 32 GB of memory each, and each GPU has 5 CPU cores allocated to it.
The output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:15:00.0 Off |                    0 |
| N/A   28C    P0              53W / 300W |  32499MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0              54W / 300W |   2775MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:B2:00.0 Off |                    0 |
| N/A   27C    P0              53W / 300W |   2775MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:B3:00.0 Off |                    0 |
| N/A   32C    P0              55W / 300W |   2775MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      8987      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    0   N/A  N/A      8988      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    0   N/A  N/A      8989      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    0   N/A  N/A      8990      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    0   N/A  N/A      8991      C   ...stable_2Aug2023_update2/src/lmp_mpi      350MiB |
|    0   N/A  N/A      8992      C   ...stable_2Aug2023_update2/src/lmp_mpi      404MiB |
|    0   N/A  N/A      8993      C   ...stable_2Aug2023_update2/src/lmp_mpi    29022MiB |
|    0   N/A  N/A      8994      C   ...stable_2Aug2023_update2/src/lmp_mpi     1332MiB |
|    1   N/A  N/A      8987      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8988      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8989      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8990      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8991      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8992      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8993      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    1   N/A  N/A      8994      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8987      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8988      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8989      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8990      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8991      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8992      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8993      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    2   N/A  N/A      8994      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8987      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8988      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8989      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8990      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8991      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8992      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8993      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
|    3   N/A  N/A      8994      C   ...stable_2Aug2023_update2/src/lmp_mpi      346MiB |
+---------------------------------------------------------------------------------------+

I thought this might be a memory (RAM) issue, so I tried using 8 GPUs, but the problem was the same.

The 4 GPUs are not even being used for the simulation, and the job just kept running without stopping, with nothing written to the log.lammps files.

I don't know how to solve this issue. Could someone please offer suggestions or guidance? Any response will be appreciated! Thanks in advance!

Kevin

Hi @syi_zhang,

I am not very familiar with parallel tempering simulations. That said, I have had some questions and problems with parallelization under SLURM in the past, so here are some thoughts on your inputs:

  • You are basically launching 20 MPI processes on a single node and expecting them to be shared evenly across your GPUs. Clearly, from your nvidia-smi output, this is not the case: the memory usage is unbalanced between your GPUs and, looking at the PIDs, some processes are attached to several GPUs at once. This does not look like what you want.
  • You might be interested in looking at the details of the SLURM options you use in your script. Have a look at the documentation, in particular at how SLURM assigns tasks to CPUs and GPUs (see the ntasks option and its variations, and the untested sketch just after this list). That said, the best people to ask are those managing your cluster and its SLURM installation. If you are using a computing cluster from an academic institution (Bridges-2 is from Pittsburgh, right?), they generally have staff to help you prepare your scripts according to your needs, which avoids wasting CPU/GPU time and resources. See your cluster user guide for the helpdesk email address.
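
For example, here is a rough, untested sketch of the kind of batch header I have in mind (the exact option names and limits depend on your cluster's SLURM configuration, so double-check with the Bridges-2 documentation or helpdesk; /path/to/lmp_mpi is a placeholder for your executable):

#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH --gpus=v100-32:4
#SBATCH --ntasks=20            # one MPI task per replica
#SBATCH --cpus-per-task=1

mpirun -n $SLURM_NTASKS /path/to/lmp_mpi -partition 20x1 -sf gpu -pk gpu 4 -in in.remd_Pt110

This at least makes the requested task count explicit instead of relying on whatever mpirun decides; how those 20 ranks then get bound to the 4 GPUs is exactly the kind of thing the cluster staff can advise on.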

Concerning your input script:

thermo                  0
thermo_style    custom step temp etotal pe ke vol press lx ly lz  
thermo_modify   flush yes

When you say that the log files stay empty: this is expected behavior even if the simulation were running. With thermo 0 there is no point in flushing the thermo info, since the simulation prints thermo output only at the starting and ending steps (and at the end the info is flushed anyway).

As a rule of thumb, I would keep it dumb and simple: start by requesting one node per MPI process, with a shorter simulation (maybe on a smaller system), and see if things work as expected with one core per MPI process. Then start optimizing if possible. Also try to contact your cluster management team.
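
For instance, something as bare as the following (untested; /path/to/lmp_mpi and in.short_test are placeholders for your executable and a shortened copy of your input) already tells you whether a single replica runs on a single GPU at all:

mpirun -n 1 /path/to/lmp_mpi -sf gpu -pk gpu 1 -in in.short_test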

Hi @Germain,

Thanks for your kind reply and suggestions.

To test whether the GPU works as expected, I tried to run a very simple NVT simulation with the DeepMD potential:

# **********************************Initialization**************************** 
units 			metal
boundary 		p p p 
atom_style 		atomic

neigh_modify    delay 5 every 5 check yes

# ***********************************read_data********************************

read_data       Pt110.data
# *******************************Define pair styles***************************

pair_style  deepmd  compress_Pt110_160.pb
pair_coeff * *
timestep        0.002

velocity all create 300.0 4928459

fix myfix all nvt temp 300 300 0.1

run 5000000

write_data  Pt110_1.data

And the script file is:

#!/bin/bash
#SBATCH -o Pt110_remd.o%j
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH --gpus=v100-32:1
#SBATCH --ntasks=2
#SBATCH --cpus-per-gpu=2
#SBATCH -J Pt110_remd
#SBATCH --time=1:00:00

module load gcc/10.2.0
module load nvhpc/22.9 
module load openmpi/4.0.5-nvhpc22.9
module load cuda/11.7.1
module load python/3.8.6
module load mkl/2020.4.304

echo "SLURM_NTASKS: " $SLURM_NTASKS

ulimit -n 2048
#export OMP_NUM_THREADS=1
nvidia-smi -l 60 -f nvidia-smi-output-$SLURM_JOB_ID.txt &

mpirun -np 2 /ocean/projects/dmr200038p/szhange/gpu/lammps-stable_2Aug2023_update2/src/lmp_mpi   -sf gpu -pk  gpu 1  -in in.min_Pt
kill %1

At the beginning, it also showed the memory issue:

2024-07-27 10:17:46.544951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30755 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:16:00.0, compute capability: 7.0
2024-07-27 10:17:46.546467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30755 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:16:00.0, compute capability: 7.0
2024-07-27 10:17:46.573219: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-27 10:17:46.575415: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-27 10:17:46.625906: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 30.03GiB (32249282560 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.627571: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 27.03GiB (29024354304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.629056: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 24.33GiB (26121918464 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.630528: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 21.89GiB (23509725184 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.631993: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 19.71GiB (21158752256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.633458: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 17.73GiB (19042877440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.634927: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 15.96GiB (17138589696 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.636350: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 14.37GiB (15424730112 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.637826: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 12.93GiB (13882256384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.639203: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 11.64GiB (12494030848 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.640689: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 10.47GiB (11244627968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.642121: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 9.42GiB (10120165376 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.643607: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 8.48GiB (9108148224 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.645070: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 7.63GiB (8197332992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.646528: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 6.87GiB (7377599488 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.648015: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 6.18GiB (6639839232 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.649472: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 5.57GiB (5975855104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.650928: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 5.01GiB (5378269696 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.652395: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 4.51GiB (4840442368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.653848: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 4.06GiB (4356398080 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.655303: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 3.65GiB (3920758272 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.656758: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 3.29GiB (3528682240 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.658133: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.96GiB (3175813888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.659533: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.66GiB (2858232320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.661009: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.40GiB (2572409088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.662466: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 2.16GiB (2315168000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.663949: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.94GiB (2083651328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.665442: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.75GiB (1875286272 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.666913: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.57GiB (1687757568 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.668392: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.41GiB (1518981888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.669837: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.27GiB (1367083776 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.671240: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.15GiB (1230375424 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-27 10:17:46.672644: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 1.03GiB (1107337984 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Then it began to print the LAMMPS output:

INVALID_ARGUMENT: Tensor spin_attr/ntypes_spin:0, specified in either feed_devices or fetch_devices was not found in the Graph
  >>> Info of model(s):
  using   1 model(s): compress_Pt110_160.pb
  rcut in model:      7
  ntypes in model:    1
INVALID_ARGUMENT: Tensor spin_attr/ntypes_spin:0, specified in either feed_devices or fetch_devices was not found in the Graph

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
- USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 5 steps, delay = 5 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 9
  ghost atom cutoff = 9
  binsize = 4.5, bins = 23 23 23
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.002
Per MPI rank memory allocation (min/avg/max) = 3.109 | 3.109 | 3.11 Mbytes
   Step          Temp          TotEng         PotEng         KinEng         Volume         Press            Lx             Ly             Lz
         0   300           -565.69932     -569.92613      4.2268067      1000000        317.84803      100            100            100
slurmstepd: error: *** JOB 24798455 ON v008 CANCELLED AT 2024-07-27T11:17:47 DUE TO TIME LIMIT ***

The job is stuck at step 0 and no new results are printed.
The output from nvidia-smi is also very strange:

Sat Jul 27 11:13:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:16:00.0 Off |                    0 |
| N/A   29C    P0              56W / 300W |  32423MiB / 32768MiB |     29%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18483      C   ...stable_2Aug2023_update2/src/lmp_mpi     1308MiB |
|    0   N/A  N/A     18484      C   ...stable_2Aug2023_update2/src/lmp_mpi    31112MiB |
+---------------------------------------------------------------------------------------+

I don't think this simple NVT simulation should need 32423 MiB of memory. I contacted the Bridges-2 support staff (yes, it's from Pittsburgh) but have not received a response yet. I have no idea what causes this weird memory issue.

In addition, is it possible that the issue comes from my LAMMPS compilation? I'm not sure about this, as the result of the NVT simulation looks okay to me.

Please note that you have two processes attached to the GPU, one using about 30.3 GB of video RAM and the second about 1.3 GB, so it depends which is which. You probably need to stop the second one, which was launched later.
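
For example, something like the following (untested) lists the compute processes with their PIDs and GPU memory use, so you can tell which one to stop:

nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# then, if needed, stop the offending process with: kill <PID>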

Please also note that it is difficult to judge remotely how much memory your pair style requires.

In fact, since DeepMD is not part of LAMMPS, you are asking in the wrong forum. You need to discuss this with the DeepMD developers, since only they can tell you what the GPU memory requirements for your simulation are.

This seems unlikely to me, but you might try to compile the latest version of the code just in case. @akohlmey is still right in advising you to contact the DeepMD developers concerning those memory issues.
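
If you want to try that, an untested sketch of fetching and rebuilding the latest stable release with the same packages could look like this (the DeepMD pair style still has to be re-installed into the LAMMPS sources following the deepmd-kit documentation, which I am not familiar with):

# untested sketch: fetch the latest stable LAMMPS and rebuild with the same packages
git clone -b stable https://github.com/lammps/lammps.git
cd lammps/lib/gpu && make -f Makefile.linux    # after editing CUDA_HOME / CUDA_ARCH as before
cd ../../src
make yes-replica yes-gpu yes-extra-fix yes-kspace yes-openmp
make mpi -j4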

No, you can't tell from your current settings. I would make your test simulation even simpler: go for 1000 steps with thermo 1 and flush yes settings to see how many steps are actually computed. It seems that your system is able to initialize and to compute energy and pressure (which means it can compute forces). Since there is no further output because of the default settings, this gives no information on the number of steps that may have been computed. It is possible that your system is running, but very slowly due to its size and/or poor optimization. I can't help further with the memory issue.
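
A quick, untested way to check progress afterwards is to simply count the thermo lines that made it into the log, for example:

grep -c -E '^ *[0-9]+ ' log.lammps    # rough count of thermo lines written so far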

Thanks for your kind reply.
I have solved my issue by reinstalling DeepMD-kit and recompiling LAMMPS. I edited Makefile.mpi instead of Makefile.linux under lammps/lib/gpu. I think my previous issue was an improperly set environment variable in DeepMD-kit.

Thanks for your kind and helpful suggestions. I solved this issue by changing some environment variables when installing DeepMD-kit and recompiling LAMMPS.
Your suggestion to run a simpler simulation really helped me find the issue.