Hi all,
I am currently running a Hybrid MD/MC simulation to observe the precipitation sequence and evolution of an Al-Mg-Mn-Sc-Zr alloy during isothermal heat treatment at 573 K. The initial configuration is derived from a Laser Powder Bed Fusion (LPBF) simulation.
System and Resources:
- LAMMPS version: 29 Aug 2024
- Potential: DeepMD (machine-learning potential) combined with ZBL
- System size: ~850,000 atoms total (~370,000 atoms in the swap group)
- Hardware: 1 node with 8x Tesla V100 GPUs (run via a Singularity container)
- Parallelism: 8 MPI tasks (1 per GPU) with 2 OpenMP threads each
The Challenge:
Because DeepMD inference is computationally expensive, the MC swap steps (which require full energy re-evaluations) slow the simulation down significantly:
- Pure MD speed: ~0.21 ns/day
- Hybrid MD/MC speed: ~0.117 ns/day (roughly a 44% slowdown, i.e., ~1.8x the wall time)
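For reference, the slowdown figure follows directly from the two throughput numbers above (a trivial sketch, nothing beyond the quoted values):

```python
# Relative cost of adding MC to the MD run, from the throughput
# figures quoted above (ns/day).
md_only = 0.21   # pure MD
hybrid = 0.117   # hybrid MD/MC

slowdown = 1 - hybrid / md_only   # fraction of throughput lost (~44%)
wall_factor = md_only / hybrid    # wall-clock multiplier per simulated ns (~1.8x)

print(f"throughput loss: {slowdown:.1%}, wall-time factor: {wall_factor:.2f}x")
```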
To keep the simulation speed manageable, I have set the fix atom/swap parameters to a relatively low frequency and a low number of attempts:
- N (invoke frequency) = 100 steps (0.1 ps)
- X (max attempts) = 5 to 10 per fix command
- Total swap attempts per 0.1 ps: 45 (summed across all 6 fix commands)
My Questions:
- Physical validity: Given the extremely slow diffusion in solids, is such a low swap frequency (attempting only ~0.01% of the swap-group atoms every 0.1 ps) sufficient to overcome the time-scale limitations and effectively simulate precipitation/clustering evolution? Or is this flux too low to produce meaningful microstructural changes within a few nanoseconds of simulated time?
- Efficiency strategy: Are there specific strategies for optimizing fix atom/swap with expensive ML potentials like DeepMD? Is it better to perform fewer swaps more frequently (e.g., N=10, X=1) or bulk swaps less frequently (e.g., N=1000, X=100) to minimize GPU communication/invocation overhead?
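For concreteness, here is the arithmetic behind the ~0.01% figure in the first question, using only the numbers quoted above (370,000 swap-group atoms; six fix atom/swap commands invoked every 100 steps):

```python
# Back-of-envelope check of the attempted-swap fraction.
swap_group_atoms = 370_000
attempts_per_cycle = 10 + 10 + 10 + 5 + 5 + 5  # six fix atom/swap commands
cycle_ps = 0.1                                  # N = 100 steps at 1 fs

fraction = attempts_per_cycle / swap_group_atoms
attempts_per_ns = attempts_per_cycle * (1000 / cycle_ps)

print(f"attempted fraction per cycle: {fraction:.5%}")
print(f"attempts per ns: {attempts_per_ns:,.0f}")
```

Note this counts attempts, not accepted swaps; the effective compositional flux is lower still by the acceptance ratio.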
Below are my input script and relevant log outputs.
Input Script:
# 1. Initialization
units metal
boundary p p p
atom_style atomic
neighbor 2 bin
neigh_modify every 1 delay 0 check yes
timestep 0.001
read_data test.data
# 2. Force Field (DeepMD + ZBL)
pair_style hybrid/overlay deepmd model-compress.pb zbl 1.9 2.4
pair_coeff * * deepmd Al Mg Mn Sc Zr
pair_coeff * * zbl 0 0
pair_coeff 1 1 zbl 13 13
pair_coeff 1 2 zbl 13 12
pair_coeff 1 3 zbl 13 25
pair_coeff 1 4 zbl 13 21
pair_coeff 1 5 zbl 13 40
pair_coeff 2 2 zbl 12 12
pair_coeff 2 3 zbl 12 25
pair_coeff 2 4 zbl 12 21
pair_coeff 2 5 zbl 12 40
pair_coeff 3 3 zbl 25 25
pair_coeff 3 4 zbl 25 21
pair_coeff 3 5 zbl 25 40
pair_coeff 4 4 zbl 21 21
pair_coeff 4 5 zbl 21 40
pair_coeff 5 5 zbl 40 40
# 3. Group definitions
region powderbed block INF INF INF INF 32 INF units box
group powderbed region powderbed
fix 1 all npt temp 573 573 0.1 x 0 0 1.0 y 0 0 1.0
# 4. MC Swap Settings
# Al-Mn
fix mc_Mn powderbed atom/swap 100 10 12342 573 types 1 3 ke yes
# Al-Sc
fix mc_Sc powderbed atom/swap 100 10 12343 573 types 1 4 ke yes
# Al-Zr
fix mc_Zr powderbed atom/swap 100 10 12344 573 types 1 5 ke yes
# Sc-Zr (Core-shell competition)
fix mc_ScZr powderbed atom/swap 100 5 12349 573 types 4 5 ke yes
# Mg-related
fix mc_Mg powderbed atom/swap 100 5 12341 573 types 1 2 ke yes
fix mc_MgMn powderbed atom/swap 100 5 12347 573 types 2 3 ke yes
# 5. Output
thermo 1
# Output acceptance counts for monitoring
thermo_style custom step temp pe press vol &
f_mc_Mn[1] f_mc_Mn[2] &
f_mc_Sc[1] f_mc_Sc[2] &
f_mc_Zr[1] f_mc_Zr[2] &
f_mc_ScZr[1] f_mc_ScZr[2] &
f_mc_Mg[1] f_mc_Mg[2] &
f_mc_MgMn[1] f_mc_MgMn[2]
dump 1 all custom 100 test.xyz id type x y z vx vy vz
run 1000
Log Output (Excerpt):
Per MPI rank memory allocation (min/avg/max) = 69.73 | 91.72 | 117.2 Mbytes
Step Temp PotEng Press Volume f_mc_Mn[1] f_mc_Mn[2] f_mc_Sc[1] f_mc_Sc[2] f_mc_Zr[1] f_mc_Zr[2] f_mc_ScZr[1] f_mc_ScZr[2] f_mc_Mg[1] f_mc_Mg[2] f_mc_MgMn[1] f_mc_MgMn[2]
0 521.30925 -1.3931766e+08 102.28221 51336049 0 0 0 0 0 0 0 0 0 0 0 0
1000 567.19876 -1.3931208e+08 2.6469089 51282428 100 7 100 7 100 8 50 10 50 0 50 1
Loop time of 740.609 on 16 procs for 1000 steps with 856260 atoms
Performance: 0.117 ns/day, 205.725 hours/ns, 1.350 timesteps/s, 1.156 Matom-step/s
80.9% CPU use with 8 MPI tasks x 2 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 185.55 | 247.57 | 313.65 | 367.9 | 33.43
Modify | 337.53 | 354.47 | 359.76 | 34.8 | 47.86 <-- MC Swap overhead
Other | | 19.57 | | | 2.64
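For reference, the per-pair acceptance ratios implied by the step-1000 counters above (fix atom/swap reports attempted swaps in its first output column and accepted swaps in its second):

```python
# Acceptance ratios from the step-1000 thermo line
# (f_mc_*[1] = attempted swaps, f_mc_*[2] = accepted swaps).
counters = {
    "Al-Mn": (100, 7),
    "Al-Sc": (100, 7),
    "Al-Zr": (100, 8),
    "Sc-Zr": (50, 10),
    "Al-Mg": (50, 0),
    "Mg-Mn": (50, 1),
}
for pair, (attempts, accepts) in counters.items():
    print(f"{pair}: {accepts}/{attempts} = {accepts / attempts:.1%}")
```

The Sc-Zr channel accepts ~20% of attempts while Al-Mg accepts none over this window, which suggests the per-channel attempt budgets could be rebalanced.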
Job Submission Script (Slurm/Singularity):
#SBATCH --gres=gpu:8
# ...
singularity exec --nv -e -B /public:/public .../deepmd-kit_3.0.0rc0_cuda118.sif bash << EOF
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export TF_FORCE_GPU_ALLOW_GROWTH=true
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=1
export OMP_NUM_THREADS=2
# ...
mpirun -np 8 lmp -in test.in
EOF
Any insights or suggestions would be greatly appreciated. Thank you in advance!