Dear ASE users,
I’m running some NEB calculations on HPC machines, in particular Galileo (CINECA), with 48 processors per node.
I’m trying to understand how the parallelization works in order to speed up my calculations.
This is my first input, without any parallelization setup:
import ase
import numpy as np
from ase.calculators.orca import ORCA
from ase.atoms import Atoms
from ase.optimize import LBFGS, QuasiNewton
import ase.units as units
from ase.calculators.tip4p import TIP4P, epsilon0, sigma0
from ase.calculators.qmmm import EIQMMM, LJInteractions, Embedding
from ase.constraints import FixAtoms, FixBondLengths
from ase.io.trajectory import Trajectory
import io, os
import functions as fp  # local helper module
from ase.io import read, write
from ase.neb import NEB
#from ase.parallel import world
start_geom = 'initial_opt.xyz'
final_geom = 'final_opt.xyz'
start_geom_ase = read('initial_opt.traj')
final_geom_ase = read('final_opt.traj')
start_list_not_fix = fp.list_not_fix_func(start_geom, 6.0)
end_list_not_fix = fp.list_not_fix_func_without_adsorbate(final_geom, 6.0)
# QM-MM indices
qmindex = start_list_not_fix + end_list_not_fix
mmindex = []
for element in start_geom_ase:
    if element.index not in qmindex:
        mmindex.append(element.index)
# MM parameters
embedding = Embedding(rc=0.02)
lj = LJInteractions({('O', 'O'): (epsilon0, sigma0)})
calculator = EIQMMM(qmindex,
                    ORCA(label='neb_orca', orcasimpleinput='B97-3c',
                         orcablocks='%pal nprocs 48 end \n %maxcore 2000'),
                    TIP4P(), interaction=lj, vacuum=None,
                    embedding=embedding, output='neb.out')
images = [start_geom_ase]
#j = world.rank * 3 // world.size  # my image number
for i in range(3):
    image = start_geom_ase.copy()
    #if i == j:
    image.set_calculator(calculator)
    image.constraints = FixAtoms(mmindex)
    images.append(image)
images.append(final_geom_ase)
neb = NEB(images, allow_shared_calculator=True)
neb.interpolate(apply_constraint=True)
qn = LBFGS(neb, trajectory='neb.traj')
qn.run(fmax=0.01)
and these are the typical timings per NEB step:
Step Time Energy fmax
*Force-consistent energies used in optimization.
LBFGS: 0 11:54:10 -37982.963595* 39.3516
LBFGS: 1 11:58:16 -37989.062639* 5.3344
LBFGS: 2 12:02:23 -37989.331212* 3.9196
LBFGS: 3 12:06:47 -37989.833104* 2.3965
LBFGS: 4 12:11:05 -37990.305015* 2.2199
LBFGS: 5 12:15:12 -37990.403682* 2.3303
So each NEB step takes roughly 4 minutes.
However, if I insert the parallelization condition of ASE:
# ..... same as before ..... #
j = world.rank * 3 // world.size  # my image number
for i in range(3):
    image = start_geom_ase.copy()
    if i == j:
        image.set_calculator(calculator)
    image.constraints = FixAtoms(mmindex)
    images.append(image)
images.append(final_geom_ase)
neb = NEB(images, allow_shared_calculator=True)
neb.interpolate(apply_constraint=True)
qn = LBFGS(neb, trajectory='neb.traj')
qn.run(fmax=0.01)
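For clarity, here is a minimal standalone sketch (plain Python, no MPI needed) of how I understand the `j = world.rank * 3 // world.size` formula to distribute the 3 movable images over the MPI ranks; the rank counts below are just illustrative values for my node:

```python
def image_for_rank(rank, size, nimages=3):
    """Movable-image index assigned to an MPI rank by rank * nimages // size."""
    return rank * nimages // size

# With 3 ranks and 3 movable images, each rank owns exactly one image:
print([image_for_rank(r, 3) for r in range(3)])   # -> [0, 1, 2]

# With 48 ranks (one per core on a Galileo node), the ranks split into
# three groups of 16, each group assigned to one movable image:
assignment = [image_for_rank(r, 48) for r in range(48)]
print([assignment.count(i) for i in range(3)])    # -> [16, 16, 16]
```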
the time per NEB step decreases dramatically:
Step Time Energy fmax
*Force-consistent energies used in optimization.
LBFGS: 0 18:42:59 -37982.963594* 39.3515
LBFGS: 1 18:43:20 -37989.062639* 4.2437
LBFGS: 2 18:43:40 -37989.402321* 3.2667
LBFGS: 3 18:44:00 -37990.172304* 2.1642
LBFGS: 4 18:44:21 -37990.346845* 2.5432
LBFGS: 5 18:44:41 -37990.437566* 2.2412
LBFGS: 6 18:45:00 -37990.603804* 1.6526
That is roughly 30 seconds per NEB step.
There are two things that I don’t understand.
- Why is the NEB step parallelized at all? I did NOT activate the keyword parallel=True inside the NEB function.
- In both of the above examples I request 48 processors as the maximum number of processors for the calculation, with the following shell script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --account=IscrB_BEH2O2ND
#SBATCH -p g100_usr_smem #Or g100_usr_bmem for fat nodes
#SBATCH -q g100_qos_dbg
#SBATCH --mem=300000MB #Or up to 3000000MB on g100_usr_bmem
#SBATCH --time=00:05:00
#SBATCH --error=orca_%j.err
#SBATCH --job-name=orca_ase
module purge
module load profile/chem-phys
module load --auto orca/5.0.3--openmpi--4.1.1--gcc--10.2.0
# suppress no cuda error
export OMPI_MCA_opal_warn_on_missing_libcuda=0
python3 $@ > ase_neb.out
and at the same time, inside the ORCA calculator, I set orcablocks='%pal nprocs 48 end', with the following setup:
calculator=EIQMMM(qmindex,ORCA(label='neb_orca', orcasimpleinput='B97-3c', orcablocks='%pal nprocs 48 end \n %maxcore 2000'),
TIP4P(), interaction=lj, vacuum=None, embedding=embedding, output='neb.out')
So, as far as I understand, ASE should run each ORCA calculation with a total of 48 processors, i.e. the maximum number of available processes, and therefore should not be able to run the other images in parallel at the same time.
This is also confirmed by the input created by ASE and the output printed by ORCA:
Input:
! engrad B97-3c
%pal nprocs 48 end
%maxcore 2000
%pointcharges "neb_orca.pc"
*xyz 0 1
O -0.1352377220311489 5.426759293532192 4.289348954243008
......#other atoms ....
Output:
#.....orca output ...
************************************************************
* Program running with 48 parallel MPI-processes *
* working on a common directory *
************************************************************
------------------------------------------------------------------------------
ORCA GTO INTEGRAL CALCULATION
-- RI-GTO INTEGRALS CHOSEN --
------------------------------------------------------------------------------
#...orca output ...
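To make my concern concrete, here is the back-of-the-envelope arithmetic (plain Python, with the values from my setup) behind my expectation that the images cannot all run concurrently:

```python
cores_per_node = 48    # Galileo: 48 cores on one node
orca_procs = 48        # %pal nprocs 48 in the ORCA block
movable_images = 3     # middle NEB images that need forces each step

# If every image's ORCA job really ran with 48 MPI processes at the
# same time, the node would be oversubscribed threefold:
requested = movable_images * orca_procs
print(requested, requested / cores_per_node)  # -> 144 3.0
```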
Sorry for the very long question, but I’m totally confused by what is happening.
I hope someone can help me understand this.
Best,
Vittorio