What is happening during parallelization?

Dear ASE users,

I’m running some NEB calculations on HPC machines, in particular Galileo (CINECA), which has 48 processors per node.
I’m trying to understand how parallelization works so that I can speed up my calculations.

This is my first input, without any parallelization set up:

import ase
import numpy as np
from ase.calculators.orca import ORCA
from ase.atoms import Atoms
from ase.optimize import LBFGS, QuasiNewton
import ase.units as units
from ase.calculators.tip4p import TIP4P, epsilon0, sigma0
from ase.constraints import FixAtoms, FixBondLengths
from ase.io.trajectory import Trajectory
import io, os
import functions as fp
from ase.io import read,write
from ase.neb import NEB
from ase.calculators.qmmm import EIQMMM, LJInteractions, Embedding  # needed for the QM/MM setup below
#from ase.parallel import world


start_geom='initial_opt.xyz'
final_geom='final_opt.xyz'

start_geom_ase= read('initial_opt.traj')
final_geom_ase=read('final_opt.traj')

start_list_not_fix= fp.list_not_fix_func(start_geom,float(6))
end_list_not_fix= fp.list_not_fix_func_without_adsorbate(final_geom, float(6))

#QM-MM indexes
qmindex= start_list_not_fix + end_list_not_fix

mmindex=[]
for element in start_geom_ase:
    if element.index not in qmindex:
        mmindex.append(element.index)

# QM/MM coupling parameters
embedding = Embedding(rc=0.02)
lj = LJInteractions({('O', 'O'): (epsilon0, sigma0)})

calculator = EIQMMM(qmindex,
                    ORCA(label='neb_orca',
                         orcasimpleinput='B97-3c',
                         orcablocks='%pal nprocs 48 end \n %maxcore 2000'),
                    TIP4P(), interaction=lj, vacuum=None,
                    embedding=embedding, output='neb.out')

images= [start_geom_ase]

#j = world.rank * 3 // world.size  # my image number

for i in range(3):
    image = start_geom_ase.copy()
    #if i == j:
    image.set_calculator(calculator)
    image.constraints=FixAtoms(mmindex)
    images.append(image)


images.append(final_geom_ase)

neb = NEB(images,allow_shared_calculator=True)
neb.interpolate(apply_constraint=True)
qn = LBFGS(neb, trajectory='neb.traj')
qn.run(fmax=0.01)

and these are the typical timings during the NEB:

       Step     Time          Energy         fmax
*Force-consistent energies used in optimization.
LBFGS:    0 11:54:10   -37982.963595*      39.3516
LBFGS:    1 11:58:16   -37989.062639*       5.3344
LBFGS:    2 12:02:23   -37989.331212*       3.9196
LBFGS:    3 12:06:47   -37989.833104*       2.3965
LBFGS:    4 12:11:05   -37990.305015*       2.2199
LBFGS:    5 12:15:12   -37990.403682*       2.3303

So each NEB step takes roughly 4 minutes.

However, if I add the ASE parallelization logic:

# ..... same as before, but with 'from ase.parallel import world' uncommented ..... #
j = world.rank * 3 // world.size  # my image number

for i in range(3):
    image = start_geom_ase.copy()
    if i == j:
        image.set_calculator(calculator)
        image.constraints=FixAtoms(mmindex)
        images.append(image)


images.append(final_geom_ase)

neb = NEB(images,allow_shared_calculator=True)
neb.interpolate(apply_constraint=True)
qn = LBFGS(neb, trajectory='neb.traj')
qn.run(fmax=0.01)

The time for each NEB step decreases dramatically:

       Step     Time          Energy         fmax
*Force-consistent energies used in optimization.
LBFGS:    0 18:42:59   -37982.963594*      39.3515
LBFGS:    1 18:43:20   -37989.062639*       4.2437
LBFGS:    2 18:43:40   -37989.402321*       3.2667
LBFGS:    3 18:44:00   -37990.172304*       2.1642
LBFGS:    4 18:44:21   -37990.346845*       2.5432
LBFGS:    5 18:44:41   -37990.437566*       2.2412
LBFGS:    6 18:45:00   -37990.603804*       1.6526

Each NEB step now takes roughly 30 seconds.
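
For reference, this is how I understand the image assignment line j = world.rank * 3 // world.size; just a minimal sketch of the index arithmetic (the 3 ranks below are a hypothetical example, not my actual job):

# Minimal sketch of the rank-to-image assignment only (not the real NEB script).
nimages = 3                          # number of moving images
nranks = 3                           # hypothetical world.size

for rank in range(nranks):           # rank plays the role of world.rank
    j = rank * nimages // nranks     # "my image number", as in the loop above
    print(f'rank {rank} -> moving image {j}')

# With 3 ranks this prints rank 0 -> image 0, rank 1 -> image 1, rank 2 -> image 2,
# i.e. each rank would attach the calculator to a different moving image.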

There are two things I don’t understand.

  1. Why is the NEB step parallelized over the images? I did NOT activate the keyword parallel=True inside the NEB function.

  2. In both of the above examples I am requesting a maximum of 48 processors for the calculation with the following shell script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --account=IscrB_BEH2O2ND
#SBATCH -p g100_usr_smem            #Or g100_usr_bmem for fat nodes
#SBATCH -q g100_qos_dbg
#SBATCH --mem=300000MB              #Or up to 3000000MB on g100_usr_bmem
#SBATCH --time=00:05:00
#SBATCH --error=orca_%j.err
#SBATCH --job-name=orca_ase

module purge
module load profile/chem-phys
module load --auto  orca/5.0.3--openmpi--4.1.1--gcc--10.2.0

# suppress no cuda error
export OMPI_MCA_opal_warn_on_missing_libcuda=0

python3 "$@" > ase_neb.out

and at the same time I am requesting 48 processes inside the ORCA calculator via orcablocks='%pal nprocs 48 end ...', with the following setup:

calculator = EIQMMM(qmindex,
                    ORCA(label='neb_orca',
                         orcasimpleinput='B97-3c',
                         orcablocks='%pal nprocs 48 end \n %maxcore 2000'),
                    TIP4P(), interaction=lj, vacuum=None,
                    embedding=embedding, output='neb.out')

So, as far as I understand, ASE should run each ORCA calculation with 48 processors, i.e. the maximum number of processors available on the node, and therefore it should not be able to also run the other images in parallel at the same time.
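
To make my expectation explicit: with 3 moving images on a single 48-core node, I would have expected to have to split the cores between the images, for example something like the hypothetical setup below (nprocs 16 = 48 // 3 per image is just my guess, not something I have tested):

# Hypothetical variant (untested): share the 48 cores of the node between the
# 3 moving images, i.e. 48 // 3 = 16 ORCA processes per image.
from ase.calculators.orca import ORCA

qm_calc = ORCA(label='neb_orca',
               orcasimpleinput='B97-3c',
               orcablocks='%pal nprocs 16 end \n %maxcore 2000')
# This would then be wrapped in EIQMMM exactly as in the script above.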
The 48-processor setting is also confirmed by the input created by ASE and by the output printed by ORCA:

Input:

! engrad B97-3c
%pal nprocs 48 end
 %maxcore 2000% pointcharges "neb_orca.pc"


*xyz 0 1
O   -0.1352377220311489 5.426759293532192 4.289348954243008
......#other atoms ....

Output:

#.....orca output ...

           ************************************************************
           *        Program running with 48 parallel MPI-processes    *
           *              working on a common directory               *
           ************************************************************
------------------------------------------------------------------------------
                           ORCA GTO INTEGRAL CALCULATION
                           -- RI-GTO INTEGRALS CHOSEN --
------------------------------------------------------------------------------

#...orca output ...
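
In case it is useful, here is a small check I could add to the script to see what ASE’s parallel world actually reports (just a sketch using ase.parallel):

# Sketch of a quick diagnostic: how many MPI ranks does ASE see?
from ase.parallel import world, parprint

parprint(f'world.size = {world.size}')                # printed once, by the master rank
print(f'this is rank {world.rank} of {world.size}')   # printed by every rank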

Sorry for the very long question, but I’m totally confused by what is happening.

I hope someone can help me understand this.

Best,

Vittorio