Dear Axel and ATAT experts,
I’m running a cluster expansion with mmaps and pollmach and am having trouble executing multiple DFT calculations simultaneously. My goal is to run four DFT calculations in parallel. I have compiled the ATAT package on our supercomputer, and I did not create a machine.rc file.
Below is the sbatch script I’ve been using:
#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
module use $OTHERSTAGES
module load Stages/2023
module load intel-para/2022a
module load GCCcore/.11.3.0 tcsh/6.24.01
#/p/home/jusers/ting1/jureca/bin_jureca/atat/maps -d &
# Run multiple parallel instances of pollmach
mmaps -d &
pollmach ./runstruct_qe srun --exclusive -n 64 &
sleep 30
pollmach -f ./runstruct_qe srun --exclusive -n 64 &
sleep 30
pollmach -f ./runstruct_qe srun --exclusive -n 64 &
sleep 30
pollmach -f ./runstruct_qe srun --exclusive -n 64
wait
I aimed to distribute four jobs across 256 cores (2 nodes × 128 tasks per node), allocating 64 cores per job. Initially, this setup ran four jobs in parallel successfully. However, after a few jobs had finished, some calculations began to fail, primarily because of incorrect input files. My initial guess is that the concurrent pollmach instances are attempting to run the same structure, leaving its input files in disarray.
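To make the suspected race concrete, here is a minimal sketch of how several concurrent workers can claim job directories without ever picking the same one, using an atomic mkdir as a lock. This is not ATAT or pollmach code, and I do not know how pollmach actually selects structures; the directory layout, the claimed.lock name, and the claim_jobs function are all hypothetical and only illustrate the idea.

```shell
#!/bin/bash
# Hypothetical sketch, NOT ATAT/pollmach code: directory names, the
# "claimed.lock" name, and claim_jobs are assumptions for illustration.

workdir=$(mktemp -d)
for i in 0 1 2 3 4 5 6 7; do
    mkdir "$workdir/$i"          # stand-ins for structure directories
done

claim_jobs() {
    local worker=$1
    for job in "$workdir"/[0-9]*; do
        # mkdir is atomic: exactly one worker can create the lock directory,
        # every other worker gets an error and skips this job.
        if mkdir "$job/claimed.lock" 2>/dev/null; then
            # Append (not overwrite) so a double claim would show up as 2 lines.
            echo "worker $worker" >> "$job/owner"
        fi
    done
}

# Four workers started at the same time, like four pollmach instances.
for w in 1 2 3 4; do
    claim_jobs "$w" &
done
wait

# Total number of claims across all job directories.
claimed=$(cat "$workdir"/*/owner | wc -l | tr -d ' ')
echo "jobs claimed: $claimed"
```

If the number of claims equals the number of job directories, no job was picked up twice. Without such a lock, two workers that scan the directories at the same moment could both start the same job, which is the behaviour I suspect in my runs.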
Could you provide insights or guidance on correctly configuring parallel jobs using sbatch scripts and pollmach to avoid such issues? Your expertise and advice would be immensely appreciated.
Thank you for your time and assistance!
Best regards,
Yin-Ying Ting