BSE-XAS calculation crashes

Hi,

I am a new user of exciting and I am currently testing it for XAS with BSE on a hexagonal ice bulk structure. Today I found that the same input completes without problems on one supercomputer but not on another. I tested the BSE-XAS examples from the tutorial and both worked fine. Below is the error message:

No =input.xml

Using specified input file: input.xml

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
exciting_mpismp    0000000001096E1A  for__signal_handl  Unknown     Unknown
libpthread-2.26.s  00007FEB557682D0  Unknown            Unknown     Unknown
exciting_mpismp    0000000000AE5DF5  Unknown            Unknown     Unknown
exciting_mpismp    00000000006061D4  Unknown            Unknown     Unknown
exciting_mpismp    0000000000574761  Unknown            Unknown     Unknown
exciting_mpismp    0000000000793EE6  Unknown            Unknown     Unknown
exciting_mpismp    000000000069313A  Unknown            Unknown     Unknown
exciting_mpismp    0000000000B1E987  Unknown            Unknown     Unknown
exciting_mpismp    0000000000412012  Unknown            Unknown     Unknown
libc-2.26.so       00007FEB553BE34A  __libc_start_main  Unknown     Unknown
exciting_mpismp    0000000000411F2A  Unknown            Unknown     Unknown
srun: error: nid001150: task 0: Exited with exit code 174
srun: launch/slurm: _step_signal: Terminating StepId=652529.0

I used Intel compilers to build exciting oxygen. As a new user I cannot attach any files, so I uploaded them to GitLab:

https://gitlab.com/chliu2018/exciting_tests/-/tree/main/test_1

The error message is also in that directory (in the slurm output file). Could you give me a hint about where things could have gone wrong?

Thank you!
Chang Liu

Hi Chang,

Can you compile with make debugmpiandsmp and see if the problem still occurs? If so, the debug build will tell us where it went wrong.

I see that you’re running with ASE; I don’t think this makes sense. The ASE calculator only supports ground-state calculations, but you wish to do BSE, skipping the ground state. Can you try submitting the job directly? I also don’t think there’s a way to specify MPI processes with the ASE calculator (?). If not, the run-time performance will be terrible.

Once you have checked these things, can you provide more information, please? Which Intel compiler did you use, and which make.inc (provided or custom)? Can you supply INFO.OUT and the corresponding BSE INFO file for the failing case, plus the run-time settings?

I also note your rgkmax is extremely small, but I assume this is for testing purposes.

Cheers,
Alex

Hi,

I compiled using make debugmpiandsmp and ran the same test, but it seems that this time the error pops up in process 401. I have uploaded the relevant files to:
https://gitlab.com/chliu2018/exciting_tests/-/tree/main/test_2

Yes, the official ASE version can only do serial calculations, so I looked into the module and modified it a little; it now works with the MPI executable, too. The modified code is at:
https://gitlab.com/chliu2018/ase-master-exciting/-/blob/master/ase/calculators/exciting.py
I added a new argument called “mpi_command”; lines 94-96 show the trick. I have run around twenty tutorials from exciting’s homepage with 4 or 8 cores, and it works well.
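In essence, the modification just prefixes the serial launch command with an MPI launcher whenever “mpi_command” is set. A minimal sketch of the idea (simplified, with illustrative names, not the exact code in my repository):

def build_run_command(exciting_binary: str, mpi_command: str = "") -> list:
    # Prefix the serial binary with an MPI launcher, e.g. "mpirun -np 8" or "srun -n 32".
    command = [exciting_binary]
    if mpi_command:
        command = mpi_command.split() + command
    return command

print(build_run_command("exciting_mpismp", "mpirun -np 8"))
# -> ['mpirun', '-np', '8', 'exciting_mpismp']; this list is then passed to subprocess.Popen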

As for the compilation, I uploaded relevant files to:
https://gitlab.com/chliu2018/exciting_tests/-/tree/main/build_info

Thank you very much!
Chang Liu

Hi Chang,

This is the problem:

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable PMUO1 when it is not allocated

Image              PC                Routine            Line        Source             
exciting_debug_mp  0000000003935ADF  Unknown               Unknown  Unknown
exciting_debug_mp  0000000002A844C6  exccoulint_               484  exccoulint.f90
exciting_debug_mp  0000000000B1D746  exccoulintlaunche         118  exccoulintlauncher.f90
exciting_debug_mp  0000000000935FD9  xsmain_                   161  xsmain.F90
exciting_debug_mp  0000000001956B01  xstasklauncher_           233  xstasklauncher.f90
exciting_debug_mp  00000000022D8EE0  tasklauncher_              25  tasklauncher.f90
exciting_debug_mp  0000000001D7DD5B  MAIN__                     51  main.f90
exciting_debug_mp  0000000000410D52  Unknown               Unknown  Unknown
libc-2.26.so       00007F661379234A  __libc_start_main     Unknown  Unknown
exciting_debug_mp  0000000000410C6A  Unknown               Unknown  Unknown

I’ll ask the BSE developers if this has been patched.

W.r.t. running exciting with MPI via ASE, I didn’t look at the code, but you’ll also need to set the OMP_NUM_THREADS environment variable for maximum efficiency. excitingtools already enables you to generate ground-state and BSE input with Python, and it also has numerous file parsers. There is an open MR with ASE to completely overhaul the calculator, using excitingtools as a plug-in. Hopefully that goes through this year.

Cheers,
Alex

Hi Alex,

Yes, I am indeed setting OMP_NUM_THREADS to 1 for my parallelized computations. Many thanks for the tips about the Python scripts for BSE - I am still learning these very convenient tools (I created a dedicated Python 2.7.14 environment in Anaconda for the exciting tools) and they are indeed very helpful. The news about the open MR with ASE sounds great, too.

Since I am trying to use exciting to compute X-ray absorption and emission spectra based on GW-BSE, I tried to go beyond the TDA by setting coupling = "true", following the BN tutorial (http://exciting.wikidot.com/oxygen-x-ray-absorption-spectra-using-bse). Unfortunately, the program crashes every time I try:

mpprun info: Starting impi run on 1 node ( 8 rank X 1 th ) for job ID 21805777
Abort(101) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 2
Abort(101) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 4
Abort(101) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 6
mpprun info:   Job terminated with error

The same test runs smoothly if I turn off coupling.

This exciting build was compiled by the staff at NSC at Linköping University, so it should not suffer from my own inexperience as in the previous case. I have uploaded the whole case to:
https://gitlab.com/chliu2018/exciting_tests/-/tree/main/test_3

Could you help me with this case, too? Thank you!

Best wishes,
Chang Liu

Hi Chang,

Python’s subprocess launches the program in its own environment. I’m relatively sure that you would need to pass OMP_NUM_THREADS via an env dictionary to subprocess for anything other than 1 to be used. Something like this:

import os
from subprocess import PIPE, Popen

def some_routine(execution_str, path, my_env=None):
    if my_env is None:
        my_env = os.environ.copy()  # inherit OMP_NUM_THREADS etc. from the calling shell
    return Popen(execution_str.split(), cwd=path, stdout=PIPE, stderr=PIPE, env=my_env)

excitingtools is not the tutorial scripts; it’s the Python 3 package we’re developing to supersede them: excitingtools · PyPI (it is also packaged with exciting).

W.r.t. the failing calculation, I’ll pass this info on to Fabian, who currently does the most x-ray absorption simulations in the group.

Cheers,
Alex

Hi Alex, hi Chang,

The described error should already be patched. If I remember correctly, it comes up with a certain Intel compiler version, so if you can change the compiler, the problem should be resolved easily.

@chang_liu22 can you please check your compiler versions?

Best,
Benedikt

Hi,

I add the environment variables through the submission shell script:

#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -D .
#SBATCH -p main
#SBATCH -n 32
#SBATCH -t 3:00:00
#SBATCH -A snic2021-3-34
#SBATCH -J ice_ih_gs
#SBATCH -c 1
source /cfs/klemming/projects/snic/xsolas/chliu/bash_files/load_ase-exciting_oxygen_202209.bash
python ice_ih_gs.py

where load_ase-exciting_oxygen_202209.bash is:

module add PrgEnv-intel/8.2.0
source /cfs/klemming/projects/snic/xsolas/chliu/anaconda3/etc/profile.d/conda.sh
conda activate ase-3.22.1

export OMP_NUM_THREADS=1
export EXCITINGROOT=/cfs/klemming/projects/snic/xsolas/chliu/exciting_oxygen/exciting
export EXCITINGTOOLS=$EXCITINGROOT/tools
export TIMEFORMAT="   Elapsed time = %0lR"
export WRITEMINMAX="1"
export PYTHONPATH=$PYTHONPATH:$EXCITINGTOOLS/stm
export PATH=$PATH:$EXCITINGTOOLS:$EXCITINGROOT/bin:$EXCITINGTOOLS/stm

I think the ifort version is 2021.5.0 20211109:

$ ifort --version
ifort (IFORT) 2021.5.0 20211109
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

I will try another version and see if the problem is gone. Thank you!

PS: Is there a way of running BSE XANES/XES with coupling = "true"?

Best wishes,
Chang Liu

Hi again,

I tried another “Programming Environment for Intel” module on the Dardel cluster at PDC, Royal Institute of Technology in Sweden, but unfortunately it provides the same ifort version, and after recompiling with that module the calculation crashed with the same error again. It also seems that these are the only two tested Intel compiler modules installed there. Could you please tell me which version of ifort I should request from the supercomputer staff?

Best wishes,
Chang Liu

Hi Chang,

Intel 2021 is about as new as it gets; I’d be surprised if changing the compiler affected an allocation error. Regardless, please clone the patched version of exciting oxygen from our GitHub page and see if that does the trick.

W.r.t. running coupling matrices with BSE XANES/XES, @maurerben is the person to answer.

Cheers,
Alex

Hi Alex,

I think I might have downloaded an older version of the code: I took it from http://exciting.wikidot.com/oxygen instead of GitHub, assuming both would have the same version. I will clone from the GitHub page as you suggested and redo all the tests! It is very likely that most, or even all, of the errors I have encountered would not have appeared had I cloned from GitHub.

Thank you very much!

Best wishes,
Chang Liu

The website hosts version oxygen (8.0.0), but patches get pushed to GitHub as we find and fix problems.

We are in the process of phasing out the wikidot site.

Cheers,
Alex

Hi again,

I have cloned the exciting GitHub repository and recompiled it on the supercomputer, and this time the XAS and G0W0 runs for ice VII completed successfully.

However, the XES computation keeps failing with errors I do not understand. So I compiled the exciting_debug_mpismp binary and reran the simulation; the error this time is:

Image              PC                Routine            Line        Source             
exciting_debug_mp  000000000394054F  Unknown               Unknown  Unknown
exciting_debug_mp  000000000057B160  modbse_mp_select_         983  modbse.f90
exciting_debug_mp  000000000313E182  scrcoulint_               226  scrcoulint.f90
exciting_debug_mp  00000000019A57A2  scrcoulintlaunche          98  scrcoulintlauncher.f90
exciting_debug_mp  0000000001169254  xsmain_                   156  xsmain.F90
exciting_debug_mp  00000000004EB0FD  xstasklauncher_           233  xstasklauncher.f90
exciting_debug_mp  00000000004DF8FC  tasklauncher_              25  tasklauncher.f90
exciting_debug_mp  000000000333B5B3  MAIN__                     51  main.f90
exciting_debug_mp  0000000000410D52  Unknown               Unknown  Unknown
libc-2.26.so       00007FBDA836C34A  __libc_start_main     Unknown  Unknown
exciting_debug_mp  0000000000410C6A  Unknown               Unknown  Unknown
forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 2 of array SMAP is 320 and the corresponding extent of array <RHS expression> is 256

I have uploaded the run in:

https://gitlab.com/chliu2018/exciting_tests/-/tree/main/test_4

Could you please give me a hint about how to fix this run?

Besides, I am a bit confused about how many processes in total and how much memory I should assign to G0W0, XES and XAS for larger systems, such as the hexagonal ice bulk supercell containing 24 water molecules, and for denser k/q-point sampling, since the computational resources required can be much larger than for the tutorial tests. Is there a way to ask the program to print out the memory required, in total or per process, for a run? Thanks!

Best wishes,
Chang Liu

Hi Chang,

Is this also happening on another machine or with another compiler?

About scaling the calculations: you can parallelize the BSE calculations (XAS, XES, etc.) over the number of k-points, and I normally try to distribute them over as many ranks as possible. In the BSE section of INFOXS.OUT you will also find the local matrix size, which helps to estimate the memory needed per process.
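As a rough rule of thumb, the number of k/q grid points gives an upper bound on how many MPI ranks the BSE part can use efficiently. A quick way to check this from the input (a sketch; it assumes ngridk/ngridq are set as attributes of the <xs> element in input.xml, adapt to your setup):

import xml.etree.ElementTree as ET
from math import prod

xs = ET.parse("input.xml").getroot().find("xs")
nk = prod(int(n) for n in xs.get("ngridk", "1 1 1").split())
nq = prod(int(n) for n in xs.get("ngridq", "1 1 1").split())
print(f"k-grid: {nk} points, q-grid: {nq} points -> at most ~{max(nk, nq)} useful MPI ranks")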

Best,
Benedikt

Hi,

I have built exciting with GCC, and this time the BSE-XES run for LiF from the tutorial completed successfully, but it crashed when I tested the ice VII bulk structure with 2 H2O molecules. The build information and the test I/O files are at:
https://gitlab.com/chliu2018/exciting_tests/-/tree/main/test_5

Besides, I have been struggling to get the BSE-XAS calculations for larger systems past out-of-memory (OOM) errors. For example, for the bulk structure of ice Ih with 12 water molecules, I increased the computational resources up to 96 cores and 1152 GB RAM, and it still crashed at the GW step complaining about OOM. For the 32-water structure, it gets stuck at the first cycle of the KS-DFT SCF. I started with fewer cores and far less memory and increased the requested resources gradually, but the same problem always appears. The same calculation for the 2-molecule ice VII structure was successful. These tests were done with the previous Intel compiler build and are at:
https://gitlab.com/chliu2018/exciting_tests/-/tree/main/test_6

Should I decrease the k/q-point sampling density? I did convergence tests, and the current k/q-point density was the converged value.

Best wishes,
Chang Liu