Why does exciting_smp only use two CPUs at most?

Dear all,
I am running exciting_smp by following the tutorial.
I have set OMP_NUM_THREADS=32; however, exciting_smp only ever uses one or two CPUs at most.
My system: gfortran 9.2.0, exciting-oxygen, CentOS 7.0.

Best regards.
Youzhao Lan

Hi Youzhao,

Can you give the input and species files please, so I can try and reproduce the run?
Or specify which tutorial it is?

For groundstate, threading will mostly be utilised in the lapack/openBLAS/MKL library.
Which did you link to?

Thanks,
Alex

Dear Alex,
Thanks for your reply.
I am sorry for not giving you enough information.
When I try the examples in the tutorial, exciting_smp runs fine,
but when I try the following input, which I wrote by following the example in the tutorial
"Electronic Band Structure from GW",
exciting_smp only ever runs on one or two CPUs at most.

<input>
  <title>monolayer BN</title>
  <structure speciespath='$EXCITINGROOT/species'>
    <crystal>
      <basevect>    4.109335   -2.372526    0.000000</basevect>
      <basevect>    0.000000    4.745052    0.000000</basevect>
      <basevect>    0.000000    0.000000   28.369500</basevect>
    </crystal>
    <species speciesfile='B.xml' rmt='1.34'>
      <atom coord='    0.000000    0.000000    0.500000'></atom>
    </species>
    <species speciesfile='N.xml' rmt='1.34'>
      <atom coord='    0.333333   -0.333333    0.500000'></atom>
    </species>
  </structure>

  <groundstate
      do="fromscratch"
      rgkmax="7.0"
      ngridk="6 6 1"
      xctype="LDA_PW"
      >
   </groundstate>

</input>

Best regards.
Lan

Hi Lan,

Could you clarify please:
You’re trying to follow the GW tutorial for silicon but apply it to boron nitride. And to begin with you’d like to perform a ground state calculation using this input?

What linear algebra library did you link to when you built? packaged lapack/blas, system lapack/blas, MKL, openBLAS?

And to confirm, you export OMP_NUM_THREADS=32?
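
A minimal sketch of how the thread count could be set explicitly for one run, rather than relying on the shell. On Linux, environment variable names are case-sensitive and OpenMP runtimes read the upper-case `OMP_NUM_THREADS`. The helper name `openmp_env` is hypothetical, and the commented launch line assumes `exciting_smp` is on the `PATH` and `input.xml` is in the working directory.

```python
import os

def openmp_env(num_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set (upper case)."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env

env = openmp_env(32)
print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])

# To launch with this environment (assumption: binary on PATH, input.xml in cwd):
# import subprocess
# subprocess.run(["exciting_smp"], env=env, check=True)
```

Setting the variable only tells the OpenMP runtime how many threads it may use; whether they are actually used depends on how much of the code path, and of the linked BLAS/LAPACK, is threaded.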

Dear Alex,
Thanks for your reply.
Yes, I want to calculate the ground state of BN, and I wrote the input file based on the tutorial example.
For the example in this tutorial (i.e. silicon), exciting_smp works well with multiple threads.
So I think my compilation is fine.
However, when I run the job for the ground state of BN, exciting_smp always runs on only one or two CPUs at most.
Can you try to run my input file?

Note: the make_script in exciting-oxygen does not automatically compile the lapack/blas contained in the src directory, so I linked to an external lapack/blas that was compiled without the -fopenmp flag. (I don’t have permission to recompile lapack for the time being.)

Best regards.
Lan

Hi Lan,

I can try running your file but:

  • I don’t think there’s actually a problem
  • I’d like to fully clarify the lapack point

With regard to the lapack/blas linking: if it’s the vendor version shipped with CentOS 7.0 and, as you’ve suggested, it’s a serial build, diagonalisation won’t run with threads. Not a lot of the ground state is OpenMP threaded. Rather, to get optimal performance one assumes a) you build the MPISMP version to exploit k-point distribution across MPI processes and b) you use openBLAS or MKL to gain the benefit of shared-memory parallelism for the linear algebra operations.

PS Never build with the supplied version of lapack/blas (it will be the slowest). This is left over from long ago and I need to remove it completely before the next release.

My thoughts:

  • Use the package manager of CentOS to install openBLAS or MKL. Rebuild exciting, link to either and see if the threads change/the code speeds up. The Intel suite is actually very easy to install now, and does not require a paid licence.
  • GW will be similarly slow unless you do this, because of the current parallelisation scheme.
  • I can run this file for you and see what happens on my mac using GCC 9 and openBLAS.
  • Why did you select the same muffin-tin radius for B and N? This is probably suboptimal (I will need to check with Andris to get the optimal ratio).
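
To see which linear algebra library a binary was actually linked against, one option on Linux is to inspect the output of `ldd`. The sketch below only classifies that text. The file-name patterns are assumptions (library names vary between distributions, and a system `libblas.so` can itself be an alias for another implementation), the function name is hypothetical, and the sample text is made up for illustration.

```python
import re

# Assumed file-name fragments; adjust them for your distribution.
PATTERNS = {
    "MKL": r"libmkl",
    "OpenBLAS": r"libopenblas",
    "reference BLAS/LAPACK": r"lib(blas|lapack)\.so",
}

def classify_linear_algebra(ldd_output):
    """Return labels of the BLAS/LAPACK flavours named in ldd output."""
    return [label for label, pattern in PATTERNS.items()
            if re.search(pattern, ldd_output)]

# Made-up sample resembling ldd output, for illustration only.
sample = """\
    libgomp.so.1 => /lib64/libgomp.so.1
    liblapack.so.3 => /lib64/liblapack.so.3
    libblas.so.3 => /lib64/libblas.so.3
"""
print(classify_linear_algebra(sample))

# With a real binary (the path is an assumption):
# import subprocess
# out = subprocess.run(["ldd", "/path/to/exciting_smp"],
#                      capture_output=True, text=True).stdout
# print(classify_linear_algebra(out))
```

If only the reference libraries show up, the diagonalisation is most likely running serially regardless of the thread count requested.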

Cheers,
Alex

Dear Alex,
Many thanks for your time.
I compiled the MPI version, and it runs well.

The default rmt in the species files is 1.45 for both B and N. Their sum is 2.90, which is larger than the distance (2.72) between the B and N atoms, so I set both to 1.34 (I also referred to the ELK program).
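
The overlap argument above can be checked with a short script. This is a sketch that assumes the lattice vectors and fractional coordinates from the input posted earlier in the thread, all in Bohr, and that searches the neighbouring periodic images for the shortest B–N distance. The function names are hypothetical. With the numbers as written it comes out near 2.74, in line with the figure quoted above.

```python
import itertools
import math

# Lattice vectors (rows) and fractional coordinates from the posted input.
basevect = [
    (4.109335, -2.372526, 0.000000),
    (0.000000, 4.745052, 0.000000),
    (0.000000, 0.000000, 28.369500),
]
frac_B = (0.000000, 0.000000, 0.500000)
frac_N = (0.333333, -0.333333, 0.500000)

def to_cartesian(frac):
    """Convert a fractional vector to Cartesian: sum_i frac[i] * basevect[i]."""
    return tuple(sum(frac[i] * basevect[i][k] for i in range(3)) for k in range(3))

def nearest_distance(fa, fb):
    """Shortest distance between fa and any periodic image of fb (shifts -1..1)."""
    best = math.inf
    for shift in itertools.product((-1, 0, 1), repeat=3):
        delta = tuple(fb[i] + shift[i] - fa[i] for i in range(3))
        best = min(best, math.hypot(*to_cartesian(delta)))
    return best

d = nearest_distance(frac_B, frac_N)
for rmt in (1.45, 1.34):
    status = "overlap" if 2 * rmt > d else "no overlap"
    print(f"rmt = {rmt}: sum = {2 * rmt:.2f}, B-N distance = {d:.3f} -> {status}")
```

The same check works for unequal radii by comparing `rmt_B + rmt_N` against `d` instead of `2 * rmt`.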

Best regards.
Lan