Properly setting ncores for batch VASP runs

Hi all,

When running VASP calculations through SLURM → FireWorks → custodian → mpirun, do extra cores need to be available for the FireWorker and Custodian processes?

For example…

  • I have one SLURM job that runs a singleshot FireWork.
  • The FireWork is a VASP static calc and runs with Custodian + default handlers.
  • The VaspJob launches the calculation via the command “mpirun -n 16 vasp”

In this case, would I need to request more than 16 cores in the SLURM job (i.e., more than --ntasks=16)? I'd like to make sure I'm not slowing down calculations by giving them too few cores, while also not requesting cores I don't need.
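For reference, here's roughly how the job is set up. This is a simplified sketch rather than my exact code (handler choices and the vasp binary name are just placeholders):

```python
from custodian import Custodian
from custodian.vasp.jobs import VaspJob
from custodian.vasp.handlers import VaspErrorHandler, UnconvergedErrorHandler

# VASP is launched as "mpirun -n 16 vasp" -- the question is whether
# SLURM needs to provide more than these 16 cores.
vasp_cmd = ["mpirun", "-n", "16", "vasp"]

handlers = [VaspErrorHandler(), UnconvergedErrorHandler()]
jobs = [VaspJob(vasp_cmd=vasp_cmd)]

c = Custodian(handlers, jobs)
c.run()
```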

-Jack

Hi Jacksund,

You do not need to explicitly leave cores available for Python. The whole job goes through Python: Custodian opens VASP using the subprocess module, VASP then makes use of the available cores, and Custodian keeps running in the background. Custodian/FireWorks do technically take resources away from VASP, but the overhead is so small compared to VASP's demands that it's essentially negligible.
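To illustrate the mechanics, here's a simplified sketch of what this looks like (not custodian's actual source, just the general pattern):

```python
import subprocess

# Custodian launches VASP as a child process. The Python parent then
# just blocks in wait()/poll() loops, consuming essentially no CPU
# while the 16 MPI ranks do the real work.
proc = subprocess.Popen(["mpirun", "-n", "16", "vasp"])
returncode = proc.wait()  # parent sleeps here, using ~0% CPU
```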

-Nick

Hey Nick!

Thanks for the quick response! And this is good to know. I have to admit, I'm not 100% convinced for my specific use case, so I'm actually going to do more testing and will report back later today. Also, let me know if I'm just talking nonsense below – I'm not great with parallel processes and may be off/confused haha.

I'm seeing a significant slowdown (>30%) when the VASP jobs are small (e.g. less than 10 min) and Custodian is told to check for errors frequently (e.g. every 15 s). After some more reading and testing, it's my understanding that multiple processes are spun up for each thing we run. The main process is Python/Custodian → then subprocess spins up a secondary process for mpirun → and mpirun spins up "-n" processes for VASP.
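You can actually see this process tree yourself. A quick sketch using the third-party psutil package (assuming mpirun and a vasp binary are on your PATH):

```python
import subprocess
import time

import psutil  # third-party: pip install psutil

# Launch mpirun the same way Custodian's subprocess call does.
proc = subprocess.Popen(["mpirun", "-n", "16", "vasp"])
time.sleep(2)  # give mpirun a moment to fork its ranks

# From the Python parent, walk the tree: one mpirun child,
# which in turn forks the 16 VASP ranks.
me = psutil.Process()  # this Python process
for child in me.children(recursive=True):
    print(child.pid, child.name())

proc.wait()
```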

[this is where I’m unsure and could use clarification] So having more processes than cores is fine when the majority of them are sleeping most of the time (like Python+Custodian typically does). But when all processes are active (i.e. VASP is running AND Custodian is constantly checking status), then processes have to share cores, and I think that sharing slows down VASP significantly. A simple solution is to decrease my Custodian check frequency (see the sketch below), but I'm also going to test whether the number of available cores has the same effect.
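By "decrease my Custodian frequency" I mean something like this. If I'm reading the custodian docs right, the Custodian constructor takes polling_time_step and monitor_freq for this; the values here are illustrative, not recommendations:

```python
from custodian import Custodian
from custodian.vasp.jobs import VaspJob
from custodian.vasp.handlers import VaspErrorHandler

jobs = [VaspJob(vasp_cmd=["mpirun", "-n", "16", "vasp"])]

# polling_time_step: seconds between checks on whether VASP has finished.
# monitor_freq: number of polling steps between monitor-handler checks,
# so monitor handlers fire every polling_time_step * monitor_freq seconds.
c = Custodian(
    [VaspErrorHandler()],
    jobs,
    polling_time_step=10,  # poll process status every 10 s
    monitor_freq=30,       # scan for errors every ~300 s instead of every 15 s
)
c.run()
```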

Does it sound like I’m on the right track or am I misunderstanding subprocesses+cores?

-Jack

The Custodian checks are very lightweight in terms of CPU. What's more likely causing the impact is I/O. If you're running the job on slow storage, VASP's runtime will be dominated by reading and writing to disk, and so will Custodian's checks (which re-read the output files). Since disk I/O isn't a parallelizable process, you'll notice a big impact.
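One quick way to test this theory is to compare write throughput on the network directory vs. node-local scratch. A rough sketch (the sizes and the network path are made up; point it at your own directories):

```python
import os
import time

def time_write(path, n_mb=256):
    """Time writing n_mb megabytes to `path` and report the throughput."""
    data = os.urandom(1024 * 1024)  # 1 MB of random bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_mb):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the data to actually hit the disk
    elapsed = time.perf_counter() - start
    print(f"{path}: {n_mb / elapsed:.1f} MB/s")
    os.remove(path)

time_write("/tmp/io_test.bin")              # node-local scratch
time_write("/path/to/network/io_test.bin")  # hypothetical network dir
```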


Ah gotcha, this makes a lot more sense. I am running within a network storage directory, so that’s probably it. Thank you! Pointing this out probably saved me a ton of time.