Hey Nick!
Thanks for the quick response! And this is good to know. I have to admit, I’m not 100% convinced for my specific use-case, so I’m actually going to do more testing on this and will report back later today. Also, let me know if I’m just speaking nonsense below – I’m not great at parallel processes and may be off/confused haha.
I seeing a significant slowdown (>30%) when the VASP jobs are small (e.g. less than 10 min) and Custodian is told to check for errors frequently (e.g. every 15s). After some more reading and testing, it’s my understanding that multiple processes are being spun up for each thing we are running. The main process is Python/Custodian → then subprocess spins up a secondary process for mpirun → and mpirun spins up “-n” processes for VASP.
[this is where I’m unsure and could use clarification] So having more processes than cores is fine when the majority of them are sleeping most of the time (like Python+Custodian typically does). But when all processes are active (i.e. VASP is running AND Custodian is always checking status), then process share cores and I think the sharing of cores slows down VASP significantly. A simple solution is to decrease my Custodian frequency, but I’m going to see if the number of cores available has the same effect.
Does it sound like I’m on the right track or am I misunderstanding subprocesses+cores?
-Jack