Hi all,
On our new cluster, each node has two 64-core AMD EPYC CPUs, but running VASP on all 128 cores is not efficient for my problems. 64 cores work much better, and the calculations usually even finish faster than when using all cores.
The ideal solution (I have enough memory) would be to run two calculations per node, e.g.
rlaunch multi 2
However, I run my calculations through custodian and noticed that a lot of Fireworks fizzle without running through all 5 error correction steps. They just get killed without custodian ever logging anything in the execution folder.
I am pretty sure this happens because the other job on the node triggers a custodian correction (e.g. lowering SIGMA), which kills all VASP processes on the node, not only the one that should be killed. While the corrected job usually completes fine, the other one cannot recover.
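To make clear what I would expect instead, here is a minimal sketch of a kill that is scoped to a single job. It is not custodian's actual code, and I have not confirmed how custodian terminates VASP. The helper names `start_job` and `kill_job` are hypothetical, and the two `sleep` processes are only stand-ins for two VASP runs sharing a node. The idea is to start each job in its own process group and signal only that group, rather than matching processes by name. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def start_job(cmd):
    """Start a command in its own session, so it leads a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_job(proc, timeout=5.0):
    """Kill only the process group led by `proc`; other jobs are untouched."""
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # the job has already exited and been reaped
    proc.wait(timeout=timeout)


if __name__ == "__main__":
    # Two placeholder "jobs" standing in for two VASP runs on one node.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    job_a = start_job(sleeper)
    job_b = start_job(sleeper)

    kill_job(job_a)  # simulate a correction that restarts job A only
    print("job A exit code:", job_a.returncode)
    print("job B still running:", job_b.poll() is None)

    kill_job(job_b)  # clean up the second placeholder job
```

With a kill scoped like this, a correction on one calculation should leave the second one on the node running. A kill that matches on the executable name would take down both, which is what I seem to be observing.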
Now, I am not sure whether anyone has run into this problem before, since I guess rlaunch multi is more suitable for smaller jobs that do not run through custodian, but I would be interested to know whether this is intended/expected behaviour or could be considered a bug.
I am also not sure whether this would be better posted in the custodian section of this forum. I will see what the response is here and then maybe post again there.
Thanks in any case! Michael