rlaunch multi and custodian seem to clash if 2 or more VASP jobs run on a single node

Hi all,

On our new cluster, each node has two 64-core AMD EPYC CPUs, but it is not efficient for me to run VASP on 128 cores; 64 cores work much better for my problems and usually finish faster than using all cores.
The ideal solution (I have enough memory) would be to run two calculations per node, e.g.
rlaunch multi 2
However, I run my calculations through custodian and noticed that a lot of Fireworks fizzle without running through all 5 error correction steps. They just get killed without custodian ever logging anything in the execution folder.

I am pretty sure that this is due to the other job on the node making a custodian correction (e.g. lowering SIGMA), which actually kills all VASP jobs on the node, not only the one that should be killed. While the corrected job usually completes fine, the other one cannot recover.

Now, I am not sure if anyone has ever run into this problem, since I guess rlaunch multi is more suitable for smaller jobs that do not run through custodian, but I would be interested to know whether this is intended/expected behaviour or could be considered a bug.

I am also not sure whether this would be better posted in the custodian section of this forum; I guess I will see what the response is here and then maybe post again there.

Thanks in any case! Michael


OK, I did what I should have done previously and looked into the custodian code.

The way a VaspJob is killed is via the terminate method, which calls killall on the vasp_cmd, so it is clear why all VASP jobs on the node get killed:

    def terminate(self):
        """
        Ensure all vasp jobs are killed.
        """
        for k in self.vasp_cmd:
            if "vasp" in k:
                try:
                    os.system("killall %s" % k)
                except Exception:
                    pass

Maybe an alternative would be to use the .kill() method of the subprocess.Popen that is used to run VASP, but might that also kill the custodian run itself? I would appreciate some help here, but I would be willing to make the necessary code changes myself. Maybe @Shyue_Ping_Ong can weigh in on this?
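
To make the idea concrete, here is a minimal sketch of what I have in mind, not custodian's actual implementation: run() keeps the Popen handle on the job (self._process and the class name are my own, illustrative names), starts VASP in its own process group, and terminate() then signals only that group instead of killall-ing every VASP process on the node.

    import os
    import signal
    import subprocess
    
    class VaspJobSketch:
        """Illustration only: keep the Popen handle and kill just this job's process group."""
    
        def __init__(self, vasp_cmd, output_file="vasp.out", stderr_file="std_err.txt"):
            self.vasp_cmd = vasp_cmd          # e.g. ["mpirun", "-np", "64", "vasp_std"]
            self.output_file = output_file
            self.stderr_file = stderr_file
            self._process = None              # hypothetical attribute, not in custodian
    
        def run(self):
            f_std = open(self.output_file, "w")
            f_err = open(self.stderr_file, "w", buffering=1)
            # start_new_session=True puts mpirun/VASP into its own process group,
            # so a later signal reaches only this job's process tree
            self._process = subprocess.Popen(
                self.vasp_cmd, stdout=f_std, stderr=f_err, start_new_session=True
            )
            return self._process
    
        def terminate(self):
            # signal only this job's process group instead of every "vasp*" on the node
            if self._process is not None and self._process.poll() is None:
                os.killpg(os.getpgid(self._process.pid), signal.SIGTERM)

With something like this, the other VASP job on the node would never see the signal, since killpg only reaches the process tree started by this particular job.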
Thanks!

Has to be considered a bug IMO. Independent jobs should not kill each other, no matter whether they run on the same or on different nodes.

I made a pull request that hopefully fixes the issue.