ATAT automation on multiple nodes

freshwind · March 2, 2025, 6:54am

Dear Axel,

I have been trying to get ATAT to work on multiple nodes when invoking the ab initio code (in my case VASP). From what I can tell, one would have to write a batch script that is invoked every time the pollmach command is issued, so some modifications have to be made to the scripts.

I did not succeed in getting that up and running; I am not much of a scripting expert.
Every time I tried, the batchscript was submitted but as soon as that was done the temporary ‘slave’ directory was copied back to the local workstation.

That is why, up to now, I have settled for invoking the ab initio calculation by hand and extracting the information for maps/mmaps with extract_vasp.

Can you tell me how the scripts have to be modified?

Thanks, Siau

avdw · March 2, 2025, 7:05am

I am assuming your system uses a queuing system. The easiest way would be to submit:

a. one batch job running maps in the background and pollmach runstruct_vasp in the foreground,
b. other batch jobs running pollmach runstruct_vasp.

Delete or rename your .machines.rc so that pollmach runs in single-machine mode.

freshwind · March 2, 2025, 7:30am

I guess this solution might turn out to be inconvenient, since on our queuing system the maximum walltime of a batch script is 24 hours. So if maps is submitted as a batch job, the whole process would be terminated once the 24 hours are reached. That is why I tried to change the variable VASPCMD in the script ezvasp from the binary ‘vasp’ to something like ‘msub batchscript’, where the batch script specifies the execution of vasp on multiple nodes.
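
Editor's sketch: one possible reason the ‘slave’ directory was copied back at once is that a submit command typically returns as soon as the job is queued, so the calling script believes the calculation has already finished. Below is a minimal blocking wrapper, assuming the batch script creates a marker file as its very last step. The function name submit_and_wait, the marker name vasp.done and the poll interval are hypothetical and not part of ATAT.

```shell
#!/bin/bash
# Sketch: run a submit command, then block until a marker file appears.
# Assumption: the submitted batch script ends with `touch vasp.done`
# in the directory it was submitted from.
submit_and_wait() {
    local marker="$1"      # file the batch job creates when it has finished
    local interval="$2"    # seconds to sleep between checks
    shift 2
    rm -f "$marker"        # clear a stale marker left by a previous run
    "$@" || return 1       # the submit command, e.g. msub batchscript
    while [ ! -e "$marker" ]; do
        sleep "$interval"
    done
}
```

It would be called as `submit_and_wait vasp.done 60 msub batchscript` in place of a bare `msub batchscript`. Whether this fits into VASPCMD unchanged depends on how ezvasp invokes it, which is not checked here.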

Suddha · March 3, 2025, 6:19am

I am not sure if the solution given here is the same as what you want. I have created a machines file in the maps directory, something like
icn11
icn12
icn13
icn14

Here, icn11 means the 11th node with 12 processors. So, essentially I run the code over 48 processors. I have also created a vaspmpi.sh script which is something like

#!/bin/bash

source /opt/intel/composerxe/bin/compilervars.sh intel64
source /opt/intel/impi/4.0.2.003/intel64/bin/mpivars.sh
VASP_DIR=/home/appl/vasp-5.3/bin
mpirun -f machines -nolocal -n 2 $VASP_DIR/vasp-nc

It will access the machines file in the maps directory.

You can run the command

nohup runstruct_vasp ./vaspmpi.sh

and check nohup.out
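
Editor's sketch: the script above is shown with -n 2, while the machines file lists four nodes with 12 processors each. If the intent is to use every core, the rank count could be derived from the machines file. This assumes one host name per line and the same core count on every node; count_ranks is a hypothetical helper, not an ATAT or MPI command.

```shell
#!/bin/bash
# Sketch: total MPI ranks = number of hosts in the machines file * cores per node.
# Blank lines and lines starting with # are not counted as hosts.
count_ranks() {
    local machines_file="$1"
    local cores_per_node="$2"
    local nodes
    nodes=$(grep -c -v -E '^[[:space:]]*(#|$)' "$machines_file")
    echo $(( nodes * cores_per_node ))
}
```

In vaspmpi.sh the mpirun line would then read `mpirun -f machines -nolocal -n "$(count_ranks machines 12)" $VASP_DIR/vasp-nc`.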

Hope this can be the starting point…

Regards
Suddhasattwa Ghosh

freshwind · March 4, 2025, 12:27am

Is it possible to write the .machines.rc so that it specifies only the local machine (a cluster with a queueing system), so that pollmach runs only on that machine, polling up to 15 jobs?

Suddha · March 4, 2025, 12:37am

Hello Sir,

I have not quite used the pollmach command. But let me share my experience with you. I did configure the .machines.rc file, something like

.machines.rc file

#configuration file for chl, minload, and pollmach
#this line indicates the waiting time between the checks of machine availability
#in seconds
set waitbetweenpoll=60
#The remainder of this file lists the machines
#columns in this file are separated by a +
#each line corresponds to a machine
#the first column indicates the command to obtain the load on a remote machine
#the second column is the command prefix to launch a job on that remote machine
#note that the command must cd into the same directory on the remote machine
#as on the local machine. The ‘node’ command does that automatically.
#Type ‘node’ on the command line for more info.

#the first line sets the threshold load for not starting a job (here 6.5)
#do not remove the ‘none’ keyword
echo 6.5 + none
#list the machines here
#example on a local network (e.g. beowulf) with shared disk
#secure version of the above
ssh sghosh@icn11 uptime | getvalue average + node -s sghosh@icn11
ssh sghosh@icn12 uptime | getvalue average + node -s sghosh@icn12
ssh sghosh@icn13 uptime | getvalue average + node -s sghosh@icn13
ssh sghosh@icn14 uptime | getvalue average + node -s sghosh@icn14
#node -s sghosh@icn11
#node -s sghosh@icn12
#node -s sghosh@icn13
#node -s sghosh@icn14

Now, essentially this means the jobs run over the icn11, icn12, icn13 and icn14 nodes (with 12 processors each). That is, each job can run over 48 processors, depending on the load.
I tried to optimize the ‘echo 6.5 + none’ line by assuming that a maximum of 4 jobs (runstruct_vasp in 4 directories) can run over 48 processors.

You may have to tune this line according to your needs. I am not sure if I am speaking right. Dr. Axel can correct me if I am wrong…
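
Editor's sketch: the machine lines in the file above all follow one pattern, so they can be generated from a host list instead of being typed by hand. The line layout is copied from the post. The function name make_machines_rc is hypothetical, and the threshold is whatever value you settle on after tuning.

```shell
#!/bin/bash
# Sketch: print a .machines.rc in the layout shown above.
# Arguments: load threshold, remote user name, then one or more host names.
make_machines_rc() {
    local threshold="$1"
    local user="$2"
    shift 2
    echo "set waitbetweenpoll=60"
    # first machine line: threshold load above which no job is started
    echo "echo ${threshold} + none"
    local host
    for host in "$@"; do
        # column 1: command reporting the load; column 2: prefix that launches the job
        echo "ssh ${user}@${host} uptime | getvalue average + node -s ${user}@${host}"
    done
}
```

For the setup above the call would be `make_machines_rc 6.5 sghosh icn11 icn12 icn13 icn14`, with the output redirected into your .machines.rc.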

Since you want a maximum of 15 jobs to be run:

a. increase the number of cores,
b. tune the ‘echo 6.5 + none’ line.

Then you can invoke pollmach and run

pollmach runstruct_vasp …

I hope this helps…
Dr. Axel, please let me know if I am right here…

Thank you
Regards
Suddhasattwa

avdw · March 4, 2025, 11:58pm

Thanks for your help, Suddhasattwa!
Your answer seems correct and helpful.

terencelz · March 18, 2025, 1:30am

This looks like a thread from more than a year ago. Axel, would you prefer that people ask relevant/similar questions in the same thread, even if the thread is a year old, or that they open a new one?

My confusion here is about the assumption of the setup. The original poster freshwind has a machine that runs with a queuing system, which means everything is (wall-)timed and requires a job submission script.

Is the setup proposed by Suddha:

a. applicable to this very machine, somehow still submitting a script for every single vasp run,
b. or still applicable to this very machine, but the script is to be submitted as a whole (when requesting a lot of nodes), and allocate the resources to run more than one runstruct_vasp copy at the same time
c. or he is actually sshing into other machines to run vasp, using the current workstation with the queuing system as a local machine?

I guess I’m not familiar with supercomputers without queuing systems, and assume the multi-machine mode is designed for cases like that, when every remote supercomputer does not have a queuing system.

avdw · March 18, 2025, 1:31am

Pooling the topics as you did is more useful. (Time doesn’t matter, people typically search for topics).

The example by Suddha is mostly for a machine where you have control over all the nodes (and don’t need a submission script). It could be adapted for machines with a queueing system where you submit the whole maps run as a single job (the ssh commands would then have to be based on the nodes your job is running on).

Machines with a queueing system are the most common. Personally, what I do is to have a job submission script that runs

maps -d &
pollmach runstruct_vasp

and not use a .machines.rc file at all. The parallelism occurs within the vasp job.
If I want multiple vasp jobs to run simultaneously, then I submit additional jobs with the script:


pollmach runstruct_vasp
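
Editor's sketch: the two job scripts described above differ only in whether maps is started, so they can share one body. The function name run_atat_job and its main/worker argument are hypothetical. Scheduler directives (queue, walltime, node count) are site-specific and left out.

```shell
#!/bin/bash
# Sketch of a shared job-script body for the two cases described above.
#   main   : start maps in the background, then pollmach in the foreground
#   worker : pollmach only, to run one more vasp job at the same time
run_atat_job() {
    local role="${1:-main}"
    if [ "$role" = "main" ]; then
        maps -d &              # structure generation runs in the background
    fi
    # with no .machines.rc present, pollmach runs in single-machine mode
    pollmach runstruct_vasp
}
```

The first submitted job would call `run_atat_job main` and each additional job `run_atat_job worker`, all from the same maps directory.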

terencelz · March 18, 2025, 1:32am

The way you pointed out is the way I’ve been doing it. I didn’t find that in the manual but ruled out the other (obvious) ways. I think there is a growing user base of queuing systems among vasp users, so maybe a reminder in the next version would be helpful to beginners.
It seems that you spent quite some effort on developing the multi-machine mode. Where are these typical supercomputer infrastructure setups? Do they belong to 10 years ago, or they belong to some specific fields of study?

avdw · March 18, 2025, 1:33am

Good idea! An update has now been posted.

It was more common 10 years ago, but there are still a lot of small research groups with small clusters that don’t want to run a queueing system. You can also use the multi-machine mode within a job script, with some more effort.
(In any case, this piece of code was a very small part of the whole ATAT effort!)

oupengfei1989 · March 20, 2025, 5:47am

Dear Suddha,

It seems that your solution works. After I set the parameters for our machines, it automatically logs in to the other nodes, so now we may be able to run the calculations on multiple nodes.

But I have run into another problem: when I use ATAT in this parallel mode, VASP does not run on the structure automatically, although it works when running on a single node.

cp: cannot stat ‘OSZICAR’: No such file or directory
cp: cannot stat ‘OUTCAR’: No such file or directory
cp: cannot stat ‘CONTCAR’: No such file or directory
cp: cannot stat ‘CONTCAR’: No such file or directory
cp: cannot stat ‘OSZICAR’: No such file or directory
cp: cannot stat ‘OUTCAR’: No such file or directory
cp: cannot stat ‘CONTCAR’: No such file or directory
cp: cannot stat ‘DOSCAR’: No such file or directory
unable to open OSZICAR or OSZICAR.static

Do you have any suggestions for this?

Best

avdw · March 20, 2025, 6:34pm

It looks like your job script (or the queuing system) does not cd to the right directory on the slave process/nodes. Or perhaps you are running multiple copies of runstruct_vasp (e.g. mpirun runstruct_vasp, which would be wrong)?
Add a few pwd and ls commands to your scripts at various points to debug this.
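
Editor's sketch: the debugging advice above can be wrapped in a small helper that records the host, the working directory and its contents each time it is called. The name debug_here and the default log path are hypothetical. Where to call it (job script, MPI wrapper script, etc.) depends on your setup.

```shell
#!/bin/bash
# Sketch: record where a script is actually running and which files it sees.
debug_here() {
    local label="$1"                        # free-text tag for the call site
    local log="${2:-$HOME/atat_debug.log}"  # hypothetical default log location
    {
        echo "=== ${label} ==="
        echo "host: $(uname -n)"
        echo "pwd:  $(pwd)"
        ls -la
    } >> "$log" 2>&1
}
```

For example, put `debug_here "before vasp"` just before the mpirun line of a wrapper script and `debug_here "after vasp"` just after it. If the logged directory is not the structure directory, or the VASP input files are missing from the listing, that points to the cd problem described above.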

avdw · March 29, 2025, 9:42pm

Jintao_Wang wrote:
I submit my script on a machine with a queueing system, like this (and there is no .machines file):
maps -d &
pollmach runstruct_vasp
Then, because I want multiple vasp jobs to run simultaneously, I submit another job with the script:
pollmach runstruct_vasp

BUT, unfortunately, there is a message saying:
"pollmach is already running. Aborting.
To override this behavior, type rm pollmach_is_running or use -f option"

Should I add the -f option to the second script? Or is what I did wrong?

Yes!

Jintao_Wang · March 30, 2025, 4:50am

I submit my script on a machine with a queueing system, like this (and there is no .machines file):

maps -d &
pollmach runstruct_vasp

Then, because I want multiple vasp jobs to run simultaneously, I submit another job with the script:


pollmach runstruct_vasp

BUT, unfortunately, there is a message saying:
"pollmach is already running. Aborting.
To override this behavior, type rm pollmach_is_running or use -f option"

Should I add the -f option to the second script? Or is what I did wrong?

mimisee · March 30, 2025, 4:51am

Dear Dr. Axel,

Thanks for sharing the ATAT code for free. I’m new to this code. I have the same problem as the one posted here and I can’t fix it, so I am describing it below in the hope that you can help me solve it.
I have a cluster with two nodes and no queueing system: node01 (master machine) and node02 (remote machine). I can run ATAT perfectly on the master machine. But after I configure the .machines.rc file as:


set waitbetweenpoll=60
echo 0.5 + none
ssh customer@node01 uptime | getvalue average + node -s customer@node01
ssh customer@node02 uptime | getvalue average + node -s customer@node02

I run the code as:

pollmach runstruct_vasp

or pollmach runstruct_vasp mpirun -np 96

then I get the following error message:
cp: cannot stat ‘OSZICAR’: No such file or directory
cp: cannot stat ‘OUTCAR’: No such file or directory
cp: cannot stat ‘CONTCAR’: No such file or directory
cp: cannot stat ‘CONTCAR’: No such file or directory
cp: cannot stat ‘OSZICAR’: No such file or directory
cp: cannot stat ‘OUTCAR’: No such file or directory
cp: cannot stat ‘CONTCAR’: No such file or directory
cp: cannot stat ‘DOSCAR’: No such file or directory
unable to open OSZICAR or OSZICAR.static.
If I delete the .machines.rc file, ATAT runs perfectly again, but only on the master machine.
This problem drives me crazy. Kindly help me please.
Thanks in advance.
ZX