Walltime error before walltime reached?

Hello all,

I’m pretty stumped by this error. Jobs run perfectly fine for the first 5-6 hours, but then stop abruptly. I know this isn’t a VASP issue, because VASP runs fine when I submit it directly to the SLURM scheduler. The error below from the FW_job.error output seems to indicate that custodian believes the wall time has been reached, but the submit script created by atomate shows that the time limit should be 48 hours, nowhere near the 5 hours the job ran for. I considered that the problem might be the “RuntimeWarning: divide by zero…” warnings that come after the wall time error, but I can find no documentation of this elsewhere and have no idea why they would arise many hours after the job began running. My submission script seems to be perfectly in order.

Note: I’ve included my std_err file to show what it says just in case, but I don’t believe the “huge pages” warning has anything to do with this, as I see it show up in the jobs I submit to SLURM manually and it doesn’t affect them.

Does anyone have any ideas about what this problem could be?

For reference, this is my first time running atomate on NERSC, after having run it on other shared resources.

Best,

Nick Winner

FW_job.error

/global/homes/n/nwinner/.conda/envs/matsci/lib/python3.6/site-packages/pymatgen/io/cif.py:37: UserWarning: Please install optional dependency pybtex if youwant to extract references from CIF files.

warnings.warn("Please install optional dependency pybtex if you"

{'actions': None,
 'errors': ['Walltime reached'],
 'handler': <custodian.vasp.handlers.WalltimeHandler object at 0x2aaacaa29208>}

vasp_std: no process found

Unrecoverable error for handler: <custodian.vasp.handlers.WalltimeHandler object at 0x2aaacaa29208>

/global/homes/n/nwinner/.conda/envs/matsci/lib/python3.6/site-packages/pymatgen/core/lattice.py:1094: RuntimeWarning: divide by zero encountered in sqrt

return np.sqrt(d2)

/global/homes/n/nwinner/.conda/envs/matsci/lib/python3.6/site-packages/pymatgen/core/lattice.py:1094: RuntimeWarning: invalid value encountered in sqrt

return np.sqrt(d2)

FW_submit.script

#!/bin/bash -l

#SBATCH --nodes=4

#SBATCH --time=48:00:00

#SBATCH --partition=regular

#SBATCH --account=m1090

#SBATCH --job-name=FW_job

#SBATCH --output=FW_job-%j.out

#SBATCH --error=FW_job-%j.error

#SBATCH --constraint=knl

module load python/3.6-anaconda-4.4

source activate matsci

module load vasp/20170629-knl

export OMP_PROC_BIND=true

export OMP_PLACES=threads

export OMP_NUM_THREADS=4

cd /global/cscratch1/sd/nwinner/flibe/solutes/Cr/0

rlaunch -c /global/u1/n/nwinner/config/atomate rapidfire

# CommonAdapter (SLURM) completed writing Template

std_err.txt


libhugetlbfs [nid07183:185095]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07182:184826]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07188:183785]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07183:185096]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07188:183786]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07183:185097]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07182:184827]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07188:183787]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid06827:183854]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07183:185098]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07182:184828]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07188:183788]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid07182:184829]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid06827:183855]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid06827:183856]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

libhugetlbfs [nid06827:183853]: WARNING: Maximum number of huge page sizes exceeded, ignoring 8388608kB page size

PROFILE, used timers: 236

Hi Nick,

It’s hard to tell what’s going on, since you haven’t provided any details of which Firework you’re running or what wall time you have actually set. Note that the wall time is not read from the submit script; it is set in other ways.

For example, for MDFW the wall time is set to ~5 hours, since that is simply the default for the Firework (it’s questionable whether this should be hard-coded as an optional argument like that; it’s bound to cause confusion):


====

class MDFW(Firework):
    def __init__(self, structure, start_temp, end_temp, nsteps,
                 name="molecular dynamics",
                 vasp_input_set=None, vasp_cmd="vasp",
                 override_default_vasp_params=None,
                 wall_time=19200, db_file=None, parents=None,
                 copy_vasp_outputs=True, **kwargs):

====

If you are using MDFW, that is likely the source of the problem.
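
In that case, a minimal sketch of one fix (the import paths and MD parameters below are illustrative, not taken from your workflow) is to pass an explicit wall_time, in seconds, that matches your actual SLURM request instead of relying on the 19200 s default:

====

from pymatgen.core import Structure

from atomate.vasp.fireworks.core import MDFW  # assumed location of MDFW

# Placeholder structure and MD settings; only wall_time matters for this example.
structure = Structure.from_file("POSCAR")

md_fw = MDFW(structure, start_temp=300, end_temp=300, nsteps=2000,
             wall_time=48 * 3600)  # seconds, matching --time=48:00:00

====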

The other Fireworks coded into atomate don’t even activate the WalltimeHandler by default, so as far as I can tell you must be activating it yourself with a given wall time.

Note that WalltimeHandler can read the wall time from the environment variables PBS_WALLTIME or SBATCH_TIMELIMIT. If either of those is set, you could set wall_time to None in MDFW.
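
A minimal sketch of that wall_time=None route, with nothing beyond what the note above says: with no explicit wall time, the handler falls back to the scheduler environment.

====

from custodian.vasp.handlers import WalltimeHandler

# With wall_time=None the handler looks for PBS_WALLTIME or SBATCH_TIMELIMIT in
# the environment; if neither is set, there is no wall time for it to enforce.
handler = WalltimeHandler(wall_time=None)

====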

Let me know if this was your problem. If so, I’ll think about reworking MDFW to avoid this confusion.

