Packing jobs on NERSC

Greetings!

I am trying to pack FireWorks jobs with rlaunch multi on NERSC Cori, following the instructions at https://materialsproject.github.io/fireworks/multi_job.html. Each firework includes several TemplateWriterTasks and one final ScriptTask that runs a LAMMPS simulation. I used the rlaunch multi <NP/PPJOB> option described on that page. The goal is to pack many small fireworks into a single multi-node job to reduce the queue time at NERSC.

My problem is that only one LAMMPS simulation runs at a time. The TemplateWriterTasks run in parallel, but the final ScriptTasks seem to block each other. Each ScriptTask currently has the form srun -n 68 -c 1 --cpu-bind=cores /global/common/cori_cle7/software/lammps/2018.12.12/knl/lmp_cori < in.lmp. My guess is that the srun commands are blocking each other, but I am not sure.

Do you have any suggestions on how to solve the problem? Thanks a lot!

Tian

Hi Tian,

A primary issue with job packing is that srun is very fickle and makes a lot of assumptions about what the user wants. So you need to be explicit in the srun command and specify exactly the resources you need, i.e., the number of nodes, the number of MPI tasks, and the number of “cores” per task.

In your case, if each LAMMPS simulation uses just one node, adding -N 1 to your command might fix your issue.
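
As a rough illustration of being explicit, here is a minimal Python sketch that assembles the srun command string with the node count, task count, and cores per task all spelled out. The helper name build_srun_command is hypothetical and not part of FireWorks or Slurm; the values -n 68 and -c 1 and the LAMMPS path are taken from the original command, and -N 1 assumes one node per simulation.

```python
import shlex


def build_srun_command(nodes, ntasks, cpus_per_task, executable, input_file):
    """Build an srun command line that states every resource explicitly.

    Hypothetical helper for illustration only. It uses the standard srun
    options -N (nodes), -n (tasks), -c (CPUs per task) and --cpu-bind.
    """
    args = [
        "srun",
        "-N", str(nodes),          # number of nodes for this job step
        "-n", str(ntasks),         # number of MPI tasks
        "-c", str(cpus_per_task),  # CPUs ("cores") per task
        "--cpu-bind=cores",
        executable,
    ]
    # The input file is fed to LAMMPS on stdin, as in the original command.
    return shlex.join(args) + " < " + shlex.quote(input_file)


# Assumption: one full node per LAMMPS simulation.
cmd = build_srun_command(
    nodes=1,
    ntasks=68,
    cpus_per_task=1,
    executable="/global/common/cori_cle7/software/lammps/2018.12.12/knl/lmp_cori",
    input_file="in.lmp",
)
print(cmd)
```

The resulting string is what you would put in the ScriptTask in place of the current command. Keeping the three resource values as named arguments makes it harder to leave one out by accident.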

Best,
Alex

Thank you Alex. I’ve tested the modification and it indeed fixed the issue!

Ideally, I would also like to run multiple jobs on a single node. I found that adding -N 1 only works for jobs that use the entire node. Are you aware of any options to achieve this?

Hi @txie,

The nodes on NERSC have very strict resource limits and srun obeys them. So you have to figure out how to get it to allocate just the right amount of resources for what you’re doing. You should definitely contact NERSC about this; they can help you tune your srun command to pack multiple jobs onto one node.