| Summary: | Recommendation for sbatch job config | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Shaheer KM <shaheer> |
| Component: | Configuration | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | chad |
| Version: | 23.02.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Cerebras | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Shaheer KM
2022-06-20 09:31:43 MDT
Comment 1
Chad Vizino

(In reply to Shaheer KM from comment #0)

Hi. This is possible using sbatch, but sbatch will not block and wait at the command prompt for the execution to finish the way srun will if you run it on the command line. Maybe you already know this and are OK with that behavior; I just wanted to point it out.

If I understand your use case correctly, you should be able to run the srun within a batch job like this (or put the srun in a job file and pass the file name to sbatch):

>sbatch -N1 --ntasks-per-node=1 --exclusive : -N3 --ntasks-per-node=8 --cpus-per-task=16 --exclusive --wrap="srun --unbuffered --kill-on-bad-exit --nodes=1 --ntasks-per-node=1 --exclusive : --nodes=3 --ntasks-per-node=8 --cpus-per-task=16 --exclusive python run.py -p configs/params.yaml"

As for starting het jobs, all components will start together once resources are available for all of them. However, certain factors can influence the component jobs' position in the scheduling queue, which can affect when the overall het job starts. A couple of them are explained here:

>https://slurm.schedmd.com/slurm.conf.html#OPT_bf_hetjob_immediate
>https://slurm.schedmd.com/slurm.conf.html#OPT_bf_hetjob_prio=[min|avg|max]

Take a look and let me know if you have more questions or if I misinterpreted your question.

Comment 2
Shaheer KM

Thanks for the input. I tried this, and it does not seem to work the way our application needs, it looks like. Our application relies on Slurm environment variables to decide roles for each task that gets spun up as part of the job, and with an sbatch heterogeneous job this does not seem to work.

Our application looks at SLURM_JOB_NODELIST, and with an sbatch het job not all nodes are listed under this env var.
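The two backfill options linked above are set via SchedulerParameters in slurm.conf. An illustrative fragment (values are examples, not a recommendation for this site):

```
# slurm.conf fragment (illustrative values only):
# bf_hetjob_immediate - attempt to start a het job as soon as all of its
#                       components are determined able to start
# bf_hetjob_prio=min  - consider the het job at the minimum priority of
#                       its components in the backfill queue
SchedulerParameters=bf_hetjob_immediate,bf_hetjob_prio=min
```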
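The suggestion above also mentions putting the srun in a job file and passing the file name to sbatch. A minimal sketch of that variant, assuming Slurm 20.11 or later, where het components in a batch script are separated by a "#SBATCH hetjob" line (older releases used "#SBATCH packjob"):

```shell
# Sketch only: the --wrap command above rewritten as a job file.
# Assumes Slurm 20.11+ ("#SBATCH hetjob" separates het components).
cat > het_job.sh <<'EOF'
#!/bin/bash
#SBATCH -N1 --ntasks-per-node=1 --exclusive
#SBATCH hetjob
#SBATCH -N3 --ntasks-per-node=8 --cpus-per-task=16 --exclusive
srun --unbuffered --kill-on-bad-exit \
     --nodes=1 --ntasks-per-node=1 --exclusive : \
     --nodes=3 --ntasks-per-node=8 --cpus-per-task=16 --exclusive \
     python run.py -p configs/params.yaml
EOF
bash -n het_job.sh && echo "het_job.sh parses"
```

Submitting is then just "sbatch het_job.sh"; as noted above, sbatch still returns immediately rather than blocking the way srun does.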
Comment 3
Chad Vizino

(In reply to Shaheer KM from comment #2)
> Our application looks at SLURM_JOB_NODELIST, and with an sbatch het job not
> all nodes are listed under this env var.

That env var should still be available both in the job script for sbatch and in the environment of srun and what it starts. But there is another one that holds the node list for each het group (component). So you might also look at using SLURM_PROCID and SLURM_JOB_NODELIST_HET_GROUP_<N>, where <N> is the value of SLURM_PROCID; that variable holds the node list for the (N+1)th het component job. Example:

>$ cat het_example
>#!/bin/bash
>#set -x
>tmp=SLURM_JOB_NODELIST_HET_GROUP_$SLURM_PROCID
>nodelist=${!tmp}
>echo "$SLURM_PROCID ($SLURMD_NODENAME): $tmp=$nodelist"
>
>$ sbatch : --wrap="srun : /tmp/het_example"
>Submitted batch job 164527
>
>$ cat slurm-164527.out
>1 (mackinac-2): SLURM_JOB_NODELIST_HET_GROUP_1=mackinac-2
>0 (mackinac-1): SLURM_JOB_NODELIST_HET_GROUP_0=mackinac-1

Does this help you?
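The het_example script above hinges on bash indirect expansion: it builds the variable name from SLURM_PROCID and then reads that variable's value with ${!name}. A standalone illustration of just that mechanism, with the SLURM_* values faked so it runs outside a job:

```shell
# Bash indirect-expansion demo; the SLURM_* values below are faked
# so this snippet runs outside of Slurm.
export SLURM_PROCID=1
export SLURM_JOB_NODELIST_HET_GROUP_0="mackinac-1"
export SLURM_JOB_NODELIST_HET_GROUP_1="mackinac-2"

tmp=SLURM_JOB_NODELIST_HET_GROUP_$SLURM_PROCID   # builds the variable *name*
nodelist=${!tmp}                                 # indirect expansion: reads that name's *value*
echo "$tmp=$nodelist"                            # prints SLURM_JOB_NODELIST_HET_GROUP_1=mackinac-2
```

Inside a real het job, each component's task picks out its own node list this way without any hard-coded group number.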
Comment 4
Chad Vizino

Hi. I'll plan to close this issue shortly unless you have further questions--feel free to ask.

Comment 5
Chad Vizino

Closing for now. Feel free to reopen if you have more questions.