Hello Team,

HPC users have reported a new issue when launching Slurm jobs from within other Slurm jobs:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x200000000000000000000000.
srun: error: Task launch for StepId=11225441.0 failed on node cpu-403: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Here's a minimal reproduction. Our application launches (potentially nested) jobs with similar commands, though from within Python code:

## FILE: outer.sh
#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
sbatch --array=0-3 inner.sh

## FILE: inner.sh
#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
srun sleep 10

## Commands to run:
chmod +x outer.sh inner.sh
sbatch outer.sh

The jobs run without the error if we add --cpu-bind=quiet or --cpu-bind=none to the srun command, but according to the Slurm docs the default value should already be one of those two options. Has the global Slurm configuration changed?

Thanks,
Radek
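To read the "allocated CPUs" value in the error above, it can help to translate the hexadecimal mask into CPU indices. The following is a minimal sketch, assuming the usual convention that bit N of the mask corresponds to CPU N; the function name mask_to_cpus is hypothetical and is not part of Slurm. It walks the mask one hex digit at a time, so masks wider than the shell's integer size are handled.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm tool): print the indices of the bits
# that are set in a hexadecimal CPU mask, assuming bit N means CPU N.
mask_to_cpus() {
    mask=${1#0x}                  # drop a leading "0x" if present
    cpus=""
    base=0                        # CPU index of bit 0 of the current digit
    while [ -n "$mask" ]; do
        rest=${mask%?}            # everything except the last hex digit
        digit=${mask#"$rest"}     # the last (least significant) hex digit
        mask=$rest
        value=$((0x$digit))
        bit=0
        while [ "$bit" -lt 4 ]; do
            if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
                cpus="$cpus $((base + bit))"
            fi
            bit=$((bit + 1))
        done
        base=$((base + 4))
    done
    echo $cpus                    # unquoted on purpose: trims the leading space
}
```

For example, `mask_to_cpus 0x5` prints `0 2`. Passing the mask from the srun error message shows which CPU the failing step was actually allocated.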
Could you please attach your slurm.conf?
Created attachment 26640: slurm.conf
Radek,

I am not able to reproduce the issue with your example. The jobs are not nested: sbatch submits a separate new job even when it is run from inside another job.

The error you are seeing comes from mask or map CPU binding. If the environment variable SLURM_CPU_BIND is set, it can lead to this error even without --cpu-bind on the command line. That would make sense, because --cpu-bind=none or --cpu-bind=quiet would override SLURM_CPU_BIND.

See the documentation: https://slurm.schedmd.com/srun.html#OPT_cpu-bind

-Scott
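One way to check the SLURM_CPU_BIND hypothesis is to print the variable inside the job script just before the srun line. This is a minimal sketch, not a confirmed diagnosis for this cluster; the function name check_cpu_bind is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether SLURM_CPU_BIND is present in the
# environment that srun would inherit.
check_cpu_bind() {
    if [ -n "${SLURM_CPU_BIND+x}" ]; then
        echo "SLURM_CPU_BIND is set: ${SLURM_CPU_BIND}"
    else
        echo "SLURM_CPU_BIND is not set"
    fi
}

check_cpu_bind
```

If the variable turns out to be set, placing `unset SLURM_CPU_BIND` before the srun line would be an alternative to passing --cpu-bind=none, on the assumption that the inherited value is what triggers the error.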
Hi Scott,

I forgot to add srun to the outer.sh file. The original version is:

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
srun sbatch --array=0-3 inner.sh

As you can see, there's srun and then sbatch. The user says this worked in the previous version of Slurm and has now stopped working. Once he gets rid of srun, it works fine. Even though it looks odd, could you please advise?

Thanks,
Radek
Radek,

srun sbatch doesn't make sense and shouldn't make a difference. sbatch will still launch a new, non-nested job. Here srun launches a step in the first job only to submit a new job to slurmctld, which serves no purpose. In my testing I don't see any difference.

-Scott
Hi Scott,

I know, and this is exactly what we told the user. Once he modified the script, everything seems to work without any errors. I think we can close the ticket.

Thanks,
Radek
Closing ticket