Ticket 14902

Summary: Unable to satisfy cpu bind request when launching nested jobs
Product: Slurm Reporter: GSK-ONYX-SLURM <slurm-support>
Component: Scheduling Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: GSK Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: CentOS
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description GSK-ONYX-SLURM 2022-09-07 00:45:56 MDT
Hello Team,

HPC users reported a new issue when launching Slurm jobs from within other Slurm jobs:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x200000000000000000000000.
srun: error: Task launch for StepId=11225441.0 failed on node cpu-403: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Here's a minimal reproduction. Our application launches (potentially nested) jobs with similar commands, though from within Python code:

## FILE: outer.sh

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
sbatch --array=0-3 inner.sh

## FILE: inner.sh

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
srun sleep 10

## Commands to run:

chmod +x outer.sh inner.sh
sbatch outer.sh

The jobs run without the error if we add --cpu-bind=quiet or --cpu-bind=none to the srun command, but according to the Slurm docs the default should already be one of those two options.
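For reference, a sketch of inner.sh with the workaround applied (the --cpu-bind option is the only change; this is a batch-script fragment that assumes a Slurm cluster, so it is not runnable standalone):

```shell
#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
# Workaround: explicitly disable CPU binding for the step so any
# binding request inherited from the environment is ignored.
srun --cpu-bind=none sleep 10
```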

Has the global Slurm configuration changed?

Thanks,
Radek
Comment 1 Marcin Stolarek 2022-09-07 03:27:30 MDT
Could you please attach your slurm.conf?
Comment 2 GSK-ONYX-SLURM 2022-09-07 05:05:39 MDT
Created attachment 26640 [details]
slurm.conf
Comment 3 Scott Hilton 2022-09-08 13:33:58 MDT
Radek,

I am not able to reproduce the issue with your example.

The jobs are not actually nested: sbatch submits a separate, new job even when invoked from inside another job.

The error you are seeing comes from mask or map CPU binding. If the environment variable SLURM_CPU_BIND is set, it can trigger this error even without --cpu-bind on the command line. That would also explain your workaround, since --cpu-bind=none or --cpu-bind=quiet overrides SLURM_CPU_BIND.
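One quick way to check this (a sketch in plain POSIX shell; the variable name SLURM_CPU_BIND is real, but whether it is actually set in your jobs is the assumption being tested) is to print and then clear the variable inside the batch script before calling srun:

```shell
#!/bin/sh
# Sketch: check whether SLURM_CPU_BIND leaked into this job's environment,
# then clear it so the step falls back to the default binding behaviour.
echo "SLURM_CPU_BIND=${SLURM_CPU_BIND:-<unset>}"
unset SLURM_CPU_BIND
echo "after unset: SLURM_CPU_BIND=${SLURM_CPU_BIND:-<unset>}"
```

If the first line prints a mask or map value you did not set yourself, the variable is being inherited from the submitting environment.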

See the documentation https://slurm.schedmd.com/srun.html#OPT_cpu-bind

-Scott
Comment 4 GSK-ONYX-SLURM 2022-09-09 03:26:57 MDT
Hi Scott,

I forgot to add srun to the outer.sh file. The original version is:

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
srun sbatch --array=0-3 inner.sh

As you can see, there is srun and then sbatch. The user says this worked in the previous version of Slurm and has now stopped working. Once he removes srun, it works fine.

Even though it looks odd, could you please advise here?

Thanks,
Radek
Comment 5 Scott Hilton 2022-09-09 10:42:31 MDT
Radek,

srun sbatch doesn't make sense here, and it shouldn't make a difference: sbatch still launches a new, not nested, job.

In this script, srun launches a step in the first job only to submit a new job to slurmctld; there is no reason to do that.

Testing it, I don't see any difference.
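A sketch of outer.sh with the srun prefix dropped (this matches the first reproduction script in the description; as a batch-script fragment it assumes a Slurm cluster and is not runnable standalone):

```shell
#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
# sbatch talks to slurmctld directly; wrapping it in srun only
# creates a job step for the submission and serves no purpose.
sbatch --array=0-3 inner.sh
```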

-Scott
Comment 6 GSK-ONYX-SLURM 2022-09-12 00:37:03 MDT
Hi Scott,

I know, and this is exactly what we told the user. Once he modified the script, everything works without errors.

I think we can close the ticket.

Thanks,
Radek
Comment 7 Scott Hilton 2022-09-12 09:40:24 MDT
Closing ticket