Ticket 14902

Summary: Unable to satisfy cpu bind request when launching nested jobs
Product: Slurm Reporter: GSK-ONYX-SLURM <slurm-support>
Component: Scheduling Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: GSK Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: CentOS
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description GSK-ONYX-SLURM 2022-09-07 00:45:56 MDT
Hello Team,

HPC users reported a new issue when launching Slurm jobs from within other Slurm jobs:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x200000000000000000000000.
srun: error: Task launch for StepId=11225441.0 failed on node cpu-403: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Here's a minimal reproduction. Our application launches (potentially nested) jobs with similar commands, though from within Python code:

## FILE: outer.sh

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
sbatch --array=0-3 inner.sh

## FILE: inner.sh

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
srun sleep 10

## Commands to run:

chmod +x outer.sh inner.sh
sbatch outer.sh

The jobs run without the error if we add --cpu-bind=quiet or --cpu-bind=none to the srun command, but according to the Slurm docs the default should already be one of those two options.
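For reference, a sketch of inner.sh with the workaround applied (the --cpu-bind option is the only change; this is a batch-script fragment that assumes a Slurm cluster, so it is not runnable standalone):

```shell
#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
# Workaround: explicitly disable CPU binding for the step so any
# binding request inherited from the environment is ignored.
srun --cpu-bind=none sleep 10
```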

Has the global Slurm configuration changed?

Thanks,
Radek
Comment 1 Marcin Stolarek 2022-09-07 03:27:30 MDT
Could you please attach your slurm.conf?
Comment 2 GSK-ONYX-SLURM 2022-09-07 05:05:39 MDT
Created attachment 26640 [details]
slurm.conf
Comment 3 Scott Hilton 2022-09-08 13:33:58 MDT
Radek,

I am not able to reproduce the issue with your example.

The jobs are not actually nested: sbatch submits a separate, new job even when invoked from inside another job.

The error you are seeing comes from mask or map CPU binding. If the environment variable SLURM_CPU_BIND is set, it can trigger this error even without --cpu-bind on the command line. That would also explain your workaround, since --cpu-bind=none or --cpu-bind=quiet overrides SLURM_CPU_BIND.
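One quick way to check this (a sketch in plain POSIX shell; the variable name SLURM_CPU_BIND is real, but whether it is actually set in your jobs is the assumption being tested) is to print and then clear the variable inside the batch script before calling srun:

```shell
#!/bin/sh
# Sketch: check whether SLURM_CPU_BIND leaked into this job's environment,
# then clear it so the step falls back to the default binding behaviour.
echo "SLURM_CPU_BIND=${SLURM_CPU_BIND:-<unset>}"
unset SLURM_CPU_BIND
echo "after unset: SLURM_CPU_BIND=${SLURM_CPU_BIND:-<unset>}"
```

If the first line prints a mask or map value you did not set yourself, the variable is being inherited from the submitting environment.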

See the documentation https://slurm.schedmd.com/srun.html#OPT_cpu-bind

-Scott
Comment 4 GSK-ONYX-SLURM 2022-09-09 03:26:57 MDT
Hi Scott,

I forgot to add srun to the outer.sh file. The original version is:

#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
srun sbatch --array=0-3 inner.sh

As you can see, there is srun and then sbatch. The user says this worked in the previous version of Slurm and has now stopped working. Once he removes srun, it works fine.

Even though it looks odd, could you please advise here?

Thanks,
Radek
Comment 5 Scott Hilton 2022-09-09 10:42:31 MDT
Radek,

srun sbatch doesn't make sense here, and it shouldn't make a difference: sbatch still launches a new, not nested, job.

In this script, srun launches a step in the first job only to submit a new job to slurmctld; there is no reason to do that.

Testing it, I don't see any difference.
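A sketch of outer.sh with the srun prefix dropped (this matches the first reproduction script in the description; as a batch-script fragment it assumes a Slurm cluster and is not runnable standalone):

```shell
#!/bin/sh
#SBATCH --mem=8192
#SBATCH --time=120
#SBATCH --cpus-per-task=1
# sbatch talks to slurmctld directly; wrapping it in srun only
# creates a job step for the submission and serves no purpose.
sbatch --array=0-3 inner.sh
```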

-Scott
Comment 6 GSK-ONYX-SLURM 2022-09-12 00:37:03 MDT
Hi Scott,

I know, and this is exactly what we told the user. Once he modified the script, everything works without errors.

I think we can close the ticket.

Thanks,
Radek
Comment 7 Scott Hilton 2022-09-12 09:40:24 MDT
Closing ticket