| Summary: | Batch job submission failed: Requested node configuration is not available | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | Scheduling | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.02.9 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DTU Physics | | |
| Attachments: | slurm.conf; slurmctld.log with debug flags (gzipped); topology.conf file | | |
Description
Ole.H.Nielsen@fysik.dtu.dk
2018-01-22 08:05:32 MST
--- Comment #1 (Marshall Garey) ---

That should work. I'm not able to reproduce this.

Can you post the output of `scontrol show reservations Test1`? Can you also upload your slurm.conf? Thanks.

--- Comment (Ole.H.Nielsen) ---

(In reply to Marshall Garey from comment #1)
> That should work. I'm not able to reproduce this.
>
> Can you post the output of scontrol show reservations Test1?

```
# scontrol show reservations Test1
ReservationName=Test1 StartTime=Mon 15:39:54 EndTime=Wed 15:39:54 Duration=30-00:00:00
   Nodes=g[079,083] NodeCnt=2 CoreCnt=32 Features=(null) PartitionName=xeon16 Flags=SPEC_NODES
   TRES=cpu=32
   Users=ohni,mikst Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
```

> Can you also upload your slurm.conf?

Will do.

Created attachment 5978 [details]
slurm.conf
--- Comment (Ole.H.Nielsen) ---

Another observation: if I submit this script (with the -N, -n and reservation lines removed), it submits correctly to the normal queue, but gets rejected when submitted to the reservation:

```
$ sbatch -N 2-2 -n 2 --reservation=Test1 lolcow.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 2-2 -n 2 lolcow.slurm
Submitted batch job 389825
```

Now I added a 3rd node to the reservation:

```
# scontrol update reserv=Test1 Nodes=g[021,079,083]
Reservation updated.
# scontrol show reservations Test1
ReservationName=Test1 StartTime=Mon 15:39:54 EndTime=Wed 15:39:54 Duration=30-00:00:00
   Nodes=g[021,079,083] NodeCnt=3 CoreCnt=48 Features=(null) PartitionName=xeon16 Flags=SPEC_NODES
   TRES=cpu=48
   Users=ohni,mikst Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
```

Lo and behold, my 2-node job can now be submitted successfully:

```
$ sbatch -N 2-2 -n 32 --reservation=Test1 lolcow.slurm
Submitted batch job 389828
```

But I can't submit the job to all 3 nodes:

```
$ sbatch -N 3-3 -n 48 --reservation=Test1 lolcow.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

So the error seems to be localized to the reservation, and to the batch job requesting the entire reservation.

--- Comment #6 (Marshall Garey) ---

I still can't find why. I'm still looking, but in the meantime can you set your slurmctld debug level to debug2 with

```
scontrol setdebug debug2
```

and turn on the reservation and selecttype debug flags:

```
scontrol setdebugflags +reservation
scontrol setdebugflags +selecttype
```

Then try to submit the batch job again. Can you upload the slurmctld log file after that? Then go ahead and turn off the flags and set the debug level back to whatever you want.

--- Comment (Ole.H.Nielsen) ---

(In reply to Marshall Garey from comment #6)
> I'm still looking, but in the meantime can you set your slurmctld debug
> level with scontrol setdebug debug2 and turn on the reservation and
> selecttype flags [...]

I ran these 3 commands, then tried to submit the job:

```
$ sbatch -N 3-3 -n 48 --reservation=Test1 lolcow.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

I reset the debug flags with "scontrol reconfigure". The slurmctld log file will be uploaded.

Created attachment 5981 [details]
slurmctld.log with debug flags (gzipped)
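When digging through a slurmctld log like the attached one for a single submission, it helps to pull out just the lines mentioning the job id. A minimal sketch (the sample lines mimic the log format quoted in this ticket; in practice the input would be the decompressed slurmctld.log):

```python
# Return only the log lines that mention a given job id.
def lines_for_job(lines, jobid):
    needle = str(jobid)
    return [line for line in lines if needle in line]

# Sample lines in the style of the slurmctld.log excerpts in this ticket.
log = [
    "[2018-01-23T08:59:36.354] debug: job 389877: best_fit topology failure",
    "[2018-01-23T08:59:36.354] _pick_best_nodes: job 389877 never runnable",
    "[2018-01-23T09:00:01.000] debug: job 389900: allocated resources",
]
for line in lines_for_job(log, 389877):
    print(line)
```

A plain `zgrep 389877 slurmctld.log.gz` achieves the same thing; the sketch just shows the idea in a scriptable form.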
--- Comment (Ole.H.Nielsen) ---

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #8)
> Created attachment 5981 [details]
> slurmctld.log with debug flags (gzipped)

FYI: the rejected job submission has Jobid 389877. It would be very useful if sbatch could be modified to print the Jobid also for jobs that are rejected.

Status mail from Slurm:

```
SLURM Job_id=389877 Name=lolcow.slurm Failed, Run time 00:00:00, FAILED
Job ID: 389877
Cluster: niflheim
User/Group: ohni/camdvip
State: FAILED (exit code 1)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:00 core-walltime
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 4.00 GB (4.00 GB/node)
```

--- Comment #10 (Marshall Garey) ---

I found this in your log file:

```
[2018-01-23T08:59:36.354] debug: job 389877: best_fit topology failure: no switch currently has sufficient resource to satisfy the request
[2018-01-23T08:59:36.354] cons_res: cr_job_test: test 0 fail: insufficient resources
[2018-01-23T08:59:36.354] _pick_best_nodes: job 389877 never runnable in partition xeon16
```

I suspect there's something going on with your topology. Can you share your topology.conf file? I'd like to mimic it and try to reproduce this bug.

--- Comment #11 (Ole.H.Nielsen) ---

(In reply to Marshall Garey from comment #10)
> I suspect there's something going on with your topology. Can you share your
> topology.conf file? I'd like to mimic it and try to reproduce this bug.

It's true that our compute node fabric is divided into disjoint islands corresponding to several different generations of node and network hardware (Intel Omni-Path, InfiniBand, and plain Gigabit Ethernet islands). I'll attach the topology.conf file.

I agree with your analysis: node g079 is connected to this fabric island:

```
SwitchName=volt01234 Switches=volt0[1-4]
```

while node g083 is connected to a disjoint island:

```
SwitchName=mell01 Nodes=g[081-110],h[001-002]
```

Please update the case as solved.

The reason this case had me confounded is the lack of helpful error messages from Slurm. If the "best_fit topology failure" message had been printed to stderr by sbatch, or even logged to slurmctld.log at the default debug level, I would have understood the error much sooner, and no support case would have been required.

Questions:

1. Can SchedMD modify the sbatch and srun commands so that error messages like the above are printed or logged by default in the future?
2. Also, sbatch really ought to print out the Jobid of failed jobs, making it more user friendly to search for the Jobid in log files.

Thanks for your support,
Ole

Created attachment 5985 [details]
topology.conf file
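The failure mode diagnosed above can be reproduced in miniature: with topology-aware selection, the best-fit pass looks for a switch that can reach every node the job needs, and a reservation spanning disjoint fabric islands can never satisfy that for a job requesting all of its nodes. A toy model follows — the switch membership mirrors the disjoint islands described in this ticket (node lists abbreviated), but the selection logic is a deliberate simplification of Slurm's actual cons_res/topology algorithm:

```python
# Toy model of a topology best-fit check: a job can only be placed if
# some single switch (island) reaches all of its requested nodes.
# Membership abbreviated from the topology.conf fragments above.
switches = {
    "volt01234": {"g021", "g079"},   # Omni-Path island (abbreviated)
    "mell01": {"g081", "g083"},      # InfiniBand island (abbreviated)
}

def placeable_on_one_switch(requested):
    """True if any single switch contains every requested node."""
    return any(requested <= members for members in switches.values())

# Two nodes within one island: placeable.
print(placeable_on_one_switch({"g081", "g083"}))  # → True
# The reservation's g[079,083] spans both islands: never runnable,
# matching the "best_fit topology failure" seen for job 389877.
print(placeable_on_one_switch({"g079", "g083"}))  # → False
```

This also explains the earlier observations: a 2-node job could fit once the reservation contained two nodes on the same island (g021 and g079), while any job requiring nodes from both islands was rejected at submission.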
--- Comment (Marshall Garey) ---

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #11)
> Questions:

Some discussion will probably need to happen internally on both of these. For now, I'm creating tickets to track that discussion.

> 1. Can SchedMD modify the sbatch and srun commands so that error messages
> like the above would be printed or logged by default in the future?

Bug 4687. I'll look into this. It might be the case that we simply say "increase the debug level temporarily", since the information is there. But there might be an elegant way to make the error message more helpful, since, as you point out, "Node configuration unavailable" isn't very helpful for knowing what is actually going on.

> 2. Also, sbatch really ought to print out the Jobid of failed jobs, making
> it more user friendly to search for the Jobid in log files.

Bug 4686. From what I understand, sbatch submissions that are immediately rejected don't even get a job record created. In this case, a job record was created and put in the database. It makes sense to me to print the job id because a job record was created.

Thanks. I'll close this as resolved.