| Summary: | srun: error: Unable to allocate resources: Requested node configuration is not available | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Wayfinder Infrastructure Support <infrastructure-support.wayfinder> |
| Component: | Scheduling | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | wfr | Slinky Site: | --- |
| Attachments: | slurm.conf, gres.conf, slurmd2, slurmctld | | |
Created attachment 27241 [details]
gres.conf

Please raise the log level and duplicate the issue, then lower the log level. Compress the logs and attach them to this issue.

> scontrol setdebug debug

Reproduce the error.

> scontrol setdebug info

Attach the logs. Please also attach the compressed slurmd.log from ng-201-21.

Created attachment 27243 [details]
slurmd2
Hi,
I increased the log level, but the file does not look like it changed from the first one.
Deric
Hi Deric,
Thanks for the provided data.
> I increased the log level, but the file does not look like it changed from
> the first one.
This is expected if you were only looking at the slurmd logs: the log level changes Jason suggested only affect slurmctld.log, which is the one we expect to give us more information about what is going on.
Could I ask you to repeat the test, this time also enabling the "SelectType" debug flag? To do so, you should:
(enable logging)

> scontrol setdebug debug
> scontrol setdebugflags +selecttype

Run the failing job.

(then to return to normal logging)

> scontrol setdebug info
> scontrol setdebugflags -selecttype
You will then need to attach the relevant part of the slurmctld.log (the timeframe in which the job was submitted). It would also be great if you could give us the JobID of the failed job, so that we can correctly identify it in the logs (the "sacct" command should give you that information).
Just another couple of questions: was this working before? Does it run successfully if you execute "srun --nodelist=ng-201-21 hostname"?
Kind regards,
Oscar
Oscar,
This was working before.
This was working before. I ran it twice and noticed that sacct reports the "volta" partition. This node should be in the quadro partition.
    (base) deric.chau.ctr@ng-201-21:~$ sacct
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
          149497   hostname      volta                     1     FAILED      1:0
          149498   hostname      volta                     1     FAILED      1:0
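The partition column in that sacct output is the key symptom. As a small offline sketch, the sample lines pasted above can be parsed to pull out the JobID/partition pairs; on a live cluster, `sacct --format=JobID,Partition` (a standard sacct option) would give the same two columns directly:

```shell
# Parse the sacct output pasted in this ticket (default columns:
# JobID JobName Partition Account AllocCPUS State ExitCode; the
# Account column is empty here, so Partition is field 3).
sacct_out='      149497   hostname      volta                     1     FAILED      1:0
      149498   hostname      volta                     1     FAILED      1:0'

# Print JobID and the partition each job was routed to.
printf '%s\n' "$sacct_out" | awk '{print $1, $3}'
# → 149497 volta
# → 149498 volta
```

Both failed jobs were routed to "volta", confirming that the submission never reached the quadro partition.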
Attaching slurmctld
Created attachment 27252 [details]
slurmctld
Hi Deric,

Thanks for the test and the logs. Looking at your configuration and the output you are getting, I would say this is the expected behavior. Have you recently changed the default partition in the cluster?

What is happening is that Slurm is trying to launch the job in the default partition, which is currently "volta":

    PartitionName=volta Default=YES MinNodes=1 ...

But the node you are asking for is part of the "quadro" partition, so Slurm is letting you know that the requested resources (node ng-201-21) are not available in partition "volta". The logs confirm that:

    [2022-10-13T14:45:44.663] No nodes satisfy requirements for JobId=149497 in partition volta
    [2022-10-13T14:45:44.663] _slurm_rpc_allocate_resources: Requested node configuration is not available

This should work:

    srun -p quadro --nodelist=ng-201-21

A partition should always be specified when submitting a job; if it is not, Slurm will try to allocate the job in the default partition (it will not auto-select a partition based on the node requested).

Let me know if that helps and makes sense for your situation.

Kind regards,
Oscar

Resolved. Thank you
Created attachment 27240 [details]
slurm.conf

Hello,

I'm having trouble getting a bash prompt on any of my nodes in the quadro partition.

    srun --nodelist=ng-201-21 --pty bash
    srun: error: Unable to allocate resources: Requested node configuration is not available

    root@nm-203-19:/etc/slurm# scontrol show node ng-201-21
    NodeName=ng-201-21 Arch=x86_64 CoresPerSocket=20
       CPUAlloc=0 CPUTot=80 CPULoad=3.04
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=gpu:quadrortx8000:3,gpu:quadrortx8000:3
       NodeAddr=ng-201-21 NodeHostName=ng-201-21 Version=18.08
       OS=Linux 4.18.0-18-generic #19~18.04.1-Ubuntu SMP Fri Apr 5 10:22:13 UTC 2019
       RealMemory=772676 AllocMem=0 FreeMem=746675 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=201458 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=quadro
       BootTime=2022-09-30T19:15:11 SlurmdStartTime=2022-09-30T19:16:04
       CfgTRES=cpu=80,mem=772676M,billing=80
       AllocTRES=
       CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
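The `Partitions=quadro` field in the `scontrol show node` output above is the detail that resolves the ticket. As a minimal sketch of how to read a node's partition off before submitting (the `node_info` text below is a shortened copy of the output in this ticket so the snippet runs offline; on the cluster you would capture it with `node_info=$(scontrol show node ng-201-21)` instead):

```shell
# Abridged `scontrol show node ng-201-21` output from this ticket;
# on a live cluster, capture the real thing with:
#   node_info=$(scontrol show node ng-201-21)
node_info='NodeName=ng-201-21 Arch=x86_64 CoresPerSocket=20
   Partitions=quadro
   State=IDLE'

# Split on whitespace and pull the value of the Partitions= field.
partition=$(printf '%s\n' "$node_info" | tr ' ' '\n' | sed -n 's/^Partitions=//p')

# Build the submission command with the partition stated explicitly.
echo "srun -p $partition --nodelist=ng-201-21 hostname"
# → srun -p quadro --nodelist=ng-201-21 hostname
```

Passing `-p`/`--partition` explicitly avoids the mismatch entirely, since Slurm otherwise routes the job to the default partition regardless of which node `--nodelist` names.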