While setting up our test cluster in preparation for migrating to 22.05.6, a node was inadvertently left out of the switch topology. When submitting a job that can only be satisfied by nodes that are not connected in the switch topology, the error returned to the user is "BAD CONSTRAINT". Digging deeper with increased logging on slurmctld, this is the only discernible message indicating an issue:

[2022-11-16T11:45:50.566] debug2: job_allocate: setting JobId=760 to "BadConstraints" due to a flaw in the job request (Requested node configuration is not available)

This message appears at the end of the log excerpt included below.

[2022-11-16T11:45:50.564] debug2: _build_node_list: JobId=760 matched 0 nodes (cn513-17-r) due to job partition or features
[2022-11-16T11:45:50.564] debug2: _build_node_list: JobId=760 matched 0 nodes (cn603-14-l) due to job partition or features
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu203-23-l
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu203-23-r
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu609-03
[2022-11-16T11:45:50.564] debug3: _pick_best_nodes: JobId=760 idle_nodes 4 share_nodes 5
[2022-11-16T11:45:50.564] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] _pick_best_nodes: JobId=760 never runnable in partition gpu
[2022-11-16T11:45:50.565] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2022-11-16T11:45:50.566] debug2: job_allocate: setting JobId=760 to "BadConstraints" due to a flaw in the job request (Requested node configuration is not available)
[2022-11-16T11:45:50.566] _slurm_rpc_allocate_resources: Requested node configuration is not available

The only place the issue is accurately identified is in a slurmd log file:

[2022-11-16T11:30:40.064] error: WARNING: switches lack access to 1 nodes: gpu203-23-l

You might ask why we didn't look at the topology when this message was displayed:

[2022-11-16T11:45:34.158] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.

This is because the cluster contains both Intel and AMD nodes, and the topology of these two node types is deliberately disjoint so that a job doesn't launch spanning two architecture types.

-greg
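For context, a deliberately disjoint tree topology like the one described above might look like the following topology.conf sketch. The switch and node names here are hypothetical, not taken from this cluster; the point is that the two trees share no common root, which is what produces the "_validate_switches" warning at startup, while any node missing from every Nodes= list produces the "switches lack access" error instead:

```
# Hypothetical topology.conf with two deliberately disjoint trees,
# so no single job can span the Intel and AMD partitions.
SwitchName=intel-top Switches=intel-leaf[1-2]
SwitchName=intel-leaf1 Nodes=intel[001-032]
SwitchName=intel-leaf2 Nodes=intel[033-064]
SwitchName=amd-top Switches=amd-leaf[1-2]
SwitchName=amd-leaf1 Nodes=amd[001-032]
SwitchName=amd-leaf2 Nodes=amd[033-064]
# A node defined in slurm.conf but absent from every Nodes= line here
# is what triggers "switches lack access to N nodes" in the slurmd log.
```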
Greg,

This does look like a duplicate of bug 4687. This is something we are looking at addressing in the 23.02 release. I will have Dominik confirm this is a duplicate before we change the status of this bug to reflect that.
Hi,

Yes, this is on the list of planned topology improvements targeted for 23.02. I'll go ahead and close this out as a duplicate of bug 4687.

Dominik

*** This ticket has been marked as a duplicate of ticket 4687 ***