Ticket 15446

Summary: When nodes cannot satisfy a job due to topology the reported error is "BAD CONSTRAINT"
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: Scheduling Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: marshall
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4687
Site: KAUST Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Greg Wickham 2022-11-17 07:14:50 MST
While setting up our test cluster in preparation for migrating to 22.05.6, a node was inadvertently left out of the switch topology.

When submitting a job that could only be satisfied by nodes that are not connected in the switch topology, the error returned to the user is "BAD CONSTRAINT".

Digging deeper, after increasing the logging level on slurmctld, this is the only discernible message indicating an issue:

[2022-11-16T11:45:50.566] debug2: job_allocate: setting JobId=760 to "BadConstraints" due to a flaw in the job request (Requested node configuration is not available)

This message appears at the end of the messages included below.

[2022-11-16T11:45:50.564] debug2: _build_node_list: JobId=760 matched 0 nodes (cn513-17-r) due to job partition or features
[2022-11-16T11:45:50.564] debug2: _build_node_list: JobId=760 matched 0 nodes (cn603-14-l) due to job partition or features
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu203-23-l
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu203-23-r
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu609-03
[2022-11-16T11:45:50.564] debug3: _pick_best_nodes: JobId=760 idle_nodes 4 share_nodes 5
[2022-11-16T11:45:50.564] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] _pick_best_nodes: JobId=760 never runnable in partition gpu
[2022-11-16T11:45:50.565] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2022-11-16T11:45:50.566] debug2: job_allocate: setting JobId=760 to "BadConstraints" due to a flaw in the job request (Requested node configuration is not available)
[2022-11-16T11:45:50.566] _slurm_rpc_allocate_resources: Requested node configuration is not available 


The only place the issue is accurately identified is in a slurmd log file:

[2022-11-16T11:30:40.064] error: WARNING: switches lack access to 1 nodes: gpu203-23-l

BTW, you might ask why we didn't look at the topology when this message was displayed:

[2022-11-16T11:45:34.158] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.

This is because there are both Intel and AMD nodes in the cluster, and the topology of these two node types is deliberately disjoint so that a job doesn't launch spanning the two architectures.
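For context, a minimal topology.conf of this kind of setup might look like the sketch below. The switch names are illustrative and the node names are taken loosely from the logs above; this is not the actual site configuration. The two trees are deliberately disjoint, but one node has been accidentally omitted from every switch line:

```
# topology.conf -- illustrative sketch, not the actual site configuration.
# Two deliberately disjoint switch trees, one per CPU architecture, so a
# single job never spans both.
SwitchName=intel-root Switches=intel-leaf1,intel-leaf2
SwitchName=intel-leaf1 Nodes=cn513-17-l,cn513-17-r
SwitchName=intel-leaf2 Nodes=cn603-14-l,cn603-14-r
SwitchName=amd-root Switches=amd-leaf1
SwitchName=amd-leaf1 Nodes=gpu203-23-r,gpu609-03
# gpu203-23-l appears on no switch line above -- an omission like this is
# what produced the "BAD CONSTRAINT" error described in this ticket.
```

With topology/tree enabled, `scontrol show topology` prints the switch hierarchy as slurmctld sees it, which can help spot a node missing from every switch.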

   -greg
Comment 2 Jason Booth 2022-11-17 10:32:00 MST
Greg, this does look like a duplicate of bug 4687. This is something we are looking at adding in the 23.02 release. I will have Dominik confirm it is a duplicate before we change the status of this ticket to reflect that.
Comment 3 Dominik Bartkiewicz 2022-11-18 04:18:39 MST
Hi

Yes, this is on the list of planned improvements around topology targeted for 23.02.
I'll go ahead and close this out as a duplicate of bug 4687.

Dominik

*** This ticket has been marked as a duplicate of ticket 4687 ***