While setting up our test cluster in preparation for migrating to 22.05.6, a node was inadvertently left out of the switch topology. When submitting a job that can only be satisfied by nodes that are not connected in the switch topology, the error returned to the user is "BAD CONSTRAINT". Digging deeper with increased logging on slurmctld, this is the only discernible message indicating an issue:

[2022-11-16T11:45:50.566] debug2: job_allocate: setting JobId=760 to "BadConstraints" due to a flaw in the job request (Requested node configuration is not available)

This message appears at the end of the log excerpt included below.

[2022-11-16T11:45:50.564] debug2: _build_node_list: JobId=760 matched 0 nodes (cn513-17-r) due to job partition or features
[2022-11-16T11:45:50.564] debug2: _build_node_list: JobId=760 matched 0 nodes (cn603-14-l) due to job partition or features
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu203-23-l
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu203-23-r
[2022-11-16T11:45:50.564] debug2: found 1 usable nodes from config containing gpu609-03
[2022-11-16T11:45:50.564] debug3: _pick_best_nodes: JobId=760 idle_nodes 4 share_nodes 5
[2022-11-16T11:45:50.564] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] debug2: select/cons_tres: select_p_job_test: evaluating JobId=760
[2022-11-16T11:45:50.565] _pick_best_nodes: JobId=760 never runnable in partition gpu
[2022-11-16T11:45:50.565] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2022-11-16T11:45:50.566] debug2: job_allocate: setting JobId=760 to "BadConstraints" due to a flaw in the job request (Requested node configuration is not available)
[2022-11-16T11:45:50.566] _slurm_rpc_allocate_resources: Requested node configuration is not available

The only place the issue is accurately identified is in a slurmd log file:

[2022-11-16T11:30:40.064] error: WARNING: switches lack access to 1 nodes: gpu203-23-l

You might ask why we didn't look at the topology when this message was displayed:

[2022-11-16T11:45:34.158] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.

This is because the cluster contains both Intel and AMD nodes, and the topology of these two node types is deliberately disjoint so that a job doesn't launch spanning two architecture types.

-greg
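For context, a deliberately disjoint tree topology like the one described above might look like the following topology.conf sketch. The switch and node names here are hypothetical, not taken from this cluster; the point is that the two trees share no common root, which is what produces the "_validate_switches" warning at startup, while any node missing from every Nodes= list produces the "switches lack access" error instead:

```
# Hypothetical topology.conf with two deliberately disjoint trees,
# so no single job can span the Intel and AMD partitions.
SwitchName=intel-top Switches=intel-leaf[1-2]
SwitchName=intel-leaf1 Nodes=intel[001-032]
SwitchName=intel-leaf2 Nodes=intel[033-064]
SwitchName=amd-top Switches=amd-leaf[1-2]
SwitchName=amd-leaf1 Nodes=amd[001-032]
SwitchName=amd-leaf2 Nodes=amd[033-064]
# A node defined in slurm.conf but absent from every Nodes= line here
# is what triggers "switches lack access to N nodes" in the slurmd log.
```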
Greg,

This does look like a duplicate of bug 4687. This is something we are looking at addressing in the 23.02 release. I will have Dominik confirm this is a duplicate before we change the status of this bug to reflect that.
Hi,

Yes, this is on the list of planned topology improvements targeted for 23.02. I'll go ahead and close this out as a duplicate of bug 4687.

Dominik

*** This ticket has been marked as a duplicate of ticket 4687 ***