When running `srun -n 1 hostname` on a 16.05.0 installation we get:

    $ srun -n 1 hostname
    srun: error: Unable to allocate resources: Node count specification invalid

In the controller log we find:

    [2016-07-28T21:00:51.335] Set debug level to 9
    [2016-07-28T21:00:58.338] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=10034
    [2016-07-28T21:00:58.338] debug3: JobDesc: user_id=10034 job_id=N/A partition=(null) name=hostname
    [2016-07-28T21:00:58.338] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
    [2016-07-28T21:00:58.338] debug3:    Nodes=1-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
    [2016-07-28T21:00:58.338] debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
    [2016-07-28T21:00:58.338] debug3:    immediate=0 features=(null) reservation=(null)
    [2016-07-28T21:00:58.338] debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
    [2016-07-28T21:00:58.338] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
    [2016-07-28T21:00:58.338] debug3:    kill_on_node_fail=-1 script=(null)
    [2016-07-28T21:00:58.338] debug3:    argv="hostname"
    [2016-07-28T21:00:58.338] debug3:    stdin=(null) stdout=(null) stderr=(null)
    [2016-07-28T21:00:58.338] debug3:    work_dir=/homeb/zam/kraused alloc_node:sid=j3m01:20976
    [2016-07-28T21:00:58.338] debug3:    power_flags=
    [2016-07-28T21:00:58.338] debug3:    resp_host=192.168.12.10 alloc_resp_port=59111 other_port=41534
    [2016-07-28T21:00:58.338] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
    [2016-07-28T21:00:58.338] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=1 open_mode=0 overcommit=-1 acctg_freq=(null)
    [2016-07-28T21:00:58.339] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
    [2016-07-28T21:00:58.339] debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
    [2016-07-28T21:00:58.339] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
    [2016-07-28T21:00:58.339] debug3:    mem_bind=65534:(null) plane_size:65534
    [2016-07-28T21:00:58.339] debug3:    array_inx=(null)
    [2016-07-28T21:00:58.339] debug3:    burst_buffer=(null)
    [2016-07-28T21:00:58.339] debug3:    mcs_label=(null)
    [2016-07-28T21:00:58.339] debug3:    deadline=Unknown
    [2016-07-28T21:00:58.339] debug3:    bitflags=0
    [2016-07-28T21:00:58.339] debug3: found correct user
    [2016-07-28T21:00:58.339] debug3: found correct association
    [2016-07-28T21:00:58.339] debug3: found correct qos
    [2016-07-28T21:00:58.339] _part_access_check: Job requested for nodes (4294967294) greater than partition psslurm(2) max nodes

The failing _part_access_check appears not to handle the case job_desc->max_nodes == NO_VAL as expected. The patch below (on top of 16.05.3) has not yet been tested on the affected system (we will do so as soon as possible) but should address the problem:

    diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
    index cbf669f..0de3254 100644
    --- a/src/slurmctld/job_mgr.c
    +++ b/src/slurmctld/job_mgr.c
    @@ -5206,7 +5206,7 @@ static int _part_access_check(struct part_record *part_ptr,
     	}
     
     	if ((part_ptr->state_up & PARTITION_SCHED) &&
    -	    (job_desc->min_nodes != NO_VAL) &&
    +	    (job_desc->max_nodes != NO_VAL) &&
     	    (job_desc->max_nodes > max_nodes_tmp)) {
     		info("_part_access_check: Job requested for nodes (%u) "
     		     "greater than partition %s(%u) max nodes",
I can now confirm that the mentioned patch does fix the problem when applied to 16.05.3.
Thank you for the patch - we'd managed to stumble onto this internally around the same time while reviewing defects reported by Coverity. As you noticed, min_nodes should have been max_nodes. Fixed with commit f89369c389, which will be in 16.05.4.