| Summary: | Using scontrol top for many jobs leads to held job state | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | HMS Research Computing <rc> |
| Component: | User Commands | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 16.05.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Harvard Medical School | Slinky Site: | --- |
| CLE Version: | --- | Version Fixed: | 17.02.2 |
| Attachments: | Prevent job priorities from being set to zero/held when "scontrol top" command run | ||
| | An additional patch for main scheduler loop running so long | ||
Description
HMS Research Computing
2017-03-31 12:21:53 MDT
This is a known limitation of the current 'scontrol top' implementation. Unfortunately, it wasn't designed for the use case you seem to have applied here: it was meant for an occasional override, not for constant usage. I'll note that this does not move jobs "to the top of the queue"; it works by setting the priority of all other jobs under your account lower than the "top" job. A side effect of the current implementation is that the priority of some of those jobs can be driven to zero, which is equivalent to an AdminHold internally. I'll look into patching that behavior, but I expect to have a more robust implementation in place for the 17.11 release in November. (Bug 3653 will track our progress on this, although we've been discussing it internally for some time before.)

I'd guess there wasn't a large enough range of job priorities to order all of his jobs. If his highest-priority job is, say, 100, there's no way to priority-order more than 99 of his jobs (priority zero is designed to hold jobs) without raising the priority of some of his jobs, which this command does not do. What it does do is alter each job's "nice" value in Slurm so as to:
1. not increase the priority of his highest-priority job, and
2. maintain an overall constant sum of job priorities for the user.

The job priority is a 32-bit field; I'd suggest that you increase the PriorityWeight factors in slurm.conf by several orders of magnitude to make use of more of that 32-bit range of priorities. Judging from your message, there also appears to be a bug in Slurm's priority calculations which fails to prevent jobs from going to priority zero (held).

Created attachment 4291 [details]
Prevent job priorities from being set to zero/held when "scontrol top" command run
I was able to reproduce the problem reported and the attached patch fixes it. I'd like to install this on your system Wednesday.
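The failure mode described in the earlier comment can be sketched in a few lines. This is a hypothetical illustration, not Slurm's actual implementation: when a user's priorities span only a narrow range, pushing every other job strictly below the "top" job runs out of room and clamps jobs at priority zero, which Slurm treats as held.

```python
def reorder_for_top(priorities: dict, top_id: str) -> dict:
    """Hypothetical sketch: give `top_id` the user's highest current
    priority and push every other job strictly below it, one step at a
    time, clamping at zero (priority 0 is a held job in Slurm)."""
    highest = max(priorities.values())
    new = {top_id: highest}
    step = highest
    for job in sorted(j for j in priorities if j != top_id):
        step = max(step - 1, 0)  # runs out of room below `highest`
        new[job] = step
    return new

# A user whose priorities span only 1..3 cannot order five jobs:
jobs = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 1}
new = reorder_for_top(jobs, "e")
held = [j for j, p in new.items() if p == 0]
print(new)   # {'e': 3, 'a': 2, 'b': 1, 'c': 0, 'd': 0}
print(held)  # ['c', 'd'] -- these jobs would now be held
```

This is why the suggested fix is to widen the priority range: with priorities spread across more of the 32-bit field, there is room to place every job below the top one without any of them reaching zero.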
Created attachment 4292 [details]
An additional patch for main scheduler loop running so long
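The PriorityWeight suggestion from the earlier comment might look like the following in slurm.conf. The values here are purely illustrative, not tuned recommendations; the point is only that larger weights spread job priorities across more of the 32-bit range, leaving room for "scontrol top" to order jobs without any of them hitting zero.

```
# slurm.conf -- illustrative values only; scale to your site's needs
PriorityType=priority/multifactor
PriorityWeightFairshare=1000000
PriorityWeightAge=100000
PriorityWeightJobSize=100000
PriorityWeightPartition=100000
PriorityWeightQOS=1000000
```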
(In reply to Moe Jette from comment #6)
> Created attachment 4292 [details]
> An additional patch for main scheduler loop running so long

Kathleen, I'm not sure this will fix the problem with the message timeouts. What would probably be best is if you can:
1. note when you next see a message timeout,
2. save the slurmctld log file around that time period,
3. run sdiag and save that output, and
4. open a new bug and attach the logs.

I'm particularly concerned about the main scheduling logic running for 15+ seconds at a time and don't see how that can happen.

Closing bug. The patch is available in Slurm version 17.02.2.