Ticket 3650 - Using scontrol top for many jobs leads to held job state
Summary: Using scontrol top for many jobs leads to held job state
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 16.05.4
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-03-31 12:21 MDT by HMS Research Computing
Modified: 2017-04-20 09:51 MDT

See Also:
Site: Harvard Medical School
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.02.2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Prevent job priorities from being set to zero/held when "scontrol top" command run (1.03 KB, patch)
2017-04-04 18:26 MDT, Moe Jette
Details | Diff
An additional patch for main scheduler loop running so long (1.03 KB, patch)
2017-04-04 18:29 MDT, Moe Jette
Details | Diff

Description HMS Research Computing 2017-03-31 12:21:53 MDT
Hello,

One of our users reported that using scontrol top does not function as expected for moving many (~4,000) jobs to the top of the queue. He was able to change the priority of the first few jobs as expected, but then received error messages like this: "Job is in held state, pending scheduler release for job XYZ."

The jobs that were not moved to the top of the queue went into JobHeldAdmin state and could not be released by the user with scontrol:

scontrol release 589820
Access/permission denied for job 589820
slurm_suspend error: Access/permission denied

To alleviate this issue, he came up with a complex workaround that checks that the last job sent to the top of the queue has run before moving on to the next one.

As this is not an optimal solution, are there any suggested methods for moving thousands of jobs to the top of a queue?

Thanks!
Kathleen
Comment 1 Tim Wickberg 2017-03-31 13:22:59 MDT
This is a known limitation of the current 'scontrol top' implementation. Unfortunately, it wasn't designed for the use case you seem to have here: it was meant for an occasional override, not for constant usage.

I'll note that this does not literally move jobs "to the top of the queue"; it works by setting the priority of all other jobs under your account lower than the "top" job. A side effect of the current implementation is that the priorities of those demoted jobs can be driven to zero, which internally is equivalent to an AdminHold. I'll look into patching that behavior, but I expect to have a more robust implementation in place for the 17.11 release in November. (Bug 3653 will track our progress on this, although we had been discussing it internally for some time before.)
Comment 2 Moe Jette 2017-03-31 14:04:19 MDT
I'd guess there wasn't a large enough range of job priorities to order all of his jobs. If his highest-priority job has a priority of, say, 100, there is no way to strictly order more than 99 of his jobs below it (priority zero is reserved for holding jobs) without raising the priority of some of his jobs, which this command does not do. What it does do is alter the jobs' "nice" values in Slurm so as to
1. Not increase the priority of his highest-priority job
2. Maintain an overall constant sum of job priorities for the user
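As a toy illustration of that arithmetic (not Slurm's actual code; the function name and clipping model are invented for this sketch): with a top priority of 100, only 99 distinct positive values remain below it, so any additional jobs clip to priority zero, which Slurm treats as held.

```python
def assign_descending(top_priority, n_jobs):
    """Toy model, not Slurm source: order n_jobs strictly below a fixed
    top priority without raising any priority. Each job needs a distinct
    value top_priority, top_priority - 1, ...; anything that would fall
    below 1 clips to 0, and priority 0 means the job is held."""
    priorities = [max(top_priority - i, 0) for i in range(n_jobs)]
    held = sum(1 for p in priorities if p == 0)
    return priorities, held

# With top priority 100 and 150 jobs, the last 50 jobs land at zero (held);
# with 100 jobs or fewer, every job keeps a positive priority.
```

This is why widening the usable priority range (below) gives the command room to work.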

The job priority is a 32-bit field; I'd suggest that you increase the PriorityWeight factors in slurm.conf by several orders of magnitude to make use of more of that 32-bit range of priorities.
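For example, the multifactor priority weights in slurm.conf could be scaled up along these lines (the specific values are illustrative, not a recommendation for any particular site):

```ini
# slurm.conf fragment (illustrative values): larger weights spread job
# priorities across more of the 32-bit range, leaving room for
# "scontrol top" to reorder many jobs without any hitting zero.
PriorityType=priority/multifactor
PriorityWeightAge=100000
PriorityWeightFairshare=1000000
PriorityWeightJobSize=100000
PriorityWeightPartition=100000
PriorityWeightQOS=1000000
```

Changing these weights rescales every pending job's computed priority, so expect queue ordering to shift when the new values take effect.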

Judging from your message, there also appears to be a bug in Slurm's priority calculations that fails to prevent jobs from going to priority zero (held).
Comment 5 Moe Jette 2017-04-04 18:26:05 MDT
Created attachment 4291 [details]
Prevent job priorities from being set to zero/held when "scontrol top" command run

I was able to reproduce the problem reported and the attached patch fixes it. I'd like to install this on your system Wednesday.
Comment 6 Moe Jette 2017-04-04 18:29:36 MDT
Created attachment 4292 [details]
An additional patch for main scheduler loop running so long
Comment 23 Moe Jette 2017-04-07 08:07:23 MDT
(In reply to Moe Jette from comment #6)
> Created attachment 4292 [details]
> An additional patch for main scheduler loop running so long

Kathleen, I'm not sure this will fix the problem with the message timeouts. What would probably be best is if you can:

1. Note when you next see a message timeout
2. Save the slurmctld log file from around that time period
3. Run sdiag and save that output
4. Open a new bug and attach the logs

I'm particularly concerned about the main scheduling logic running for 15+ seconds at a time, and I don't see how that can happen.
Comment 25 Moe Jette 2017-04-20 09:51:29 MDT
Closing bug. The patch is available in Slurm version 17.02.2.