We are still trying to troubleshoot the scheduling issue here: low-priority jobs aren't being considered even though the job size is small. One of the jobs we tried to re-nice also produced an error:

# grep 11349763 slurmctld.log
[2019-09-23T15:20:38.702] Recovered JobId=11349763 Assoc=2955
[2019-09-23T16:37:53.716] Recovered JobId=11349763 Assoc=2955
[2019-09-23T18:28:12.116] ignore nice set request on JobId=11349763
[2019-09-23T18:28:12.116] _slurm_rpc_update_job: complete JobId=11349763 uid=0 usec=149744
[2019-09-23T18:29:37.447] ignore nice set request on JobId=11349763
[2019-09-23T18:29:37.448] _slurm_rpc_update_job: complete JobId=11349763 uid=0 usec=3447
[2019-09-23T18:30:27.861] ignore nice set request on JobId=11349763
[2019-09-23T18:30:27.861] _slurm_rpc_update_job: complete JobId=11349763 uid=0 usec=224725

18:19:43 m3-login2:~ ctan$ sjob 11349763
JobId=11349763 JobName=MyJob
   UserId=abut0011(12029) GroupId=monashuniversity(10025) MCS_label=N/A
   Priority=18000 Nice=0 Account=of33 QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-09-19T15:00:44 EligibleTime=2019-09-19T15:00:44
   AccrueTime=2019-09-19T15:00:44
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-09-23T18:34:22
   Partition=comp AllocNode:Sid=m3-login1:15075
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/of33/Alana/sbatch_script_test.sh
   WorkDir=/scratch/of33/Alana
   StdErr=/scratch/of33/Alana/slurm-11349763.out
   StdIn=/dev/null
   StdOut=/scratch/of33/Alana/slurm-11349763.out
   Power=

18:42:05 m3-login2:~ ctan$ sudo `which scontrol` update JobID=11349763 Nice=-10000
18:42:08 m3-login2:~ ctan$ echo $?
0

The command completes successfully, but the job's priority is not updated. Any ideas?
The error message "ignore nice set request on JobId" occurs when the job's priority has already been manually set, either at submission or via scontrol. In other words, if the job's priority was already adjusted by an administrator then adjusting the "nice" value will have no effect and is ignored. Do you know if this job's priority was manually adjusted?
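In case it helps, here is a minimal sketch for checking those fields. It parses a sample captured from the sjob output above (on a live system you would pipe `scontrol show job 11349763` instead); the hold/release step at the end is the usual way to let the scheduler recompute a priority that was set explicitly, assuming the multifactor priority plugin is in use.

```shell
# Sketch: inspect the Priority and Nice fields of a pending job.
# The heredoc-style sample below is copied from this ticket so the
# snippet is self-contained; replace it with the live command:
#   scontrol show job 11349763
sample='JobId=11349763 JobName=MyJob
   Priority=18000 Nice=0 Account=of33 QOS=normal'

printf '%s\n' "$sample" | grep -o 'Priority=[0-9]*'   # -> Priority=18000
printf '%s\n' "$sample" | grep -o 'Nice=-\?[0-9]*'    # -> Nice=0

# If the priority was assigned explicitly by an administrator, nice
# changes are ignored. Holding and then releasing the job makes the
# scheduler recompute the priority from the priority plugin:
#   sudo scontrol hold 11349763
#   sudo scontrol release 11349763
```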
We haven't changed the priority for this job; the renice was our attempt to do so. Furthermore, when we try to see the priority of his jobs with sprio, it returns nothing:

09:16:00 m3-login1:~ ctan$ sprio -u abut0011
  JOBID PARTITION     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
09:16:20 m3-login1:~ ctan$ man sprio
09:16:46 m3-login1:~ ctan$ squeue -u abut0011
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          11349763      comp    MyJob abut0011 PD       0:00      1 (Priority)
          11349781      comp    MyJob abut0011 PD       0:00      1 (Priority)

His jobs do appear in the queue with the squeue and scontrol commands. According to the man page, sprio by default returns information for all pending jobs. Is there a reason why jobs 11349763 and 11349781 aren't shown?

09:19:59 m3-login1:~ ctan$ sprio -j 11349763
Unable to find jobs matching user/id(s) specified
09:20:01 m3-login1:~ ctan$ sprio -j 11349781
Unable to find jobs matching user/id(s) specified
A job with a manually set priority also will not show up in sprio. It looks like job 11349763's priority was explicitly set to 18000 by an administrator (a suspiciously round number, too).
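As a side note, even when sprio omits a job, its effective priority is still visible through squeue's `%Q` format field or `scontrol show job`. A small self-contained sketch, filtering a captured sample (the priority shown for 11349781 is hypothetical, for illustration only):

```shell
# Sketch: list pending jobs with their priorities via squeue.
# On a live system:
#   squeue -u abut0011 -o '%.12i %.10Q'
# Here we filter a sample so the snippet runs anywhere.
sample='JOBID PARTITION NAME USER ST PRIORITY
11349763 comp MyJob abut0011 PD 18000
11349781 comp MyJob abut0011 PD 18000'

# Skip the header, print job id and priority columns.
printf '%s\n' "$sample" | awk 'NR > 1 { print $1, $6 }'
```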
We will be updating the documentation to note that jobs with a manually set priority are not displayed in sprio. Is there anything else we can help with on this ticket?
Info given. sprio output will be addressed with bug 4757.