Job 27977958 has been pending for a long time and we are trying to understand if this is a scheduling issue.

JobId=27977958 JobName=ipl
   UserId=antares(1185200412) GroupId=antares(1185200412) MCS_label=N/A
   Priority=257717 Nice=0 Account=fairusers QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-06T17:20:48 EligibleTime=2020-07-06T17:20:48
   AccrueTime=2020-07-06T17:20:48
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-07T16:35:01
   Partition=learnfair AllocNode:Sid=localhost:25357
   ReqNodeList=(null) ExcNodeList=learnfair[1109,1338]
   NodeList=(null)
   NumNodes=8-8 NumCPUs=640 NumTasks=64 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
   TRES=cpu=640,mem=3840G,node=8,billing=2144,gres/gpu=64
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=80 MinMemoryCPU=6G MinTmpDiskNode=0
   Features=volta32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/checkpoint/antares/experiments/ipl/run_cont8nodes.sh continue from_gab_fixed_decoder_rescore_fixedparams_filtloopimp_filtlwd_filtlenbird --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=false --use_band_bird=true --lmweight=0.6192 --eoscore=-0.5652
   WorkDir=/checkpoint/antares/experiments/ipl
   AdminComment=stdin: /dev/null stdout: /checkpoint/antares/tmp/ipl-%j.out stderr: /checkpoint/antares/tmp/ipl-%j.err workdir: /checkpoint/antares/experiments/ipl command: /checkpoint/antares/experiments/ipl/run_cont8nodes.sh continue from_gab_fixed_decoder_rescore_fixedparams_filtloopimp_filtlwd_filtlenbird --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=false --use_band_bird=true --lmweight=0.6192 --eoscore=-0.5652
   StdErr=/checkpoint/antares/tmp/ipl-27977958.err
   StdIn=/dev/null
   StdOut=/checkpoint/antares/tmp/ipl-27977958.out
   Power=
   TresPerNode=gpu:8
   MailUser=(null) MailType=NONE

Similar job with much higher priority:

JobId=27980023 JobName=ipl
   UserId=gab(1185200041) GroupId=gab(1185200041) MCS_label=N/A
   Priority=295236 Nice=0 Account=fairusers QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-01:02:18 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-06T18:17:49 EligibleTime=2020-07-06T18:17:49
   AccrueTime=2020-07-06T18:17:49
   StartTime=2020-07-06T19:30:12 EndTime=2020-07-08T19:30:13 Deadline=N/A
   PreemptEligibleTime=2020-07-06T20:30:12 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-06T19:30:12
   Partition=learnfair AllocNode:Sid=localhost:75246
   ReqNodeList=(null) ExcNodeList=learnfair[1109,1338]
   NodeList=learnfair[1385,1398,1403,1412,1420,1439,1451,1481]
   BatchHost=learnfair1385
   NumNodes=8 NumCPUs=640 NumTasks=64 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
   TRES=cpu=640,node=8,billing=1664,gres/gpu=64
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=80 MinMemoryCPU=6G MinTmpDiskNode=0
   Features=volta32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/checkpoint/antares/experiments/ipl/run_cont8nodes_lex.sh continue from_gab_lexicon_based_fixed_decoder_rescore_filtloopimp_filtlwd_filtlen_lmdecay --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab_lexicon_based.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=true --lr_decay=600 --lr_decay_step=50 --wordscore_range=0,3 --ipl_decay_lm=true
   WorkDir=/checkpoint/antares/experiments/ipl
   AdminComment=stdin: /dev/null stdout: /checkpoint/antares/tmp/ipl-%j.out stderr: /checkpoint/antares/tmp/ipl-%j.err workdir: /checkpoint/antares/experiments/ipl command: /checkpoint/antares/experiments/ipl/run_cont8nodes_lex.sh continue from_gab_lexicon_based_fixed_decoder_rescore_filtloopimp_filtlwd_filtlen_lmdecay --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab_lexicon_based.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=true --lr_decay=600 --lr_decay_step=50 --wordscore_range=0,3 --ipl_decay_lm=true
   StdErr=/checkpoint/antares/tmp/ipl-27980023.err
   StdIn=/dev/null
   StdOut=/checkpoint/antares/tmp/ipl-27980023.out
   Power=
   TresPerNode=gpu:8
   MailUser=(null) MailType=NONE
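For reference, a small Python sketch (hypothetical, not part of the original ticket) that pulls the Priority field out of each scontrol dump and shows the gap between the two jobs; the `get_field` helper and the truncated one-line snippets below are illustrative only:

```python
import re

def get_field(scontrol_text, key):
    """Extract a single key=value field from `scontrol show job`-style output."""
    m = re.search(rf"\b{re.escape(key)}=(\S+)", scontrol_text)
    return m.group(1) if m else None

# Abbreviated snippets of the two dumps above (illustrative, not complete).
pending = "JobId=27977958 JobName=ipl Priority=257717 JobState=PENDING"
running = "JobId=27980023 JobName=ipl Priority=295236 JobState=RUNNING"

gap = int(get_field(running, "Priority")) - int(get_field(pending, "Priority"))
print(gap)  # 37519
```

So the running job outranks the pending one by roughly 37.5k priority points at the time of the dumps.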
Hi,
Can you send me the output from sprio for these jobs? E.g.:
    sprio -l -j 27977958,27980023
Dominik
$ sprio -l -j 27977958,27980023
   JOBID PARTITION     USER   PRIORITY       SITE        AGE      ASSOC  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE                 TRES
27977958 learnfair  antares     258746          0       2505          0       6239          3     250000          0          0

Here is another example provided by the user:

JobId=28028239 JobName=mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06
   UserId=asg(1185300563) GroupId=asg(1185300563) MCS_label=N/A
   Priority=311511 Nice=0 Account=fairusers QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:13:00 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-08T08:20:09 EligibleTime=2020-07-08T08:20:09
   AccrueTime=2020-07-08T08:20:09
   StartTime=2020-07-08T08:36:27 EndTime=2020-07-11T08:36:28 Deadline=N/A
   PreemptEligibleTime=2020-07-08T09:36:27 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T08:36:27
   Partition=learnfair AllocNode:Sid=localhost:64181
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=learnfair[1119,1307,1324,1331,1387,1415,1462,1497]
   BatchHost=learnfair1119
   NumNodes=8 NumCPUs=512 NumTasks=8 CPUs/Task=64 ReqB:S:C:T=0:0:*:*
   TRES=cpu=512,node=8,billing=1536,gres/gpu=64
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=64 MinMemoryCPU=7G MinTmpDiskNode=0
   Features=volta32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/private/home/asg/projects/detr_mmf_main
   AdminComment=stdin: /dev/null stdout: /checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.log stderr: /checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.stderr.%j workdir: /private/home/asg/projects/detr_mmf_main command:
   Comment=MMF Load testing
   StdErr=/checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.stderr.28028239
   StdIn=/dev/null
   StdOut=/checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.log
   Power=
   TresPerNode=gpu:volta:8
   MailUser=(null) MailType=NONE
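As a quick sanity check (a hypothetical sketch, not part of the ticket): sprio's PRIORITY column should be approximately the sum of its factor columns, so the numbers above are internally consistent. Small off-by-one differences can appear because Slurm computes each factor in floating point. Note also that 258746 here is slightly higher than the Priority=257717 in the earlier scontrol dump, presumably because the age factor grew between the two snapshots.

```python
# Factor values copied from the sprio -l output for job 27977958 above.
factors = {
    "SITE": 0,
    "AGE": 2505,
    "ASSOC": 0,
    "FAIRSHARE": 6239,
    "JOBSIZE": 3,
    "PARTITION": 250000,
    "QOS": 0,
    "TRES": 0,
}
reported_priority = 258746  # PRIORITY column from the same sprio row

computed = sum(factors.values())
# Allow a small discrepancy from per-factor floating-point truncation.
print(computed, abs(computed - reported_priority))  # 258747 1
```

The dominant terms are the partition factor (250000) and fairshare (6239), so fairshare differences between users are the likeliest source of the priority gap on this partition.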
Hi,
Can you send me sprio output for a pair of pending jobs, one high and one low priority?
Dominik
Hi,
The users have reported that the jobs in question have run, and they are not currently seeing this issue. Marking the case as Resolved.
Thanks,
-Kris