| Summary: | Similar Slurm Jobs have different Priority | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kris Whetham <kwhetham> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart |
| Version: | 20.02.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FB (PSLA) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Kris Whetham
2020-07-08 10:12:49 MDT
Hi,

Can you send me the output from sprio for these jobs? E.g.:
sprio -l -j 27977958,27980023

Dominik

$ sprio -l -j 27977958,27980023
JOBID     PARTITION  USER     PRIORITY  SITE  AGE   ASSOC  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE  TRES
27977958  learnfair  antares  258746    0     2505  0      6239       3        250000     0    0
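
To see which factors drive the priority in the sprio output above, the weighted columns can be added up and compared with the reported PRIORITY. The snippet below is only a sketch: it hard-codes the single row shown in this ticket, assumes the multifactor plugin combines the terms as a plain sum minus NICE, and skips the TRES column, which is empty here. The function name `decompose` is made up for illustration.

```python
# Header and data row copied from the sprio -l output in this ticket.
HEADER = ("JOBID PARTITION USER PRIORITY SITE AGE ASSOC FAIRSHARE "
          "JOBSIZE PARTITION QOS NICE TRES").split()
ROW = "27977958 learnfair antares 258746 0 2505 0 6239 3 250000 0 0".split()


def decompose(header, row):
    """Return (reported priority, per-factor values, recomputed sum).

    Assumption: the weighted factor columns start right after PRIORITY
    (index 4). Columns are paired by position because PARTITION appears
    twice (partition name, then partition factor).
    """
    reported = int(row[3])
    factors = {
        name: int(value)
        for name, value in zip(header[4:], row[4:])
        if name != "TRES"  # TRES is not a plain integer column; skipped here
    }
    nice = factors.pop("NICE", 0)  # NICE lowers the priority
    return reported, factors, sum(factors.values()) - nice


reported, factors, recomputed = decompose(HEADER, ROW)
for name, value in factors.items():
    print(f"{name:<10}{value:>8}")
print(f"{'sum':<10}{recomputed:>8}")
print(f"{'reported':<10}{reported:>8}")
```

For this job the partition term (250000) dominates, with fair-share (6239) and age (2505) making up nearly all of the rest. The recomputed sum is one higher than the reported priority, presumably because sprio rounds each displayed component. Two jobs in the same partition would therefore differ mainly through their age and fair-share terms.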
Here is another example provided by the user:
JobId=28028239 JobName=mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06
UserId=asg(1185300563) GroupId=asg(1185300563) MCS_label=N/A
Priority=311511 Nice=0 Account=fairusers QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=01:13:00 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2020-07-08T08:20:09 EligibleTime=2020-07-08T08:20:09
AccrueTime=2020-07-08T08:20:09
StartTime=2020-07-08T08:36:27 EndTime=2020-07-11T08:36:28 Deadline=N/A
PreemptEligibleTime=2020-07-08T09:36:27 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T08:36:27
Partition=learnfair AllocNode:Sid=localhost:64181
ReqNodeList=(null) ExcNodeList=(null)
NodeList=learnfair[1119,1307,1324,1331,1387,1415,1462,1497]
BatchHost=learnfair1119
NumNodes=8 NumCPUs=512 NumTasks=8 CPUs/Task=64 ReqB:S:C:T=0:0:*:*
TRES=cpu=512,node=8,billing=1536,gres/gpu=64
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=64 MinMemoryCPU=7G MinTmpDiskNode=0
Features=volta32gb DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/private/home/asg/projects/detr_mmf_main
AdminComment=stdin: /dev/null stdout: /checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.log stderr: /checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.stderr.%j workdir: /private/home/asg/projects/detr_mmf_main command:
Comment=MMF Load testing
StdErr=/checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.stderr.28028239
StdIn=/dev/null
StdOut=/checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.log
Power=
TresPerNode=gpu:volta:8
MailUser=(null) MailType=NONE
Hi,

Can you send me the sprio output for a pair of pending jobs, one with high and one with low priority?

Dominik

Hi,

Users have reported that the jobs in question have run and that they are not currently seeing this issue. Marking the case as Resolved.

Thanks.
-Kris