Job 27977958 has been pending for a long time and we are trying to understand if this is a scheduling issue.

JobId=27977958 JobName=ipl
   UserId=antares(1185200412) GroupId=antares(1185200412) MCS_label=N/A
   Priority=257717 Nice=0 Account=fairusers QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-06T17:20:48 EligibleTime=2020-07-06T17:20:48
   AccrueTime=2020-07-06T17:20:48
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-07T16:35:01
   Partition=learnfair AllocNode:Sid=localhost:25357
   ReqNodeList=(null) ExcNodeList=learnfair[1109,1338]
   NodeList=(null)
   NumNodes=8-8 NumCPUs=640 NumTasks=64 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
   TRES=cpu=640,mem=3840G,node=8,billing=2144,gres/gpu=64
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=80 MinMemoryCPU=6G MinTmpDiskNode=0
   Features=volta32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/checkpoint/antares/experiments/ipl/run_cont8nodes.sh continue from_gab_fixed_decoder_rescore_fixedparams_filtloopimp_filtlwd_filtlenbird --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=false --use_band_bird=true --lmweight=0.6192 --eoscore=-0.5652
   WorkDir=/checkpoint/antares/experiments/ipl
   AdminComment=stdin: /dev/null stdout: /checkpoint/antares/tmp/ipl-%j.out stderr: /checkpoint/antares/tmp/ipl-%j.err workdir: /checkpoint/antares/experiments/ipl command: /checkpoint/antares/experiments/ipl/run_cont8nodes.sh continue from_gab_fixed_decoder_rescore_fixedparams_filtloopimp_filtlwd_filtlenbird --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=false --use_band_bird=true --lmweight=0.6192 --eoscore=-0.5652
   StdErr=/checkpoint/antares/tmp/ipl-27977958.err
   StdIn=/dev/null
   StdOut=/checkpoint/antares/tmp/ipl-27977958.out
   Power=
   TresPerNode=gpu:8
   MailUser=(null) MailType=NONE

Similar job with much higher priority:

JobId=27980023 JobName=ipl
   UserId=gab(1185200041) GroupId=gab(1185200041) MCS_label=N/A
   Priority=295236 Nice=0 Account=fairusers QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-01:02:18 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-06T18:17:49 EligibleTime=2020-07-06T18:17:49
   AccrueTime=2020-07-06T18:17:49
   StartTime=2020-07-06T19:30:12 EndTime=2020-07-08T19:30:13 Deadline=N/A
   PreemptEligibleTime=2020-07-06T20:30:12 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-06T19:30:12
   Partition=learnfair AllocNode:Sid=localhost:75246
   ReqNodeList=(null) ExcNodeList=learnfair[1109,1338]
   NodeList=learnfair[1385,1398,1403,1412,1420,1439,1451,1481]
   BatchHost=learnfair1385
   NumNodes=8 NumCPUs=640 NumTasks=64 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
   TRES=cpu=640,node=8,billing=1664,gres/gpu=64
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=80 MinMemoryCPU=6G MinTmpDiskNode=0
   Features=volta32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/checkpoint/antares/experiments/ipl/run_cont8nodes_lex.sh continue from_gab_lexicon_based_fixed_decoder_rescore_filtloopimp_filtlwd_filtlen_lmdecay --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab_lexicon_based.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=true --lr_decay=600 --lr_decay_step=50 --wordscore_range=0,3 --ipl_decay_lm=true
   WorkDir=/checkpoint/antares/experiments/ipl
   AdminComment=stdin: /dev/null stdout: /checkpoint/antares/tmp/ipl-%j.out stderr: /checkpoint/antares/tmp/ipl-%j.err workdir: /checkpoint/antares/experiments/ipl command: /checkpoint/antares/experiments/ipl/run_cont8nodes_lex.sh continue from_gab_lexicon_based_fixed_decoder_rescore_filtloopimp_filtlwd_filtlen_lmdecay --flagsfile=/checkpoint/antares/experiments/ipl/config/from_gab_lexicon_based.cfg --tr_nbest=50 --itersave=true --itersaven=40 --use_band=true --lr_decay=600 --lr_decay_step=50 --wordscore_range=0,3 --ipl_decay_lm=true
   StdErr=/checkpoint/antares/tmp/ipl-27980023.err
   StdIn=/dev/null
   StdOut=/checkpoint/antares/tmp/ipl-27980023.out
   Power=
   TresPerNode=gpu:8
   MailUser=(null) MailType=NONE
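For reference, a small Python sketch (hypothetical, not part of the original ticket) that pulls the Priority field out of each scontrol dump and shows the gap between the two jobs; the `get_field` helper and the truncated one-line snippets below are illustrative only:

```python
import re

def get_field(scontrol_text, key):
    """Extract a single key=value field from `scontrol show job`-style output."""
    m = re.search(rf"\b{re.escape(key)}=(\S+)", scontrol_text)
    return m.group(1) if m else None

# Abbreviated snippets of the two dumps above (illustrative, not complete).
pending = "JobId=27977958 JobName=ipl Priority=257717 JobState=PENDING"
running = "JobId=27980023 JobName=ipl Priority=295236 JobState=RUNNING"

gap = int(get_field(running, "Priority")) - int(get_field(pending, "Priority"))
print(gap)  # 37519
```

So the running job outranks the pending one by roughly 37.5k priority points at the time of the dumps.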
Hi,
Can you send me the output from sprio for these jobs? E.g.:
    sprio -l -j 27977958,27980023
Dominik
$ sprio -l -j 27977958,27980023
   JOBID PARTITION     USER   PRIORITY       SITE        AGE      ASSOC  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE                 TRES
27977958 learnfair  antares     258746          0       2505          0       6239          3     250000          0          0

Here is another example provided by the user:

JobId=28028239 JobName=mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06
   UserId=asg(1185300563) GroupId=asg(1185300563) MCS_label=N/A
   Priority=311511 Nice=0 Account=fairusers QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:13:00 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-08T08:20:09 EligibleTime=2020-07-08T08:20:09
   AccrueTime=2020-07-08T08:20:09
   StartTime=2020-07-08T08:36:27 EndTime=2020-07-11T08:36:28 Deadline=N/A
   PreemptEligibleTime=2020-07-08T09:36:27 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T08:36:27
   Partition=learnfair AllocNode:Sid=localhost:64181
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=learnfair[1119,1307,1324,1331,1387,1415,1462,1497]
   BatchHost=learnfair1119
   NumNodes=8 NumCPUs=512 NumTasks=8 CPUs/Task=64 ReqB:S:C:T=0:0:*:*
   TRES=cpu=512,node=8,billing=1536,gres/gpu=64
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=64 MinMemoryCPU=7G MinTmpDiskNode=0
   Features=volta32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/private/home/asg/projects/detr_mmf_main
   AdminComment=stdin: /dev/null stdout: /checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.log stderr: /checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.stderr.%j workdir: /private/home/asg/projects/detr_mmf_main command:
   Comment=MMF Load testing
   StdErr=/checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.stderr.28028239
   StdIn=/dev/null
   StdOut=/checkpoint/asg/jobs/mmf/checkpoints/detr_mmf/mmbt_detr/06_27/vqa2_direct/vg_r101/mmbt.mmbt.bs128.s1.adam_w.lr5e-05.mu88000.allTrue.lrScale0.1.lrBackbone1e-06.ngpu64/train.log
   Power=
   TresPerNode=gpu:volta:8
   MailUser=(null) MailType=NONE
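As a quick sanity check (a hypothetical sketch, not part of the ticket): sprio's PRIORITY column should be approximately the sum of its factor columns, so the numbers above are internally consistent. Small off-by-one differences can appear because Slurm computes each factor in floating point. Note also that 258746 here is slightly higher than the Priority=257717 in the earlier scontrol dump, presumably because the age factor grew between the two snapshots.

```python
# Factor values copied from the sprio -l output for job 27977958 above.
factors = {
    "SITE": 0,
    "AGE": 2505,
    "ASSOC": 0,
    "FAIRSHARE": 6239,
    "JOBSIZE": 3,
    "PARTITION": 250000,
    "QOS": 0,
    "TRES": 0,
}
reported_priority = 258746  # PRIORITY column from the same sprio row

computed = sum(factors.values())
# Allow a small discrepancy from per-factor floating-point truncation.
print(computed, abs(computed - reported_priority))  # 258747 1
```

The dominant terms are the partition factor (250000) and fairshare (6239), so fairshare differences between users are the likeliest source of the priority gap on this partition.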
Hi,
Can you send me sprio output for a pair of pending jobs, one high and one low priority?
Dominik
Hi,
The users have reported that the jobs in question have run, and they are not currently seeing this issue. Marking the case as Resolved.
Thanks,
-Kris