We had the following job submitted today

root@ericidle:/etc/cron.daily # scontrol show job 62688
JobId=62688 JobName=po_mcomp
   UserId=emmanuelgonzalez(45150) GroupId=lyons-lab(31100) MCS_label=N/A
   Priority=4 Nice=0 Account=lyons-lab QOS=part_qos_standard
   JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=18:00:00 TimeMin=N/A
   SubmitTime=2020-08-17T10:46:39 EligibleTime=2020-08-17T10:46:39
   AccrueTime=2020-08-17T10:46:39
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-17T11:44:42
   Partition=standard AllocNode:Sid=wentletrap:12390
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=94,mem=47000G,node=1,billing=94
   Socks/Node=* NtasksPerN:B:S:C=94:0:*:* CoreSpec=*
   MinCPUsNode=94 MinMemoryCPU=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/xdisk/ericlyons/big_data/egonzalez/PhytoOracle/stereoTopRGB
   Power=

This job was preventing any other job from being scheduled for some reason for at least 30 minutes. In the slurmctld log, all that was reported was that this job would never run in partition standard, and no attempt appeared to be made to start another job. I'm hoping you can explain this behavior and/or let me know what I can do to avoid this situation in the future.

Thanks!
(In reply to Todd Merritt from comment #0)
> We had the following job submitted today
>
> root@ericidle:/etc/cron.daily # scontrol show job 62688
> JobId=62688 JobName=po_mcomp
> UserId=emmanuelgonzalez(45150) GroupId=lyons-lab(31100) MCS_label=N/A
> Priority=4 Nice=0 Account=lyons-lab QOS=part_qos_standard
> JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=18:00:00 TimeMin=N/A
> SubmitTime=2020-08-17T10:46:39 EligibleTime=2020-08-17T10:46:39
> AccrueTime=2020-08-17T10:46:39
> StartTime=Unknown EndTime=Unknown Deadline=N/A
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-17T11:44:42
> Partition=standard AllocNode:Sid=wentletrap:12390
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=94,mem=47000G,node=1,billing=94
> Socks/Node=* NtasksPerN:B:S:C=94:0:*:* CoreSpec=*
> MinCPUsNode=94 MinMemoryCPU=500G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=bash
> WorkDir=/xdisk/ericlyons/big_data/egonzalez/PhytoOracle/stereoTopRGB
> Power=
>
> This job was preventing any other job from being scheduled for some reason
> for at least 30 minutes. In the slurmctld log, all that was reported was
> that this job would never run in partition standard and no attempt appeared
> to be made to start another job. I'm hoping you can explain this behavior
> and/or let me know what I can do to avoid this situation in the future.
>
> Thanks!

Hi Todd,

What happened after 30 minutes? Did jobs suddenly start to be scheduled?

I would need to see your backfill parameters (send me back your slurm.conf) and the slurmctld log. Also the output of:

- sacctmgr show qos -p
- sinfo
- squeue
- sdiag (though if everything is working fine now it won't help much)

Theoretically a single job can reserve resources for itself, so even if the nodes look idle they are actually reserved. However, that is not OK if the job cannot run due to a limit; in that case the resources must be freed. Moreover, this job only requested one node.

I will also run a test to see if this case works as expected. It could also be another situation, not directly related to this job.

Thanks
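For reference, when the queue gets stuck again, those outputs could be captured in one pass with something along these lines (the output directory name is only an example):

# collect the requested scheduler state while the queue is stuck
out=slurm-diag-$(date +%Y%m%dT%H%M%S)   # example path, adjust as needed
mkdir -p "$out"
sacctmgr show qos -p > "$out/qos.txt"
sinfo                > "$out/sinfo.txt"
squeue               > "$out/squeue.txt"
sdiag                > "$out/sdiag.txt"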
Hi,

I ran scancel on the job after 30 minutes since it was blocking several interactive jobs from starting. As soon as I canceled it, the backlog of jobs all started. The backlogged jobs all listed Priority as their reason for not starting. I'll attach the requested files.

root@ericidle:~ # sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00|||cluster|||1.000000||||||||||||||||||
part_qos_windfall|1|00:00:00|user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
part_qos_standard|3|00:00:00|part_qos_windfall,user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
user_qos_bjoyce3|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=2112|cpu=21000000||2000|2000|||||||||||||
user_qos_tmerritt|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=2112|cpu=21000000||2000|2000|||||||||||||
user_qos_idlecycles|0|00:00:00|||cluster|OverPartQOS||1.000000||||100|200|||||||||||||
user_qos_nkchen|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3674,gres/gpu:volta=0|cpu=16819200||2000|2000|||||||||||||
user_qos_timeifler|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3866,gres/gpu:volta=0|cpu=25228800||2000|2000|||||||||||||
user_qos_josh|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3338,gres/gpu:volta=2|cpu=2102400||2000|2000|||||||||||||
user_qos_jlbredas|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=5210,gres/gpu:volta=0|cpu=84096000||2000|2000|||||||||||||
user_qos_denard|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3698,gres/gpu:volta=1|cpu=17870400||2000|2000|||||||||||||
user_qos_kgklein|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3386,gres/gpu:volta=0|cpu=4204800||2000|2000|||||||||||||
user_qos_xytang|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3386,gres/gpu:volta=0|cpu=4204800||2000|2000|||||||||||||
user_qos_jrussell|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=4442,gres/gpu:volta=0|cpu=50457600||2000|2000|||||||||||||

root@ericidle:~ # sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
windfall* up infinite 6 down* r1u29n2,r2u06n1,r2u12n1,r2u14n2,r3u05n2,r3u36n2
windfall* up infinite 1 drain r1u11n1
windfall* up infinite 80 mix r1u03n[1-2],r1u07n[1-2],r1u08n[1-2],r1u09n[1-2],r1u10n[1-2],r1u11n2,r1u12n[1-2],r1u16n1,r1u17n1,r1u18n[1-2],r1u25n1,r2u27n[1-2],r2u28n[1-2],r2u29n[1-2],r2u30n[1-2],r2u31n[1-2],r2u32n[1-2],r2u33n[1-2],r2u34n[1-2],r2u35n[1-2],r2u36n[1-2],r3u31n[1-2],r3u32n[1-2],r3u33n[1-2],r3u34n[1-2],r3u35n[1-2],r4u13n[1-2],r4u15n[1-2],r4u18n[1-2],r4u25n[1-2],r4u26n[1-2],r4u27n[1-2],r4u28n[1-2],r4u29n[1-2],r4u30n[1-2],r4u31n[1-2],r4u32n[1-2],r4u33n[1-2],r4u34n[1-2],r4u35n[1-2],r4u36n[1-2],r5u19n1,r5u25n1
windfall* up infinite 6 alloc r1u04n[1-2],r1u05n[1-2],r1u06n[1-2]
windfall* up infinite 129 idle r1u13n[1-2],r1u14n[1-2],r1u15n[1-2],r1u16n2,r1u17n2,r1u25n2,r1u26n[1-2],r1u27n[1-2],r1u28n[1-2],r1u29n1,r1u30n[1-2],r1u31n[1-2],r1u32n[1-2],r1u33n[1-2],r1u34n[1-2],r1u35n[1-2],r1u36n[1-2],r2u03n[1-2],r2u04n[1-2],r2u05n[1-2],r2u06n2,r2u07n[1-2],r2u08n[1-2],r2u09n[1-2],r2u10n[1-2],r2u11n[1-2],r2u12n2,r2u13n[1-2],r2u14n1,r2u15n[1-2],r2u16n[1-2],r2u17n[1-2],r2u18n[1-2],r2u25n[1-2],r2u26n[1-2],r3u05n1,r3u06n[1-2],r3u07n[1-2],r3u08n[1-2],r3u09n[1-2],r3u10n[1-2],r3u11n[1-2],r3u12n[1-2],r3u13n[1-2],r3u14n[1-2],r3u15n[1-2],r3u16n[1-2],r3u17n[1-2],r3u18n[1-2],r3u25n[1-2],r3u26n[1-2],r3u27n[1-2],r3u28n[1-2],r3u29n[1-2],r3u30n[1-2],r3u36n1,r4u07n[1-2],r4u08n[1-2],r4u09n[1-2],r4u10n[1-2],r4u11n[1-2],r4u12n[1-2],r4u14n[1-2],r4u16n[1-2],r4u17n[1-2],r5u11n1,r5u13n1,r5u15n1,r5u17n1,r5u24n1,r5u27n1,r5u29n1,r5u31n1
standard up infinite 6 down* r1u29n2,r2u06n1,r2u12n1,r2u14n2,r3u05n2,r3u36n2
standard up infinite 1 drain r1u11n1
standard up infinite 80 mix r1u03n[1-2],r1u07n[1-2],r1u08n[1-2],r1u09n[1-2],r1u10n[1-2],r1u11n2,r1u12n[1-2],r1u16n1,r1u17n1,r1u18n[1-2],r1u25n1,r2u27n[1-2],r2u28n[1-2],r2u29n[1-2],r2u30n[1-2],r2u31n[1-2],r2u32n[1-2],r2u33n[1-2],r2u34n[1-2],r2u35n[1-2],r2u36n[1-2],r3u31n[1-2],r3u32n[1-2],r3u33n[1-2],r3u34n[1-2],r3u35n[1-2],r4u13n[1-2],r4u15n[1-2],r4u18n[1-2],r4u25n[1-2],r4u26n[1-2],r4u27n[1-2],r4u28n[1-2],r4u29n[1-2],r4u30n[1-2],r4u31n[1-2],r4u32n[1-2],r4u33n[1-2],r4u34n[1-2],r4u35n[1-2],r4u36n[1-2],r5u19n1,r5u25n1
standard up infinite 6 alloc r1u04n[1-2],r1u05n[1-2],r1u06n[1-2]
standard up infinite 129 idle r1u13n[1-2],r1u14n[1-2],r1u15n[1-2],r1u16n2,r1u17n2,r1u25n2,r1u26n[1-2],r1u27n[1-2],r1u28n[1-2],r1u29n1,r1u30n[1-2],r1u31n[1-2],r1u32n[1-2],r1u33n[1-2],r1u34n[1-2],r1u35n[1-2],r1u36n[1-2],r2u03n[1-2],r2u04n[1-2],r2u05n[1-2],r2u06n2,r2u07n[1-2],r2u08n[1-2],r2u09n[1-2],r2u10n[1-2],r2u11n[1-2],r2u12n2,r2u13n[1-2],r2u14n1,r2u15n[1-2],r2u16n[1-2],r2u17n[1-2],r2u18n[1-2],r2u25n[1-2],r2u26n[1-2],r3u05n1,r3u06n[1-2],r3u07n[1-2],r3u08n[1-2],r3u09n[1-2],r3u10n[1-2],r3u11n[1-2],r3u12n[1-2],r3u13n[1-2],r3u14n[1-2],r3u15n[1-2],r3u16n[1-2],r3u17n[1-2],r3u18n[1-2],r3u25n[1-2],r3u26n[1-2],r3u27n[1-2],r3u28n[1-2],r3u29n[1-2],r3u30n[1-2],r3u36n1,r4u07n[1-2],r4u08n[1-2],r4u09n[1-2],r4u10n[1-2],r4u11n[1-2],r4u12n[1-2],r4u14n[1-2],r4u16n[1-2],r4u17n[1-2],r5u11n1,r5u13n1,r5u15n1,r5u17n1,r5u24n1,r5u27n1,r5u29n1,r5u31n1

root@ericidle:~ # squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
62980_[7-10] standard slurm-su fahclien PD 0:00 1 (AssocGrpCPUMinutesLimit)
60663 standard test7 jeongpil R 6-01:54:37 2 r3u35n[1-2]
60662 standard test6 jeongpil R 6-01:54:49 2 r3u34n[1-2]
60661 standard test4 jeongpil R 6-01:54:56 2 r3u33n[1-2]
60660 standard test2 jeongpil R 6-01:55:03 2 r3u32n[1-2]
60659 standard test1 jeongpil R 6-01:55:12 2 r3u31n[1-2]
60578 windfall test2.5 jeongpil R 7-06:35:05 2 r4u36n[1-2]
60577 windfall test2.4 jeongpil R 7-06:35:23 2 r4u35n[1-2]
60574 windfall test2.3 jeongpil R 7-06:41:04 2 r4u34n[1-2]
60573 windfall test2.2 jeongpil R 7-06:41:10 2 r4u33n[1-2]
60568 windfall test2.1 jeongpil R 7-06:41:53 2 r4u32n[1-2]
60567 windfall test2.0 jeongpil R 7-06:41:59 2 r4u31n[1-2]
60566 windfall test1.9 jeongpil R 7-06:42:59 2 r4u30n[1-2]
60565 windfall test1.8 jeongpil R 7-06:43:24 2 r4u29n[1-2]
60564 windfall test1.7 jeongpil R 7-06:43:46 2 r4u28n[1-2]
60563 windfall test1.6 jeongpil R 7-06:44:23 2 r4u27n[1-2]
60561 standard test1.1 jeongpil R 7-06:46:12 2 r4u26n[1-2]
60560 standard test9 jeongpil R 7-06:46:17 2 r4u25n[1-2]
60559 standard test8 jeongpil R 7-06:46:20 2 r4u18n[1-2]
60556 standard test5 jeongpil R 7-06:46:47 2 r4u15n[1-2]
60554 standard test3 jeongpil R 7-06:47:06 2 r4u13n[1-2]
64417 standard benchmar josephlo R 5:20:53 1 r1u12n2
64427 standard Helicove benowitz R 4:28:05 1 r5u19n1
64436 windfall n208835 jeongpil R 3:58:57 1 r2u29n2
64435 windfall n208830 jeongpil R 3:59:03 1 r2u29n1
64434 windfall n208805 jeongpil R 3:59:06 1 r2u30n2
64433 windfall n208800 jeongpil R 3:59:15 1 r2u30n1
64432 windfall n204835 jeongpil R 4:01:03 1 r2u31n2
64431 windfall n204830 jeongpil R 4:01:08 1 r2u31n1
64430 windfall n204805 jeongpil R 4:01:13 1 r2u32n2
64429 windfall n204800 jeongpil R 4:01:18 1 r2u32n1
64414 standard Abyss-k5 natalier R 5:29:34 1 r1u12n1
64413 standard Abyss-k4 natalier R 5:30:12 1 r1u11n2
64412 standard Abyss-k4 natalier R 5:30:46 1 r1u10n1
64392 standard eng_memo plovett R 6:03:23 1 r1u07n2
64387 standard eng_memo plovett R 6:11:32 1 r1u07n1
64366 windfall n216835 jeongpil R 8:57:05 1 r2u27n2
64365 windfall n216830 jeongpil R 8:57:11 1 r2u27n1
64364 windfall n216805 jeongpil R 8:57:16 1 r2u28n2
64363 windfall n216800 jeongpil R 8:57:22 1 r2u28n1
64264 windfall hyphy denard R 11:08:13 1 r1u09n2
64225 windfall hyphy denard R 11:08:16 1 r1u09n1
64236 windfall hyphy denard R 11:08:16 1 r1u09n2
64237 windfall hyphy denard R 11:08:16 1 r1u09n2
64238 windfall hyphy denard R 11:08:16 1 r1u09n2
64239 windfall hyphy denard R 11:08:16 1 r1u09n2
64249 windfall hyphy denard R 11:08:16 1 r1u09n2
64250 windfall hyphy denard R 11:08:16 1 r1u09n2
64255 windfall hyphy denard R 11:08:16 1 r1u09n2
64256 windfall hyphy denard R 11:08:16 1 r1u09n2
64260 windfall hyphy denard R 11:08:16 1 r1u09n2
64183 windfall hyphy denard R 11:08:19 1 r1u07n2
64185 windfall hyphy denard R 11:08:19 1 r1u09n1
64189 windfall hyphy denard R 11:08:19 1 r1u09n1
64198 windfall hyphy denard R 11:08:19 1 r1u09n1
64202 windfall hyphy denard R 11:08:19 1 r1u09n1
64211 windfall hyphy denard R 11:08:19 1 r1u09n1
64213 windfall hyphy denard R 11:08:19 1 r1u09n1
64141 windfall hyphy denard R 11:08:21 1 r1u03n1
64142 windfall hyphy denard R 11:08:21 1 r1u03n1
64147 windfall hyphy denard R 11:08:21 1 r1u03n1
64153 windfall hyphy denard R 11:08:21 1 r1u07n1
64181 windfall hyphy denard R 11:08:21 1 r1u07n2
64113 windfall sn136835 jeongpil R 11:16:12 1 r2u33n2
64112 windfall sn136830 jeongpil R 11:16:16 1 r2u33n1
64111 windfall sn136805 jeongpil R 11:16:20 1 r2u34n2
64110 windfall sn136800 jeongpil R 11:16:24 1 r2u34n1
64109 windfall n136835 jeongpil R 11:17:44 1 r2u35n2
64108 windfall n136830 jeongpil R 11:17:49 1 r2u35n1
64107 windfall n136805 jeongpil R 11:17:53 1 r2u36n2
64106 windfall n136800 jeongpil R 11:17:56 1 r2u36n1
63990_0 standard SEDNoIR rehvidin R 11:57:39 1 r1u03n2
63990_1 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_2 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_3 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_4 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_5 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_6 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_7 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_8 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_9 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_10 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_11 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_12 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_13 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_14 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_15 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_16 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_17 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_18 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_19 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_20 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_21 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_22 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_23 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_24 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_25 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_26 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_27 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_28 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_29 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_30 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_31 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_32 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_33 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_34 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_35 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_36 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_37 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_38 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_39 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_40 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_41 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_42 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_43 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_44 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_45 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_46 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_47 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_48 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_49 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_50 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_51 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_52 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_53 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_54 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_55 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_56 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_57 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_58 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_59 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_60 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_61 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_62 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_63 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_64 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_65 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_66 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_67 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_68 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_69 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_70 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_71 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_72 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_73 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_74 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_75 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_76 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_77 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_78 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_79 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_80 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_81 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_82 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_83 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_84 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_85 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_86 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_87 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_88 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_89 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63160 windfall fv_s_tes lauterbu R 12:31:08 1 r1u07n2
63161 windfall fv_s_200 lauterbu R 12:31:08 1 r1u08n1
63145 windfall hyphy denard R 13:12:38 1 r1u07n1
63109 windfall hyphy denard R 13:12:41 1 r1u07n1
63114 windfall hyphy denard R 13:12:41 1 r1u07n1
63130 windfall hyphy denard R 13:12:41 1 r1u07n1
63086 windfall hyphy denard R 13:12:43 1 r1u07n1
63078 windfall hyphy denard R 13:12:44 1 r1u07n1
63079 windfall hyphy denard R 13:12:44 1 r1u07n1
63070 windfall hyphy denard R 13:13:02 1 r1u03n2
63010 windfall hyphy denard R 13:13:05 1 r1u03n1
63019 windfall hyphy denard R 13:13:05 1 r1u03n1
63038 windfall hyphy denard R 13:13:05 1 r1u03n2
63040 windfall hyphy denard R 13:13:05 1 r1u03n2
62980_1 standard slurm-su fahclien R 14:32:02 1 r1u04n1
62980_2 standard slurm-su fahclien R 14:32:02 1 r1u04n2
62980_3 standard slurm-su fahclien R 14:32:02 1 r1u05n1
62980_4 standard slurm-su fahclien R 14:32:02 1 r1u05n2
62980_5 standard slurm-su fahclien R 14:32:02 1 r1u06n1
62980_6 standard slurm-su fahclien R 14:32:02 1 r1u06n2
62854 windfall hyphy denard R 15:45:56 1 r1u03n1
62863 windfall hyphy denard R 15:45:56 1 r1u03n1
62838 windfall hyphy denard R 15:45:59 1 r1u03n1
62846 windfall hyphy denard R 15:45:59 1 r1u03n1
62709 windfall tar_back denard R 17:45:59 1 r1u10n2
62712 windfall database emsenhub R 17:45:59 1 r1u10n2
62682 windfall hyphy denard R 18:45:23 1 r1u25n1
62641 windfall hyphy denard R 18:45:26 1 r1u18n2
62644 windfall hyphy denard R 18:45:26 1 r1u18n2
62647 windfall hyphy denard R 18:45:26 1 r1u18n2
62537 windfall hyphy denard R 18:47:16 1 r1u18n1
62564 windfall hyphy denard R 18:47:16 1 r1u18n1
62569 windfall hyphy denard R 18:47:16 1 r1u18n1
62476 windfall hyphy denard R 18:47:22 1 r1u17n1
62487 windfall hyphy denard R 18:47:22 1 r1u17n1
62414 windfall hyphy denard R 18:47:25 1 r1u16n1
62437 windfall hyphy denard R 18:47:25 1 r1u17n1
62373 windfall hyphy denard R 18:47:28 1 r1u10n2
62366 windfall hyphy denard R 18:47:30 1 r1u10n2
62276 windfall hyphy denard R 18:47:36 1 r1u08n1
62277 windfall hyphy denard R 18:47:36 1 r1u08n1
62306 windfall hyphy denard R 18:47:36 1 r1u09n2
62211 windfall hyphy denard R 18:47:43 1 r1u07n2
60580 standard lstmAnal dmschwar R 7-01:27:44 1 r5u25n1
Created attachment 15481: slurmctld log
Created attachment 15482: slurm config
Unfortunately your debug level of 'info' (the default) doesn't let me see anything relevant.

You should try to fix several errors that appear in your logs:

- "Invalid argument": are the related nodes running an old version?
- low real_memory size (483605 < 515830): you need to adjust the configured memory

Why do you have this?

DebugFlags=NO_CONF_HASH

Besides that, I'd need you to run with DebugFlags=backfill and SlurmctldDebug=debug (debug2 would be ideal for getting more info, if it doesn't impact your performance).

After increasing the logging, do you think it would be possible to reproduce the issue by running a job with the exact same parameters as the one that caused it? If so, I'd then need the sdiag output while the queue is stuck, plus the slurmctld log.

Thanks
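For reference, both settings can be changed on the fly with scontrol or made persistent in slurm.conf; a minimal sketch (the exact level is just a suggestion):

scontrol setdebug debug2           # raise SlurmctldDebug at runtime
scontrol setdebugflags +backfill   # add the Backfill debug flag at runtime

# or persistently in slurm.conf, followed by an scontrol reconfigure:
SlurmctldDebug=debug2
DebugFlags=Backfill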
Hi Felip,

Yes, I cleaned up most of those errors that I saw when investigating this. We had a node that apparently lost a DIMM, and I took it offline. The "Invalid argument" errors from the r5 nodes were related to my colleague adding them to Slurm with an incorrect number of GPUs. I re-imaged them and they're registered correctly now.

When we initially started this deployment, I had added the DebugFlags=NO_CONF_HASH flag in hopes that I could get by with a minimal configuration on the job submission nodes, but Slurm has disabused me of that notion and the configurations are now fully synchronized through a horrible manual process. I'm looking forward to the configuration distribution feature that I saw in v20 when we're able to upgrade :)

I think I can reproduce it. I'm pretty sure the culprit was that the job was requesting 47T of RAM. I can try modifying those parameters and submitting a test to see if it blocks scheduling again. Are there any other parameters that you'd suggest setting to help us debug scheduling issues in the future as well?

Thanks!
Hrm, well, I set the scheduler parameters and tried to duplicate this job by submitting something with:

#SBATCH --ntasks=94
#SBATCH --nodes=1
###SBATCH --mem-per-cpu=1GB
#SBATCH --mem=47000G
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-standard-test
#SBATCH --account=tmerritt
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out

but I get:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

That's what I would have expected the user to get from their job as well. Is there something I'm missing that might have allowed this job to be submitted?

Thanks!
I can submit a job on my small test system which looks mostly like your first one:

JobId=6552 JobName=wrap
   UserId=lipi(1000) GroupId=lipi(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=lipi QOS=part_qos_standard WCKey=*
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-08-19T16:41:31 EligibleTime=2020-08-19T16:41:32
   AccrueTime=2020-08-19T16:41:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-19T16:42:39
   Partition=debug AllocNode:Sid=llagosti:15926
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=94,mem=47000G,node=1,billing=94
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/lipi/slurm/20.02
   StdErr=/home/lipi/slurm/20.02/slurm-6552.out
   StdIn=/dev/null
   StdOut=/home/lipi/slurm/20.02/slurm-6552.out
   Power=
   MailUser=(null) MailType=NONE

I see:

slurmctld: _build_node_list: No nodes satisfy JobId=6552 requirements in partition debug
slurmctld: sched: schedule: JobId=6552 non-runnable: Requested node configuration is not available

This is quite easy to get into: just submit a job which exceeds the QoS limits and which is not scheduled to run immediately (e.g. begin=now+1). You will get the job into the system, but with Reason=QOSMaxMemoryPerJob. Then you can update the job and make inconsistent changes, like increasing the memory or the number of nodes. The difference from your case is that the Reason is then set to BadConstraints.

I think the values you are seeing are just a cosmetic issue, because the job is submitted for a later time and not evaluated immediately. This can also happen for jobs which are PD and then updated inconsistently. Slurm lets you update your job and does not check that the parameters you submitted are consistent with the configuration; instead it waits for the job to be evaluated by the scheduler and then sets the Reason appropriately.

In my tests, jobs keep being scheduled from the queue as usual, but you said that after cancelling the wrong job the other PD jobs moved immediately to R. I am not seeing this. I will come back if I find a similar situation.

In the meantime, could you try to create a job like the one in my example and let me know what happens on your system?

Thanks
Thanks Felip,

I was able to submit with a future begin time but I get an error when I try to modify the job:

tmerritt@junonia:~/puma $ scontrol update jobid=66473 mem=47000G
Update of this parameter is not supported: mem=47000G
Request aborted
tmerritt@junonia:~/puma $ scontrol update jobid=66473 tres=mem=47000G
Update of this parameter is not supported: tres=mem=47000G
Request aborted

Perhaps I'm just doing it wrong?
(In reply to Todd Merritt from comment #10)
> Thanks Felip,
>
> I was able to submit with a future begin time but I get an error when I try
> to modify the job
>
> tmerritt@junonia:~/puma $ scontrol update jobid=66473 mem=47000G
> Update of this parameter is not supported: mem=47000G
> Request aborted
> tmerritt@junonia:~/puma $ scontrol update jobid=66473 tres=mem=47000G
> Update of this parameter is not supported: tres=mem=47000G
> Request aborted
>
> Perhaps I'm just doing it wrong?

Your part_qos_standard max mem per job is 8064G; is this ok?

part_qos_standard|3|00:00:00|part_qos_windfall,user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||

]$ sbatch -N1 --qos=part_qos_standard --mem-per-cpu=8065G --wrap "srun hostname"

(exceeds the QOS max memory per job)

Just try these combinations:

scontrol update jobid xxx starttime=now+1
scontrol update job xxx MinMemoryCPU=512000
scontrol update job xxx numcpus=94

Other keywords: NumTasks, NumNodes, MinCPUSNode

For the accepted keywords see scontrol_update_job() in slurm/src/scontrol/update_job.c
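Put together, a reproduction attempt might look like the following sketch (xxx stands for the job ID of the pending job; the one-hour offset is arbitrary, and whether the initial sbatch is accepted can depend on the node memory definitions):

# submit a job that exceeds the part_qos_standard limit (MaxTRES mem=8064G per job)
# but is not eligible to start yet, so it sits PD with Reason=QOSMaxMemoryPerJob
sbatch -N1 --qos=part_qos_standard --mem-per-cpu=8065G --begin=now+1hour --wrap "srun hostname"

# while it is still pending, make inconsistent updates and watch the Reason change
scontrol update jobid=xxx starttime=now+1hour
scontrol update jobid=xxx MinMemoryCPU=512000
scontrol update jobid=xxx numcpus=94
scontrol show job xxx | grep Reason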
Thanks, I was able to modify the job and get it to list the reason as BadConstraints as you indicated. I'll leave it in the queue for a bit and see if jobs start backing up behind it.
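One quick way to watch for that backlog forming is to keep an eye on the pending jobs and their reasons, for example (the format string is just one possibility):

squeue -t PD -o "%.12i %.10P %.8u %.10Q %.30r"
sdiag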
Hi,

I haven't seen this pop up again. You can close this ticket out and I'll open a new one with better logs and an sdiag if it happens again.

Thanks,
Todd