| Summary: | help explaining scheduler backlog | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Todd Merritt <tmerritt> |
| Component: | Scheduling | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 19.05.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | U of AZ | | |
| Attachments: | slurmctld log, slurm config | | |
Description
Todd Merritt
2020-08-17 12:51:16 MDT
(In reply to Todd Merritt from comment #0)

> We had the following job submitted today:
>
> ```
> root@ericidle:/etc/cron.daily # scontrol show job 62688
> JobId=62688 JobName=po_mcomp
>    UserId=emmanuelgonzalez(45150) GroupId=lyons-lab(31100) MCS_label=N/A
>    Priority=4 Nice=0 Account=lyons-lab QOS=part_qos_standard
>    JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=18:00:00 TimeMin=N/A
>    SubmitTime=2020-08-17T10:46:39 EligibleTime=2020-08-17T10:46:39
>    AccrueTime=2020-08-17T10:46:39
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-17T11:44:42
>    Partition=standard AllocNode:Sid=wentletrap:12390
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=94,mem=47000G,node=1,billing=94
>    Socks/Node=* NtasksPerN:B:S:C=94:0:*:* CoreSpec=*
>    MinCPUsNode=94 MinMemoryCPU=500G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=bash
>    WorkDir=/xdisk/ericlyons/big_data/egonzalez/PhytoOracle/stereoTopRGB
>    Power=
> ```
>
> This job was preventing any other job from being scheduled, for some reason, for at least 30 minutes. In the slurmctld log, all that was reported was that this job would never run in partition standard, and no attempt appeared to be made to start another job. I'm hoping you can explain this behavior and/or let me know what I can do to avoid this situation in the future.
>
> Thanks!

Hi Todd,

What happened after 30 minutes? Did jobs suddenly start to be scheduled? I would need to see your backfill parameters (send me back your slurm.conf) and the slurmctld log, along with the output of:

- sacctmgr show qos -p
- sinfo
- squeue
- sdiag (though if everything is working fine now it won't help much)

In theory, one job can reserve resources for itself, so even if the nodes look idle, they are reserved. But that is not OK if the job cannot run due to a limit; in that case the resources must be freed. Moreover, the job only requested one node. I will also run a test to see whether this case works as expected. It may be a different situation, not directly related to this job.

Thanks
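A one-pass way to capture the state requested above while the queue is stuck (a sketch; the output file names are illustrative, not from this ticket):

```
# Snapshot of scheduler state for the ticket
sacctmgr show qos -p > qos.txt
sinfo  > sinfo.txt
squeue > squeue.txt
sdiag  > sdiag.txt
```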
Hi,

I ran scancel on the job after 30 minutes, since it was blocking several interactive jobs from starting. As soon as I canceled it, the whole backlog of jobs started; the backlogged jobs had all listed Priority as their reason for not starting.
I'll attach the requested files.
```
root@ericidle:~ # sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00|||cluster|||1.000000||||||||||||||||||
part_qos_windfall|1|00:00:00|user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
part_qos_standard|3|00:00:00|part_qos_windfall,user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
user_qos_bjoyce3|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=2112|cpu=21000000||2000|2000|||||||||||||
user_qos_tmerritt|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=2112|cpu=21000000||2000|2000|||||||||||||
user_qos_idlecycles|0|00:00:00|||cluster|OverPartQOS||1.000000||||100|200|||||||||||||
user_qos_nkchen|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3674,gres/gpu:volta=0|cpu=16819200||2000|2000|||||||||||||
user_qos_timeifler|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3866,gres/gpu:volta=0|cpu=25228800||2000|2000|||||||||||||
user_qos_josh|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3338,gres/gpu:volta=2|cpu=2102400||2000|2000|||||||||||||
user_qos_jlbredas|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=5210,gres/gpu:volta=0|cpu=84096000||2000|2000|||||||||||||
user_qos_denard|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3698,gres/gpu:volta=1|cpu=17870400||2000|2000|||||||||||||
user_qos_kgklein|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3386,gres/gpu:volta=0|cpu=4204800||2000|2000|||||||||||||
user_qos_xytang|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3386,gres/gpu:volta=0|cpu=4204800||2000|2000|||||||||||||
user_qos_jrussell|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=4442,gres/gpu:volta=0|cpu=50457600||2000|2000|||||||||||||
root@ericidle:~ # sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
windfall* up infinite 6 down* r1u29n2,r2u06n1,r2u12n1,r2u14n2,r3u05n2,r3u36n2
windfall* up infinite 1 drain r1u11n1
windfall* up infinite 80 mix r1u03n[1-2],r1u07n[1-2],r1u08n[1-2],r1u09n[1-2],r1u10n[1-2],r1u11n2,r1u12n[1-2],r1u16n1,r1u17n1,r1u18n[1-2],r1u25n1,r2u27n[1-2],r2u28n[1-2],r2u29n[1-2],r2u30n[1-2],r2u31n[1-2],r2u32n[1-2],r2u33n[1-2],r2u34n[1-2],r2u35n[1-2],r2u36n[1-2],r3u31n[1-2],r3u32n[1-2],r3u33n[1-2],r3u34n[1-2],r3u35n[1-2],r4u13n[1-2],r4u15n[1-2],r4u18n[1-2],r4u25n[1-2],r4u26n[1-2],r4u27n[1-2],r4u28n[1-2],r4u29n[1-2],r4u30n[1-2],r4u31n[1-2],r4u32n[1-2],r4u33n[1-2],r4u34n[1-2],r4u35n[1-2],r4u36n[1-2],r5u19n1,r5u25n1
windfall* up infinite 6 alloc r1u04n[1-2],r1u05n[1-2],r1u06n[1-2]
windfall* up infinite 129 idle r1u13n[1-2],r1u14n[1-2],r1u15n[1-2],r1u16n2,r1u17n2,r1u25n2,r1u26n[1-2],r1u27n[1-2],r1u28n[1-2],r1u29n1,r1u30n[1-2],r1u31n[1-2],r1u32n[1-2],r1u33n[1-2],r1u34n[1-2],r1u35n[1-2],r1u36n[1-2],r2u03n[1-2],r2u04n[1-2],r2u05n[1-2],r2u06n2,r2u07n[1-2],r2u08n[1-2],r2u09n[1-2],r2u10n[1-2],r2u11n[1-2],r2u12n2,r2u13n[1-2],r2u14n1,r2u15n[1-2],r2u16n[1-2],r2u17n[1-2],r2u18n[1-2],r2u25n[1-2],r2u26n[1-2],r3u05n1,r3u06n[1-2],r3u07n[1-2],r3u08n[1-2],r3u09n[1-2],r3u10n[1-2],r3u11n[1-2],r3u12n[1-2],r3u13n[1-2],r3u14n[1-2],r3u15n[1-2],r3u16n[1-2],r3u17n[1-2],r3u18n[1-2],r3u25n[1-2],r3u26n[1-2],r3u27n[1-2],r3u28n[1-2],r3u29n[1-2],r3u30n[1-2],r3u36n1,r4u07n[1-2],r4u08n[1-2],r4u09n[1-2],r4u10n[1-2],r4u11n[1-2],r4u12n[1-2],r4u14n[1-2],r4u16n[1-2],r4u17n[1-2],r5u11n1,r5u13n1,r5u15n1,r5u17n1,r5u24n1,r5u27n1,r5u29n1,r5u31n1
standard up infinite 6 down* r1u29n2,r2u06n1,r2u12n1,r2u14n2,r3u05n2,r3u36n2
standard up infinite 1 drain r1u11n1
standard up infinite 80 mix r1u03n[1-2],r1u07n[1-2],r1u08n[1-2],r1u09n[1-2],r1u10n[1-2],r1u11n2,r1u12n[1-2],r1u16n1,r1u17n1,r1u18n[1-2],r1u25n1,r2u27n[1-2],r2u28n[1-2],r2u29n[1-2],r2u30n[1-2],r2u31n[1-2],r2u32n[1-2],r2u33n[1-2],r2u34n[1-2],r2u35n[1-2],r2u36n[1-2],r3u31n[1-2],r3u32n[1-2],r3u33n[1-2],r3u34n[1-2],r3u35n[1-2],r4u13n[1-2],r4u15n[1-2],r4u18n[1-2],r4u25n[1-2],r4u26n[1-2],r4u27n[1-2],r4u28n[1-2],r4u29n[1-2],r4u30n[1-2],r4u31n[1-2],r4u32n[1-2],r4u33n[1-2],r4u34n[1-2],r4u35n[1-2],r4u36n[1-2],r5u19n1,r5u25n1
standard up infinite 6 alloc r1u04n[1-2],r1u05n[1-2],r1u06n[1-2]
standard up infinite 129 idle r1u13n[1-2],r1u14n[1-2],r1u15n[1-2],r1u16n2,r1u17n2,r1u25n2,r1u26n[1-2],r1u27n[1-2],r1u28n[1-2],r1u29n1,r1u30n[1-2],r1u31n[1-2],r1u32n[1-2],r1u33n[1-2],r1u34n[1-2],r1u35n[1-2],r1u36n[1-2],r2u03n[1-2],r2u04n[1-2],r2u05n[1-2],r2u06n2,r2u07n[1-2],r2u08n[1-2],r2u09n[1-2],r2u10n[1-2],r2u11n[1-2],r2u12n2,r2u13n[1-2],r2u14n1,r2u15n[1-2],r2u16n[1-2],r2u17n[1-2],r2u18n[1-2],r2u25n[1-2],r2u26n[1-2],r3u05n1,r3u06n[1-2],r3u07n[1-2],r3u08n[1-2],r3u09n[1-2],r3u10n[1-2],r3u11n[1-2],r3u12n[1-2],r3u13n[1-2],r3u14n[1-2],r3u15n[1-2],r3u16n[1-2],r3u17n[1-2],r3u18n[1-2],r3u25n[1-2],r3u26n[1-2],r3u27n[1-2],r3u28n[1-2],r3u29n[1-2],r3u30n[1-2],r3u36n1,r4u07n[1-2],r4u08n[1-2],r4u09n[1-2],r4u10n[1-2],r4u11n[1-2],r4u12n[1-2],r4u14n[1-2],r4u16n[1-2],r4u17n[1-2],r5u11n1,r5u13n1,r5u15n1,r5u17n1,r5u24n1,r5u27n1,r5u29n1,r5u31n1
root@ericidle:~ # squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
62980_[7-10] standard slurm-su fahclien PD 0:00 1 (AssocGrpCPUMinutesLimit)
60663 standard test7 jeongpil R 6-01:54:37 2 r3u35n[1-2]
60662 standard test6 jeongpil R 6-01:54:49 2 r3u34n[1-2]
60661 standard test4 jeongpil R 6-01:54:56 2 r3u33n[1-2]
60660 standard test2 jeongpil R 6-01:55:03 2 r3u32n[1-2]
60659 standard test1 jeongpil R 6-01:55:12 2 r3u31n[1-2]
60578 windfall test2.5 jeongpil R 7-06:35:05 2 r4u36n[1-2]
60577 windfall test2.4 jeongpil R 7-06:35:23 2 r4u35n[1-2]
60574 windfall test2.3 jeongpil R 7-06:41:04 2 r4u34n[1-2]
60573 windfall test2.2 jeongpil R 7-06:41:10 2 r4u33n[1-2]
60568 windfall test2.1 jeongpil R 7-06:41:53 2 r4u32n[1-2]
60567 windfall test2.0 jeongpil R 7-06:41:59 2 r4u31n[1-2]
60566 windfall test1.9 jeongpil R 7-06:42:59 2 r4u30n[1-2]
60565 windfall test1.8 jeongpil R 7-06:43:24 2 r4u29n[1-2]
60564 windfall test1.7 jeongpil R 7-06:43:46 2 r4u28n[1-2]
60563 windfall test1.6 jeongpil R 7-06:44:23 2 r4u27n[1-2]
60561 standard test1.1 jeongpil R 7-06:46:12 2 r4u26n[1-2]
60560 standard test9 jeongpil R 7-06:46:17 2 r4u25n[1-2]
60559 standard test8 jeongpil R 7-06:46:20 2 r4u18n[1-2]
60556 standard test5 jeongpil R 7-06:46:47 2 r4u15n[1-2]
60554 standard test3 jeongpil R 7-06:47:06 2 r4u13n[1-2]
64417 standard benchmar josephlo R 5:20:53 1 r1u12n2
64427 standard Helicove benowitz R 4:28:05 1 r5u19n1
64436 windfall n208835 jeongpil R 3:58:57 1 r2u29n2
64435 windfall n208830 jeongpil R 3:59:03 1 r2u29n1
64434 windfall n208805 jeongpil R 3:59:06 1 r2u30n2
64433 windfall n208800 jeongpil R 3:59:15 1 r2u30n1
64432 windfall n204835 jeongpil R 4:01:03 1 r2u31n2
64431 windfall n204830 jeongpil R 4:01:08 1 r2u31n1
64430 windfall n204805 jeongpil R 4:01:13 1 r2u32n2
64429 windfall n204800 jeongpil R 4:01:18 1 r2u32n1
64414 standard Abyss-k5 natalier R 5:29:34 1 r1u12n1
64413 standard Abyss-k4 natalier R 5:30:12 1 r1u11n2
64412 standard Abyss-k4 natalier R 5:30:46 1 r1u10n1
64392 standard eng_memo plovett R 6:03:23 1 r1u07n2
64387 standard eng_memo plovett R 6:11:32 1 r1u07n1
64366 windfall n216835 jeongpil R 8:57:05 1 r2u27n2
64365 windfall n216830 jeongpil R 8:57:11 1 r2u27n1
64364 windfall n216805 jeongpil R 8:57:16 1 r2u28n2
64363 windfall n216800 jeongpil R 8:57:22 1 r2u28n1
64264 windfall hyphy denard R 11:08:13 1 r1u09n2
64225 windfall hyphy denard R 11:08:16 1 r1u09n1
64236 windfall hyphy denard R 11:08:16 1 r1u09n2
64237 windfall hyphy denard R 11:08:16 1 r1u09n2
64238 windfall hyphy denard R 11:08:16 1 r1u09n2
64239 windfall hyphy denard R 11:08:16 1 r1u09n2
64249 windfall hyphy denard R 11:08:16 1 r1u09n2
64250 windfall hyphy denard R 11:08:16 1 r1u09n2
64255 windfall hyphy denard R 11:08:16 1 r1u09n2
64256 windfall hyphy denard R 11:08:16 1 r1u09n2
64260 windfall hyphy denard R 11:08:16 1 r1u09n2
64183 windfall hyphy denard R 11:08:19 1 r1u07n2
64185 windfall hyphy denard R 11:08:19 1 r1u09n1
64189 windfall hyphy denard R 11:08:19 1 r1u09n1
64198 windfall hyphy denard R 11:08:19 1 r1u09n1
64202 windfall hyphy denard R 11:08:19 1 r1u09n1
64211 windfall hyphy denard R 11:08:19 1 r1u09n1
64213 windfall hyphy denard R 11:08:19 1 r1u09n1
64141 windfall hyphy denard R 11:08:21 1 r1u03n1
64142 windfall hyphy denard R 11:08:21 1 r1u03n1
64147 windfall hyphy denard R 11:08:21 1 r1u03n1
64153 windfall hyphy denard R 11:08:21 1 r1u07n1
64181 windfall hyphy denard R 11:08:21 1 r1u07n2
64113 windfall sn136835 jeongpil R 11:16:12 1 r2u33n2
64112 windfall sn136830 jeongpil R 11:16:16 1 r2u33n1
64111 windfall sn136805 jeongpil R 11:16:20 1 r2u34n2
64110 windfall sn136800 jeongpil R 11:16:24 1 r2u34n1
64109 windfall n136835 jeongpil R 11:17:44 1 r2u35n2
64108 windfall n136830 jeongpil R 11:17:49 1 r2u35n1
64107 windfall n136805 jeongpil R 11:17:53 1 r2u36n2
64106 windfall n136800 jeongpil R 11:17:56 1 r2u36n1
63990_0 standard SEDNoIR rehvidin R 11:57:39 1 r1u03n2
63990_1 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_2 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_3 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_4 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_5 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_6 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_7 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n1
63990_8 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_9 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_10 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_11 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_12 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_13 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_14 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_15 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_16 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_17 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_18 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_19 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_20 standard SEDNoIR rehvidin R 11:57:39 1 r1u07n2
63990_21 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_22 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_23 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_24 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_25 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_26 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_27 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_28 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n1
63990_29 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_30 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_31 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_32 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_33 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_34 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_35 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_36 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_37 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_38 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_39 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_40 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_41 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_42 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_43 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_44 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_45 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_46 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_47 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_48 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_49 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_50 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_51 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_52 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_53 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_54 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_55 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_56 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_57 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_58 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_59 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_60 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_61 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_62 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_63 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_64 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_65 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_66 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_67 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_68 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_69 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_70 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_71 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_72 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_73 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_74 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_75 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_76 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_77 standard SEDNoIR rehvidin R 11:57:39 1 r1u08n2
63990_78 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_79 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_80 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_81 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_82 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_83 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_84 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_85 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_86 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_87 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_88 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63990_89 standard SEDNoIR rehvidin R 11:57:39 1 r1u09n1
63160 windfall fv_s_tes lauterbu R 12:31:08 1 r1u07n2
63161 windfall fv_s_200 lauterbu R 12:31:08 1 r1u08n1
63145 windfall hyphy denard R 13:12:38 1 r1u07n1
63109 windfall hyphy denard R 13:12:41 1 r1u07n1
63114 windfall hyphy denard R 13:12:41 1 r1u07n1
63130 windfall hyphy denard R 13:12:41 1 r1u07n1
63086 windfall hyphy denard R 13:12:43 1 r1u07n1
63078 windfall hyphy denard R 13:12:44 1 r1u07n1
63079 windfall hyphy denard R 13:12:44 1 r1u07n1
63070 windfall hyphy denard R 13:13:02 1 r1u03n2
63010 windfall hyphy denard R 13:13:05 1 r1u03n1
63019 windfall hyphy denard R 13:13:05 1 r1u03n1
63038 windfall hyphy denard R 13:13:05 1 r1u03n2
63040 windfall hyphy denard R 13:13:05 1 r1u03n2
62980_1 standard slurm-su fahclien R 14:32:02 1 r1u04n1
62980_2 standard slurm-su fahclien R 14:32:02 1 r1u04n2
62980_3 standard slurm-su fahclien R 14:32:02 1 r1u05n1
62980_4 standard slurm-su fahclien R 14:32:02 1 r1u05n2
62980_5 standard slurm-su fahclien R 14:32:02 1 r1u06n1
62980_6 standard slurm-su fahclien R 14:32:02 1 r1u06n2
62854 windfall hyphy denard R 15:45:56 1 r1u03n1
62863 windfall hyphy denard R 15:45:56 1 r1u03n1
62838 windfall hyphy denard R 15:45:59 1 r1u03n1
62846 windfall hyphy denard R 15:45:59 1 r1u03n1
62709 windfall tar_back denard R 17:45:59 1 r1u10n2
62712 windfall database emsenhub R 17:45:59 1 r1u10n2
62682 windfall hyphy denard R 18:45:23 1 r1u25n1
62641 windfall hyphy denard R 18:45:26 1 r1u18n2
62644 windfall hyphy denard R 18:45:26 1 r1u18n2
62647 windfall hyphy denard R 18:45:26 1 r1u18n2
62537 windfall hyphy denard R 18:47:16 1 r1u18n1
62564 windfall hyphy denard R 18:47:16 1 r1u18n1
62569 windfall hyphy denard R 18:47:16 1 r1u18n1
62476 windfall hyphy denard R 18:47:22 1 r1u17n1
62487 windfall hyphy denard R 18:47:22 1 r1u17n1
62414 windfall hyphy denard R 18:47:25 1 r1u16n1
62437 windfall hyphy denard R 18:47:25 1 r1u17n1
62373 windfall hyphy denard R 18:47:28 1 r1u10n2
62366 windfall hyphy denard R 18:47:30 1 r1u10n2
62276 windfall hyphy denard R 18:47:36 1 r1u08n1
62277 windfall hyphy denard R 18:47:36 1 r1u08n1
62306 windfall hyphy denard R 18:47:36 1 r1u09n2
62211 windfall hyphy denard R 18:47:43 1 r1u07n2
60580 standard lstmAnal dmschwar R 7-01:27:44 1 r5u25n1
```
Created attachment 15481: slurmctld log

Created attachment 15482: slurm config
Unfortunately your debug level of 'info' (the default) doesn't let me see anything relevant. You should also try to fix several errors that appear in your logs:

- "Invalid argument" — are the related nodes running an old version?
- low real_memory size (483605 < 515830) — you need to adjust the configured memory

Why do you have this?

DebugFlags=NO_CONF_HASH

Besides that, I'd need you to run with DebugFlags=backfill and SlurmctldDebug=debug (debug2 would be ideal for getting more information, if it doesn't impact your performance; see the sketch after this exchange). After increasing the log level, do you think it would be possible to reproduce the issue by running a job with the exact parameters of the one that caused it? If so, I'd then need the sdiag output while the queue is stuck, plus the slurmctld log.

Thanks

Hi Felip,

Yes, I cleaned up most of those errors when I was investigating this. We had a node that apparently lost a DIMM, and I took it offline. The "Invalid argument" errors from the r5 nodes were caused by my colleague adding them to Slurm with an incorrect number of GPUs; I re-imaged them and they now register correctly.

When we initially started this deployment, I added the DebugFlags=NO_CONF_HASH flag in hopes that I could run a minimal configuration on the job submission nodes, but Slurm has disabused me of that notion, and the configurations are now fully synchronized through a horrible manual process. I'm looking forward to the configuration distribution feature I saw in v20, once we're able to upgrade :)

I think I can reproduce it. I'm pretty sure the culprit was that the job requested 47T of RAM. I can try modifying those parameters and submitting a test to see if it blocks again. Are there any other parameters you'd suggest setting to help us debug scheduling issues in the future?

Thanks!

Hrm, well, I set the scheduler parameters and tried to duplicate this job by submitting something with

```
#SBATCH --ntasks=94
#SBATCH --nodes=1
###SBATCH --mem-per-cpu=1GB
#SBATCH --mem=47000G
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-standard-test
#SBATCH --account=tmerritt
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out
```

but I get

```
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

That's what I would have expected the user to get from their job as well. Is there something I'm missing that might have allowed this job to be submitted?

Thanks!
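A minimal sketch of the logging change requested above, assuming the standard scontrol subcommands; the slurm.conf lines persist across restarts, while the scontrol commands take effect immediately:

```
# In slurm.conf (picked up on reconfigure/restart):
#   SlurmctldDebug=debug2
#   DebugFlags=Backfill

# Or set at runtime, without restarting slurmctld:
scontrol setdebug debug2          # raise the slurmctld log level
scontrol setdebugflags +backfill  # add backfill scheduler debug output
```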
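As for how the original job got past submission: its record shows MinMemoryCPU=500G rather than a per-node request, so one plausible route (an assumption, not confirmed in this ticket) is a per-CPU memory request, which passes the per-node submission check (512000 MB fits under the 515830 MB of configured node memory) even though 94 tasks on one node imply 47000G in aggregate:

```
# Hypothetical resubmission of the original request via --mem-per-cpu:
# each CPU's 500G fits on a single node, but 94 x 500G = 47000G in total
sbatch --ntasks=94 --nodes=1 --mem-per-cpu=500G --time=18:00:00 \
       --partition=standard --wrap "srun hostname"
```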
I can submit a job on my small system which looks mostly like your first one:

```
JobId=6552 JobName=wrap
   UserId=lipi(1000) GroupId=lipi(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=lipi QOS=part_qos_standard WCKey=*
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-08-19T16:41:31 EligibleTime=2020-08-19T16:41:32
   AccrueTime=2020-08-19T16:41:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-19T16:42:39
   Partition=debug AllocNode:Sid=llagosti:15926
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=94,mem=47000G,node=1,billing=94
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/lipi/slurm/20.02
   StdErr=/home/lipi/slurm/20.02/slurm-6552.out
   StdIn=/dev/null
   StdOut=/home/lipi/slurm/20.02/slurm-6552.out
   Power=
   MailUser=(null) MailType=NONE
```

I see:

```
slurmctld: _build_node_list: No nodes satisfy JobId=6552 requirements in partition debug
slurmctld: sched: schedule: JobId=6552 non-runnable: Requested node configuration is not available
```

This is quite easy to reproduce: just submit a job which exceeds the QoS limits and which is not scheduled to run immediately (e.g. --begin=now+1). The job gets into the system, but with Reason=QOSMaxMemoryPerJob. Then you can update the job and make inconsistent changes, such as increasing the memory or the number of nodes (a consolidated sketch follows this exchange). The difference from your case is that the Reason is then set to BadConstraints.

I think the numbers you are seeing are just a cosmetic issue, because the job is submitted for a later time and not evaluated immediately. This can also happen for jobs which are pending and are then updated inconsistently: Slurm lets you update your job without checking that the new parameters are consistent with the configuration; it waits for the scheduler to evaluate the job and then sets the Reason appropriately.

In my tests, jobs keep moving through the queue as usual, but you said that after cancelling the offending job, the other pending jobs immediately moved to running; I am not seeing that. I will come back if I find a similar situation. In the meantime, could you try to create a job like the one in my example and let me know what happens on your system?

Thanks

Thanks Felip,

I was able to submit with a future begin time, but I get an error when I try to modify the job:

```
tmerritt@junonia:~/puma $ scontrol update jobid=66473 mem=47000G
Update of this parameter is not supported: mem=47000G
Request aborted
tmerritt@junonia:~/puma $ scontrol update jobid=66473 tres=mem=47000G
Update of this parameter is not supported: tres=mem=47000G
Request aborted
```

Perhaps I'm just doing it wrong?
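Putting the recipe together as a minimal sketch (the sizes and the delay are illustrative; part_qos_standard caps jobs at mem=8064G, and the accepted update keywords are listed in the next comment):

```
# 1. Submit a deferred one-node job whose aggregate memory (94 x 500G)
#    exceeds the QoS cap; it is accepted and pends with
#    Reason=QOSMaxMemoryPerJob
sbatch --qos=part_qos_standard --nodes=1 --ntasks=94 --mem-per-cpu=500G \
       --begin=now+60 --wrap "srun hostname"

# 2. While it pends, push an inconsistent update; after the next
#    scheduler pass, the Reason becomes BadConstraints
scontrol update jobid=<jobid> minmemorycpu=512000 numcpus=94
```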
(In reply to Todd Merritt from comment #10)

Your part_qos_standard max memory per job is 8064G; is this OK?

```
part_qos_standard|3|00:00:00|part_qos_windfall,user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
```

```
$ sbatch -N1 --qos=part_qos_standard --mem-per-cpu=8065G --wrap "srun hostname"
```

(exceeds the QoS max memory per job)

Just try these combinations:

```
scontrol update jobid xxx starttime=now+1
scontrol update job xxx MinMemoryCPU=512000
scontrol update job xxx numcpus=94
```

Other keywords: NumTasks, NumNodes, MinCPUSNode. For the accepted keywords, see scontrol_update_job() in slurm/src/scontrol/update_job.c.

Thanks, I was able to modify the job and get it to list the reason as BadConstraints, as you indicated. I'll leave it in the queue for a bit and see if jobs start backing up behind it.

Hi,

I haven't seen this pop up again. You can close this ticket out, and I'll open a new one with better logs and an sdiag if it happens again.

Thanks,
Todd