Ticket 9598

Summary: help explaining scheduler backlog
Product: Slurm Reporter: Todd Merritt <tmerritt>
Component: Scheduling Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.6   
Hardware: Linux   
OS: Linux   
Site: U of AZ
Attachments: slurmctld log
slurm config

Description Todd Merritt 2020-08-17 12:51:16 MDT
We had the following job submitted today:

root@ericidle:/etc/cron.daily # scontrol show job 62688
JobId=62688 JobName=po_mcomp
   UserId=emmanuelgonzalez(45150) GroupId=lyons-lab(31100) MCS_label=N/A
   Priority=4 Nice=0 Account=lyons-lab QOS=part_qos_standard
   JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=18:00:00 TimeMin=N/A
   SubmitTime=2020-08-17T10:46:39 EligibleTime=2020-08-17T10:46:39
   AccrueTime=2020-08-17T10:46:39
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-17T11:44:42
   Partition=standard AllocNode:Sid=wentletrap:12390
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=94,mem=47000G,node=1,billing=94
   Socks/Node=* NtasksPerN:B:S:C=94:0:*:* CoreSpec=*
   MinCPUsNode=94 MinMemoryCPU=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/xdisk/ericlyons/big_data/egonzalez/PhytoOracle/stereoTopRGB
   Power=

This job was preventing any other job from being scheduled for at least 30 minutes. In the slurmctld log, all that was reported was that this job would never run in partition standard, and no attempt appeared to be made to start another job. I'm hoping you can explain this behavior and/or let me know what I can do to avoid this situation in the future.
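For reference, the 47000G total in the TRES line is just MinMemoryCPU multiplied by the CPU count; a quick sanity check in plain shell arithmetic (variable names are ours, values from the scontrol output):

```shell
# MinMemoryCPU=500G and NumCPUs=94 from the scontrol output above:
mem_per_cpu_g=500
ncpus=94
echo "$(( mem_per_cpu_g * ncpus ))G"   # prints 47000G
```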

Thanks!
Comment 1 Felip Moll 2020-08-18 05:47:57 MDT
(In reply to Todd Merritt from comment #0)
> This job was preventing any other job from being scheduled for some reason
> for at least 30 minutes. In the slurmctld log, all that was reported was
> that this job would never run in partition standard and no attempt appeared
> to be made to start another job. I'm hoping you can explain this behavior
> and/or let me know what I can do to avoid this situation in the future.

Hi Todd,

What happened after the 30 minutes? Did jobs suddenly start to be scheduled again?

I would need to see your backfill parameters (please send me your slurm.conf) and the slurmctld log, along with the output of:

- sacctmgr show qos -p
- sinfo
- squeue
- sdiag (though if everything is working fine now it won't help much)
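If it helps, the four outputs above can be captured in one go for attaching; a minimal sketch (the slurm-diag.txt filename is just a placeholder):

```shell
# Capture the requested diagnostics into a single file.
{
    echo '== sacctmgr show qos ==';  sacctmgr show qos -p
    echo '== sinfo ==';              sinfo
    echo '== squeue ==';             squeue
    echo '== sdiag ==';              sdiag
} > slurm-diag.txt 2>&1
```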

Theoretically, one job can reserve resources for itself, so even if the nodes look idle, they are reserved. But that is not correct when the job cannot run due to a limit; in that case the resources must be freed. Moreover, the job only requested one node. I will also run a test to see whether this case works as expected. It may be another situation not directly related to this job.

Thanks
Comment 2 Todd Merritt 2020-08-18 06:32:24 MDT
Hi,

I ran scancel on the job after 30 minutes since it was blocking several interactive jobs from starting. As soon as I canceled it, the backlog of jobs all started. The backlogged jobs all listed Priority as their reason for not starting.

I'll attach the requested files.

root@ericidle:~ # sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00|||cluster|||1.000000||||||||||||||||||
part_qos_windfall|1|00:00:00|user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
part_qos_standard|3|00:00:00|part_qos_windfall,user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||
user_qos_bjoyce3|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=2112|cpu=21000000||2000|2000|||||||||||||
user_qos_tmerritt|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=2112|cpu=21000000||2000|2000|||||||||||||
user_qos_idlecycles|0|00:00:00|||cluster|OverPartQOS||1.000000||||100|200|||||||||||||
user_qos_nkchen|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3674,gres/gpu:volta=0|cpu=16819200||2000|2000|||||||||||||
user_qos_timeifler|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3866,gres/gpu:volta=0|cpu=25228800||2000|2000|||||||||||||
user_qos_josh|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3338,gres/gpu:volta=2|cpu=2102400||2000|2000|||||||||||||
user_qos_jlbredas|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=5210,gres/gpu:volta=0|cpu=84096000||2000|2000|||||||||||||
user_qos_denard|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3698,gres/gpu:volta=1|cpu=17870400||2000|2000|||||||||||||
user_qos_kgklein|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3386,gres/gpu:volta=0|cpu=4204800||2000|2000|||||||||||||
user_qos_xytang|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=3386,gres/gpu:volta=0|cpu=4204800||2000|2000|||||||||||||
user_qos_jrussell|5|00:00:00|part_qos_windfall||cluster|OverPartQOS||1.000000|cpu=4442,gres/gpu:volta=0|cpu=50457600||2000|2000|||||||||||||

root@ericidle:~ # sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
windfall*    up   infinite      6  down* r1u29n2,r2u06n1,r2u12n1,r2u14n2,r3u05n2,r3u36n2
windfall*    up   infinite      1  drain r1u11n1
windfall*    up   infinite     80    mix r1u03n[1-2],r1u07n[1-2],r1u08n[1-2],r1u09n[1-2],r1u10n[1-2],r1u11n2,r1u12n[1-2],r1u16n1,r1u17n1,r1u18n[1-2],r1u25n1,r2u27n[1-2],r2u28n[1-2],r2u29n[1-2],r2u30n[1-2],r2u31n[1-2],r2u32n[1-2],r2u33n[1-2],r2u34n[1-2],r2u35n[1-2],r2u36n[1-2],r3u31n[1-2],r3u32n[1-2],r3u33n[1-2],r3u34n[1-2],r3u35n[1-2],r4u13n[1-2],r4u15n[1-2],r4u18n[1-2],r4u25n[1-2],r4u26n[1-2],r4u27n[1-2],r4u28n[1-2],r4u29n[1-2],r4u30n[1-2],r4u31n[1-2],r4u32n[1-2],r4u33n[1-2],r4u34n[1-2],r4u35n[1-2],r4u36n[1-2],r5u19n1,r5u25n1
windfall*    up   infinite      6  alloc r1u04n[1-2],r1u05n[1-2],r1u06n[1-2]
windfall*    up   infinite    129   idle r1u13n[1-2],r1u14n[1-2],r1u15n[1-2],r1u16n2,r1u17n2,r1u25n2,r1u26n[1-2],r1u27n[1-2],r1u28n[1-2],r1u29n1,r1u30n[1-2],r1u31n[1-2],r1u32n[1-2],r1u33n[1-2],r1u34n[1-2],r1u35n[1-2],r1u36n[1-2],r2u03n[1-2],r2u04n[1-2],r2u05n[1-2],r2u06n2,r2u07n[1-2],r2u08n[1-2],r2u09n[1-2],r2u10n[1-2],r2u11n[1-2],r2u12n2,r2u13n[1-2],r2u14n1,r2u15n[1-2],r2u16n[1-2],r2u17n[1-2],r2u18n[1-2],r2u25n[1-2],r2u26n[1-2],r3u05n1,r3u06n[1-2],r3u07n[1-2],r3u08n[1-2],r3u09n[1-2],r3u10n[1-2],r3u11n[1-2],r3u12n[1-2],r3u13n[1-2],r3u14n[1-2],r3u15n[1-2],r3u16n[1-2],r3u17n[1-2],r3u18n[1-2],r3u25n[1-2],r3u26n[1-2],r3u27n[1-2],r3u28n[1-2],r3u29n[1-2],r3u30n[1-2],r3u36n1,r4u07n[1-2],r4u08n[1-2],r4u09n[1-2],r4u10n[1-2],r4u11n[1-2],r4u12n[1-2],r4u14n[1-2],r4u16n[1-2],r4u17n[1-2],r5u11n1,r5u13n1,r5u15n1,r5u17n1,r5u24n1,r5u27n1,r5u29n1,r5u31n1
standard     up   infinite      6  down* r1u29n2,r2u06n1,r2u12n1,r2u14n2,r3u05n2,r3u36n2
standard     up   infinite      1  drain r1u11n1
standard     up   infinite     80    mix r1u03n[1-2],r1u07n[1-2],r1u08n[1-2],r1u09n[1-2],r1u10n[1-2],r1u11n2,r1u12n[1-2],r1u16n1,r1u17n1,r1u18n[1-2],r1u25n1,r2u27n[1-2],r2u28n[1-2],r2u29n[1-2],r2u30n[1-2],r2u31n[1-2],r2u32n[1-2],r2u33n[1-2],r2u34n[1-2],r2u35n[1-2],r2u36n[1-2],r3u31n[1-2],r3u32n[1-2],r3u33n[1-2],r3u34n[1-2],r3u35n[1-2],r4u13n[1-2],r4u15n[1-2],r4u18n[1-2],r4u25n[1-2],r4u26n[1-2],r4u27n[1-2],r4u28n[1-2],r4u29n[1-2],r4u30n[1-2],r4u31n[1-2],r4u32n[1-2],r4u33n[1-2],r4u34n[1-2],r4u35n[1-2],r4u36n[1-2],r5u19n1,r5u25n1
standard     up   infinite      6  alloc r1u04n[1-2],r1u05n[1-2],r1u06n[1-2]
standard     up   infinite    129   idle r1u13n[1-2],r1u14n[1-2],r1u15n[1-2],r1u16n2,r1u17n2,r1u25n2,r1u26n[1-2],r1u27n[1-2],r1u28n[1-2],r1u29n1,r1u30n[1-2],r1u31n[1-2],r1u32n[1-2],r1u33n[1-2],r1u34n[1-2],r1u35n[1-2],r1u36n[1-2],r2u03n[1-2],r2u04n[1-2],r2u05n[1-2],r2u06n2,r2u07n[1-2],r2u08n[1-2],r2u09n[1-2],r2u10n[1-2],r2u11n[1-2],r2u12n2,r2u13n[1-2],r2u14n1,r2u15n[1-2],r2u16n[1-2],r2u17n[1-2],r2u18n[1-2],r2u25n[1-2],r2u26n[1-2],r3u05n1,r3u06n[1-2],r3u07n[1-2],r3u08n[1-2],r3u09n[1-2],r3u10n[1-2],r3u11n[1-2],r3u12n[1-2],r3u13n[1-2],r3u14n[1-2],r3u15n[1-2],r3u16n[1-2],r3u17n[1-2],r3u18n[1-2],r3u25n[1-2],r3u26n[1-2],r3u27n[1-2],r3u28n[1-2],r3u29n[1-2],r3u30n[1-2],r3u36n1,r4u07n[1-2],r4u08n[1-2],r4u09n[1-2],r4u10n[1-2],r4u11n[1-2],r4u12n[1-2],r4u14n[1-2],r4u16n[1-2],r4u17n[1-2],r5u11n1,r5u13n1,r5u15n1,r5u17n1,r5u24n1,r5u27n1,r5u29n1,r5u31n1

root@ericidle:~ # squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      62980_[7-10]  standard slurm-su fahclien PD       0:00      1 (AssocGrpCPUMinutesLimit)
             60663  standard    test7 jeongpil  R 6-01:54:37      2 r3u35n[1-2]
             60662  standard    test6 jeongpil  R 6-01:54:49      2 r3u34n[1-2]
             60661  standard    test4 jeongpil  R 6-01:54:56      2 r3u33n[1-2]
             60660  standard    test2 jeongpil  R 6-01:55:03      2 r3u32n[1-2]
             60659  standard    test1 jeongpil  R 6-01:55:12      2 r3u31n[1-2]
             60578  windfall  test2.5 jeongpil  R 7-06:35:05      2 r4u36n[1-2]
             60577  windfall  test2.4 jeongpil  R 7-06:35:23      2 r4u35n[1-2]
             60574  windfall  test2.3 jeongpil  R 7-06:41:04      2 r4u34n[1-2]
             60573  windfall  test2.2 jeongpil  R 7-06:41:10      2 r4u33n[1-2]
             60568  windfall  test2.1 jeongpil  R 7-06:41:53      2 r4u32n[1-2]
             60567  windfall  test2.0 jeongpil  R 7-06:41:59      2 r4u31n[1-2]
             60566  windfall  test1.9 jeongpil  R 7-06:42:59      2 r4u30n[1-2]
             60565  windfall  test1.8 jeongpil  R 7-06:43:24      2 r4u29n[1-2]
             60564  windfall  test1.7 jeongpil  R 7-06:43:46      2 r4u28n[1-2]
             60563  windfall  test1.6 jeongpil  R 7-06:44:23      2 r4u27n[1-2]
             60561  standard  test1.1 jeongpil  R 7-06:46:12      2 r4u26n[1-2]
             60560  standard    test9 jeongpil  R 7-06:46:17      2 r4u25n[1-2]
             60559  standard    test8 jeongpil  R 7-06:46:20      2 r4u18n[1-2]
             60556  standard    test5 jeongpil  R 7-06:46:47      2 r4u15n[1-2]
             60554  standard    test3 jeongpil  R 7-06:47:06      2 r4u13n[1-2]
             64417  standard benchmar josephlo  R    5:20:53      1 r1u12n2
             64427  standard Helicove benowitz  R    4:28:05      1 r5u19n1
             64436  windfall  n208835 jeongpil  R    3:58:57      1 r2u29n2
             64435  windfall  n208830 jeongpil  R    3:59:03      1 r2u29n1
             64434  windfall  n208805 jeongpil  R    3:59:06      1 r2u30n2
             64433  windfall  n208800 jeongpil  R    3:59:15      1 r2u30n1
             64432  windfall  n204835 jeongpil  R    4:01:03      1 r2u31n2
             64431  windfall  n204830 jeongpil  R    4:01:08      1 r2u31n1
             64430  windfall  n204805 jeongpil  R    4:01:13      1 r2u32n2
             64429  windfall  n204800 jeongpil  R    4:01:18      1 r2u32n1
             64414  standard Abyss-k5 natalier  R    5:29:34      1 r1u12n1
             64413  standard Abyss-k4 natalier  R    5:30:12      1 r1u11n2
             64412  standard Abyss-k4 natalier  R    5:30:46      1 r1u10n1
             64392  standard eng_memo  plovett  R    6:03:23      1 r1u07n2
             64387  standard eng_memo  plovett  R    6:11:32      1 r1u07n1
             64366  windfall  n216835 jeongpil  R    8:57:05      1 r2u27n2
             64365  windfall  n216830 jeongpil  R    8:57:11      1 r2u27n1
             64364  windfall  n216805 jeongpil  R    8:57:16      1 r2u28n2
             64363  windfall  n216800 jeongpil  R    8:57:22      1 r2u28n1
             64264  windfall    hyphy   denard  R   11:08:13      1 r1u09n2
             64225  windfall    hyphy   denard  R   11:08:16      1 r1u09n1
             64236  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64237  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64238  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64239  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64249  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64250  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64255  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64256  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64260  windfall    hyphy   denard  R   11:08:16      1 r1u09n2
             64183  windfall    hyphy   denard  R   11:08:19      1 r1u07n2
             64185  windfall    hyphy   denard  R   11:08:19      1 r1u09n1
             64189  windfall    hyphy   denard  R   11:08:19      1 r1u09n1
             64198  windfall    hyphy   denard  R   11:08:19      1 r1u09n1
             64202  windfall    hyphy   denard  R   11:08:19      1 r1u09n1
             64211  windfall    hyphy   denard  R   11:08:19      1 r1u09n1
             64213  windfall    hyphy   denard  R   11:08:19      1 r1u09n1
             64141  windfall    hyphy   denard  R   11:08:21      1 r1u03n1
             64142  windfall    hyphy   denard  R   11:08:21      1 r1u03n1
             64147  windfall    hyphy   denard  R   11:08:21      1 r1u03n1
             64153  windfall    hyphy   denard  R   11:08:21      1 r1u07n1
             64181  windfall    hyphy   denard  R   11:08:21      1 r1u07n2
             64113  windfall sn136835 jeongpil  R   11:16:12      1 r2u33n2
             64112  windfall sn136830 jeongpil  R   11:16:16      1 r2u33n1
             64111  windfall sn136805 jeongpil  R   11:16:20      1 r2u34n2
             64110  windfall sn136800 jeongpil  R   11:16:24      1 r2u34n1
             64109  windfall  n136835 jeongpil  R   11:17:44      1 r2u35n2
             64108  windfall  n136830 jeongpil  R   11:17:49      1 r2u35n1
             64107  windfall  n136805 jeongpil  R   11:17:53      1 r2u36n2
             64106  windfall  n136800 jeongpil  R   11:17:56      1 r2u36n1
           63990_0  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u03n2
           63990_1  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_2  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_3  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_4  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_5  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_6  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_7  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n1
           63990_8  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
           63990_9  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_10  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_11  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_12  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_13  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_14  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_15  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_16  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_17  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_18  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_19  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_20  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u07n2
          63990_21  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_22  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_23  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_24  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_25  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_26  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_27  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_28  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n1
          63990_29  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_30  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_31  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_32  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_33  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_34  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_35  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_36  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_37  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_38  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_39  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_40  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_41  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_42  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_43  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_44  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_45  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_46  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_47  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_48  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_49  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_50  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_51  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_52  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_53  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_54  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_55  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_56  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_57  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_58  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_59  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_60  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_61  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_62  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_63  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_64  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_65  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_66  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_67  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_68  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_69  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_70  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_71  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_72  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_73  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_74  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_75  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_76  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_77  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u08n2
          63990_78  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_79  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_80  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_81  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_82  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_83  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_84  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_85  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_86  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_87  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_88  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
          63990_89  standard  SEDNoIR rehvidin  R   11:57:39      1 r1u09n1
             63160  windfall fv_s_tes lauterbu  R   12:31:08      1 r1u07n2
             63161  windfall fv_s_200 lauterbu  R   12:31:08      1 r1u08n1
             63145  windfall    hyphy   denard  R   13:12:38      1 r1u07n1
             63109  windfall    hyphy   denard  R   13:12:41      1 r1u07n1
             63114  windfall    hyphy   denard  R   13:12:41      1 r1u07n1
             63130  windfall    hyphy   denard  R   13:12:41      1 r1u07n1
             63086  windfall    hyphy   denard  R   13:12:43      1 r1u07n1
             63078  windfall    hyphy   denard  R   13:12:44      1 r1u07n1
             63079  windfall    hyphy   denard  R   13:12:44      1 r1u07n1
             63070  windfall    hyphy   denard  R   13:13:02      1 r1u03n2
             63010  windfall    hyphy   denard  R   13:13:05      1 r1u03n1
             63019  windfall    hyphy   denard  R   13:13:05      1 r1u03n1
             63038  windfall    hyphy   denard  R   13:13:05      1 r1u03n2
             63040  windfall    hyphy   denard  R   13:13:05      1 r1u03n2
           62980_1  standard slurm-su fahclien  R   14:32:02      1 r1u04n1
           62980_2  standard slurm-su fahclien  R   14:32:02      1 r1u04n2
           62980_3  standard slurm-su fahclien  R   14:32:02      1 r1u05n1
           62980_4  standard slurm-su fahclien  R   14:32:02      1 r1u05n2
           62980_5  standard slurm-su fahclien  R   14:32:02      1 r1u06n1
           62980_6  standard slurm-su fahclien  R   14:32:02      1 r1u06n2
             62854  windfall    hyphy   denard  R   15:45:56      1 r1u03n1
             62863  windfall    hyphy   denard  R   15:45:56      1 r1u03n1
             62838  windfall    hyphy   denard  R   15:45:59      1 r1u03n1
             62846  windfall    hyphy   denard  R   15:45:59      1 r1u03n1
             62709  windfall tar_back   denard  R   17:45:59      1 r1u10n2
             62712  windfall database emsenhub  R   17:45:59      1 r1u10n2
             62682  windfall    hyphy   denard  R   18:45:23      1 r1u25n1
             62641  windfall    hyphy   denard  R   18:45:26      1 r1u18n2
             62644  windfall    hyphy   denard  R   18:45:26      1 r1u18n2
             62647  windfall    hyphy   denard  R   18:45:26      1 r1u18n2
             62537  windfall    hyphy   denard  R   18:47:16      1 r1u18n1
             62564  windfall    hyphy   denard  R   18:47:16      1 r1u18n1
             62569  windfall    hyphy   denard  R   18:47:16      1 r1u18n1
             62476  windfall    hyphy   denard  R   18:47:22      1 r1u17n1
             62487  windfall    hyphy   denard  R   18:47:22      1 r1u17n1
             62414  windfall    hyphy   denard  R   18:47:25      1 r1u16n1
             62437  windfall    hyphy   denard  R   18:47:25      1 r1u17n1
             62373  windfall    hyphy   denard  R   18:47:28      1 r1u10n2
             62366  windfall    hyphy   denard  R   18:47:30      1 r1u10n2
             62276  windfall    hyphy   denard  R   18:47:36      1 r1u08n1
             62277  windfall    hyphy   denard  R   18:47:36      1 r1u08n1
             62306  windfall    hyphy   denard  R   18:47:36      1 r1u09n2
             62211  windfall    hyphy   denard  R   18:47:43      1 r1u07n2
             60580  standard lstmAnal dmschwar  R 7-01:27:44      1 r5u25n1
Comment 3 Todd Merritt 2020-08-18 06:36:39 MDT
Created attachment 15481 [details]
slurmctld log
Comment 4 Todd Merritt 2020-08-18 06:37:13 MDT
Created attachment 15482 [details]
slurm config
Comment 6 Felip Moll 2020-08-18 10:08:18 MDT
Unfortunately your debug level of 'info' (the default) doesn't show me anything relevant.

You should also fix several errors that appear in your logs:

- "Invalid argument": are the related nodes running an old version?
- low real_memory size (483605 < 515830): you need to adjust the configured memory for that node

Also, why do you have DebugFlags=NO_CONF_HASH set?

Besides that, I'd need you to run with DebugFlags=Backfill and SlurmctldDebug=debug (debug2 would be ideal for more detail, if it doesn't impact your performance).

After increasing the log level, do you think it would be possible to reproduce the issue by running a job with the exact parameters as the one that caused it?

If so, I'd need then the sdiag while the queue is stuck and the slurmctld log.
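For what it's worth, these levels can also be changed at runtime without editing slurm.conf and restarting; a sketch, assuming the scontrol setdebug/setdebugflags subcommands available in 19.05:

```shell
# Raise logging while reproducing, then drop it back afterwards.
scontrol setdebugflags +backfill
scontrol setdebug debug        # or: scontrol setdebug debug2

# ...reproduce the stuck queue, grab sdiag and the slurmctld log...

scontrol setdebug info
scontrol setdebugflags -backfill
```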

Thanks
Comment 7 Todd Merritt 2020-08-18 10:52:43 MDT
Hi Felip,

Yes, I cleaned up most of those errors that I saw when investigating this. We had a node that apparently lost a dimm and I took it offline. The invalid argument errors from the r5 nodes were related to my colleague adding them to slurm with an incorrect number of GPUs. I re-imaged them and they're registered correctly now.

When we initially started this deployment, I had added DebugFlags=NO_CONF_HASH in hopes that I could keep a minimal configuration on the job submission nodes, but Slurm has disabused me of that notion and the configurations are now fully synchronized through a horrible manual process. I'm looking forward to the configuration distribution feature that I saw in v20 once we're able to upgrade :)

I think I can reproduce it; I'm pretty sure the culprit was that the job requested 47T of RAM. I can try modifying those parameters and submitting a test to see if it blocks again. Are there any other parameters you'd suggest setting to help us debug scheduling issues in the future?

Thanks!
Comment 8 Todd Merritt 2020-08-18 12:01:01 MDT
Hrm, well, I set the scheduler parameters and tried to duplicate this job by submitting something with

#SBATCH --ntasks=94
#SBATCH --nodes=1
###SBATCH --mem-per-cpu=1GB
#SBATCH --mem=47000G
#SBATCH --time=00:10:00
#SBATCH --job-name=slurm-standard-test
#SBATCH --account=tmerritt
#SBATCH --partition=standard
#SBATCH --output=slurm-standard-test.out

but I get 

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

That's what I would have expected the user to get from their job as well. Is there something that I'm missing that might allow this job to have been submitted? Thanks!
Comment 9 Felip Moll 2020-08-19 08:56:44 MDT
I can submit a job on my small test system that looks mostly like your first one:

JobId=6552 JobName=wrap
   UserId=lipi(1000) GroupId=lipi(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=lipi QOS=part_qos_standard WCKey=*
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-08-19T16:41:31 EligibleTime=2020-08-19T16:41:32
   AccrueTime=2020-08-19T16:41:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-19T16:42:39
   Partition=debug AllocNode:Sid=llagosti:15926
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=94 NumTasks=94 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=94,mem=47000G,node=1,billing=94
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/lipi/slurm/20.02
   StdErr=/home/lipi/slurm/20.02/slurm-6552.out
   StdIn=/dev/null
   StdOut=/home/lipi/slurm/20.02/slurm-6552.out
   Power=
   MailUser=(null) MailType=NONE

I see:

slurmctld: _build_node_list: No nodes satisfy JobId=6552 requirements in partition debug
slurmctld: sched: schedule: JobId=6552 non-runnable: Requested node configuration is not available

This is quite easy to reproduce: submit a job that exceeds a QoS limit and is not scheduled to run immediately (e.g. --begin=now+1). The job gets into the system with Reason=QOSMaxMemoryPerJob. Then you can update the job with inconsistent changes, like increasing the memory or the number of nodes. The difference from your case is that the Reason is then set to BadConstraints.

I think the numbers you are seeing are just a cosmetic issue: because the job is submitted for a later time, it is not evaluated immediately. This can also happen for jobs that are PD and then updated inconsistently. Slurm lets you update a job without checking that the new parameters are consistent with the configuration; it waits for the scheduler to evaluate the job and then sets the Reason appropriately.

In my tests other jobs keep running through the queue as usual, but you said that after cancelling the bad job, the other PD jobs immediately moved to R. I am not seeing that; I will come back if I find a similar situation.

In the meantime, could you try to create a job like in my example and let me know what happens on your system?
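Putting the recipe together as one sketch (QoS name and sizes taken from this ticket; <jobid> is a placeholder to fill in):

```shell
# 1) Submit a job that exceeds the part_qos_standard limit (mem=8064G)
#    but only becomes eligible later, so it sits pending with
#    Reason=QOSMaxMemoryPerJob instead of being rejected outright.
sbatch --begin=now+600 --qos=part_qos_standard -N1 -n94 \
       --mem-per-cpu=500G --wrap "srun hostname"

# 2) While it is pending, make an inconsistent update; on the next
#    scheduler pass the Reason should change to BadConstraints.
scontrol update jobid=<jobid> minmemorycpu=512000
```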

Thanks
Comment 10 Todd Merritt 2020-08-19 09:17:52 MDT
Thanks Felip,

I was able to submit with a future begin time, but I get an error when I try to modify the job:

tmerritt@junonia:~/puma $ scontrol update jobid=66473 mem=47000G
Update of this parameter is not supported: mem=47000G
Request aborted
tmerritt@junonia:~/puma $ scontrol update jobid=66473 tres=mem=47000G
Update of this parameter is not supported: tres=mem=47000G
Request aborted

Perhaps I'm just doing it wrong?
Comment 11 Felip Moll 2020-08-19 10:52:47 MDT
(In reply to Todd Merritt from comment #10)
> tmerritt@junonia:~/puma $ scontrol update jobid=66473 mem=47000G
> Update of this parameter is not supported: mem=47000G
> Request aborted
> 
> Perhaps I'm just doing it wrong?

Your part_qos_standard max mem per job is 8064G; is this intended?

part_qos_standard|3|00:00:00|part_qos_windfall,user_qos_idlecycles||cluster|||1.000000|||||||mem=8064G|||10-00:00:00|||1000|||||

]$ sbatch -N1 --qos=part_qos_standard --mem-per-cpu=8065G --wrap "srun hostname"  (exceeds the QoS max mem per job)

Just try these combinations:

scontrol update jobid=xxx starttime=now+1
scontrol update jobid=xxx MinMemoryCPU=512000
scontrol update jobid=xxx numcpus=94

Other keywords:
NumTasks
NumNodes
MinCPUSNode


For the accepted keywords, see scontrol_update_job() in slurm/src/scontrol/update_job.c.
Comment 12 Todd Merritt 2020-08-19 14:47:30 MDT
Thanks, I was able to modify the job and get it to list the reason as BadConstraints, as you indicated. I'll leave it in the queue for a bit and see if jobs start backing up behind it.
Comment 13 Todd Merritt 2020-08-20 12:10:03 MDT
Hi, I haven't seen this pop up again. You can close this ticket out, and I'll open a new one with better logs and an sdiag if it happens again.

Thanks,
Todd