Ticket 6687

Summary: Scheduling / backfill isn't working properly
Product: Slurm Reporter: Martin Forde <mforde84>
Component: Scheduling Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: slurm.conf

Description Martin Forde 2019-03-13 10:07:38 MDT
Created attachment 9559 [details]
slurm.conf

We have a handful of jobs listed as PENDING even though there appear to be sufficient resources available to run them. I think the scheduler is either allocating nodes exclusively, or backfill is being blocked by priority and time limits. Can you guys help me understand why these pending jobs aren't being scheduled?


[root@gosset ~]# utilization #wrapper to determine current running slurm allocations
node		cpu	load	free_mem	date
storage01	1/20	0.07	83436	2019-03-13-11:05
storage02	0/20	0.03	251103	2019-03-13-11:05
storage03	0/20	0.01	311109	2019-03-13-11:05
storage04	0/20	0.03	150431	2019-03-13-11:05
storage05	0/20	0.04	307888	2019-03-13-11:05
storage06	0/20	0.01	358999	2019-03-13-11:05
student01	1/8	0.01	409374	2019-03-13-11:05
student02	0/24	0.45	246936	2019-03-13-11:05
student03	0/32	0.01	403093	2019-03-13-11:05
student04	0/32	0.08	404811	2019-03-13-11:05
student05	0/32	0.10	124610	2019-03-13-11:05
student06	0/32	0.03	188806	2019-03-13-11:05
student07	32/32	0.01	471833	2019-03-13-11:05
student08	11/32	0.01	352936	2019-03-13-11:05
student09	0/20	0.01	15061	2019-03-13-11:05
student10	2/20	1.99	22161	2019-03-13-11:05
student11	2/20	2.58	47538	2019-03-13-11:05
student12	2/20	2.59	90510	2019-03-13-11:05
student13	1/20	1.59	157183	2019-03-13-11:05
student14	1/20	1.53	170368	2019-03-13-11:05
student15	9/20	2.66	101032	2019-03-13-11:05
student16	4/20	4.49	7566	2019-03-13-11:05
student17	1/20	1.62	172203	2019-03-13-11:05
student18	1/20	1.55	17158	2019-03-13-11:05
student19	1/20	1.66	126596	2019-03-13-11:05
student20	1/20	1.48	112036	2019-03-13-11:05
student21	1/20	1.52	185606	2019-03-13-11:05
student22	1/20	1.53	122280	2019-03-13-11:05
student23	1/20	1.52	133716	2019-03-13-11:05
student24	1/20	1.40	122611	2019-03-13-11:05
student25	1/20	1.49	180984	2019-03-13-11:05
student26	1/20	1.71	124888	2019-03-13-11:05
student27	1/20	1.58	134455	2019-03-13-11:05
student28	1/20	1.55	140490	2019-03-13-11:05
student29	1/20	1.51	153481	2019-03-13-11:05
student30	1/20	1.62	155868	2019-03-13-11:05
student31	1/20	1.52	142458	2019-03-13-11:05
student32	1/20	1.51	167212	2019-03-13-11:05
student33	1/20	1.68	175362	2019-03-13-11:05
student34	1/20	1.56	17857	2019-03-13-11:05
student35	1/20	1.41	126248	2019-03-13-11:05
student36	0/20	0.33	247941	2019-03-13-11:05
student37	4/20	2.46	22364	2019-03-13-11:05
student38	9/20	10.74	62300	2019-03-13-11:05
student39	0/20	0.01	4677	2019-03-13-11:05
student40	11/20	10.90	16355	2019-03-13-11:05
student41	4/20	4.66	97747	2019-03-13-11:05
student42	4/20	4.63	147238	2019-03-13-11:05
student43	0/20	0.01	11593	2019-03-13-11:05
student44	0/20	3.27	22301	2019-03-13-11:05


[root@gosset ~]# squeue --format "%.18i %.9P %10S %.7Q %.10l %.2t %.10M %.6D %.4C %R"
             JOBID PARTITION START_TIME PRIORIT TIME_LIMIT ST       TIME  NODES CPUS NODELIST(REASON)
            285843     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Resources)
            285943     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285844     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285944     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285845     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285846     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285946     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285847     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285947     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285848     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285948     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285849     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285850     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285851     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285852     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285853     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285854     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285855     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285856     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285857     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285858     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285859     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285860     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285861     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285862     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285863     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285864     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285865     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285965     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285866     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285966     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285867     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285967     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285868     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285968     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285869     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285870     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285970     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285871     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285872     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285972     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285873     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285973     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285874     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285875     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285876     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285976     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285877     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285977     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285878     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285978     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285879     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285979     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285880     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285881     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285981     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285882     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285982     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285883     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285983     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285884     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285984     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            285885     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285886     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285887     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285888     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285889     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285890     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285891     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285892     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285893     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285894     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285895     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285896     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            285897     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1    1 (Priority)
            286066     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286067     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286068     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286071     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286072     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286073     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286075     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286079     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286080     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286081     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286085     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286094     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286095     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286096     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286659     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286660     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286688     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   20 (Priority)
            286711     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   20 (Priority)
            286712     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   20 (Priority)
            285942     nodes 2019-03-17 4294719 5-00:00:00 PD       0:00      1   16 (Priority)
            286030     nodes 2019-03-17 4294719    5:00:00 PD       0:00      1   20 (Priority)
            188147    bigmem 2019-02-12 4294805  UNLIMITED  R 29-03:17:30      1   32 student07
            252611    bigmem 2019-03-04 4294751 365-00:00:00  R 8-19:12:12      1    1 storage01
            271883    bigmem 2019-03-07 4294732 365-00:00:00  R 5-22:50:13      1   10 student08
            275085     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:48:39      1    1 student37
            275088     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student40
            275089     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student40
            275090     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student41
            275091     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student41
            275092     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student41
            275093     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student42
            275094     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:46:09      1    1 student42
            275095     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student42
            275096     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student16
            275097     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student16
            275098     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student16
            275101     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student11
            275102     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student10
            275103     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student10
            275105     nodes 2019-03-08 4294729 10-00:00:00  R 4-22:45:39      1    1 student12
            285802    bigmem 2019-03-12 4294720  UNLIMITED  R 1-00:25:10      1    1 student08
            285813     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student37
            285814     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student16
            285815     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student18
            285816     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student34
            285817     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student11
            285818     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student12
            285819     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student41
            285820     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student42
            285821     nodes 2019-03-12 4294719 5-00:00:00  R 1-00:09:02      1    1 student40
            285822     nodes 2019-03-12 4294719 5-00:00:00  R   20:36:29      1    1 student13
            285823     nodes 2019-03-12 4294719 5-00:00:00  R   20:36:05      1    1 student31
            285824     nodes 2019-03-12 4294719 5-00:00:00  R   20:36:03      1    1 student20
            285825     nodes 2019-03-12 4294719 5-00:00:00  R   20:34:55      1    1 student15
            285826     nodes 2019-03-12 4294719 5-00:00:00  R   20:33:44      1    1 student28
            285827     nodes 2019-03-12 4294719 5-00:00:00  R   20:32:50      1    1 student19
            285828     nodes 2019-03-12 4294719 5-00:00:00  R   20:32:16      1    1 student29
            285829     nodes 2019-03-12 4294719 5-00:00:00  R   20:29:08      1    1 student35
            285830     nodes 2019-03-12 4294719 5-00:00:00  R   20:28:06      1    1 student38
            285831     nodes 2019-03-12 4294719 5-00:00:00  R   20:15:44      1    1 student26
            285832     nodes 2019-03-12 4294719 5-00:00:00  R   20:13:34      1    1 student24
            285833     nodes 2019-03-12 4294719 5-00:00:00  R   20:11:04      1    1 student22
            285834     nodes 2019-03-12 4294719 5-00:00:00  R   20:04:28      1    1 student23
            285835     nodes 2019-03-13 4294719 5-00:00:00  R   10:20:14      1    1 student30
            285836     nodes 2019-03-13 4294719 5-00:00:00  R   10:16:38      1    1 student27
            285837     nodes 2019-03-13 4294719 5-00:00:00  R    9:34:04      1    1 student17
            285838     nodes 2019-03-13 4294719 5-00:00:00  R    8:26:26      1    1 student14
            285839     nodes 2019-03-13 4294719 5-00:00:00  R    8:06:57      1    1 student33
            285840     nodes 2019-03-13 4294719 5-00:00:00  R    7:54:44      1    1 student21
            285841     nodes 2019-03-13 4294719 5-00:00:00  R    7:50:39      1    1 student32
            285842     nodes 2019-03-13 4294719 5-00:00:00  R    7:47:37      1    1 student25
            286097     nodes 2019-03-12 4294719 5-00:00:00  R   19:43:10      1    8 student15
            286663    bigmem 2019-03-13 4294719  UNLIMITED  R    2:23:23      1    1 student01
            286665     nodes 2019-03-13 4294719 5-00:00:00  R    1:43:33      1    1 student37
            286762     nodes 2019-03-13 4294719 5-00:00:00  R      41:58      1    8 student38
            286763     nodes 2019-03-13 4294719 5-00:00:00  R      41:58      1    1 student37
            286764     nodes 2019-03-13 4294719 5-00:00:00  R      12:56      1    8 student40
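
The claim that idle CPUs exist for the pending 16-CPU jobs can be checked against the utilization listing with a short script. The following is a minimal sketch, not part of the original report: the helper names `idle_cpus` and `nodes_that_fit` are hypothetical, the sample rows are a handful copied from the listing above, and the check looks at CPU counts only. It ignores memory requests, partition membership (defined in the attached slurm.conf, which is not shown here), and any other constraint the scheduler applies, so a node passing this check may still be ineligible.

```python
# Rows copied from the `utilization` listing above: node name, allocated/total CPUs.
SAMPLE = """\
storage02 0/20
student03 0/32
student07 32/32
student08 11/32
student15 9/20
student40 11/20
"""


def idle_cpus(listing):
    """Return {node: idle CPU count} parsed from 'node alloc/total' rows."""
    idle = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # skip header rows and blank lines
        alloc, total = (int(x) for x in fields[1].split("/"))
        idle[fields[0]] = total - alloc
    return idle


def nodes_that_fit(idle, cpus_needed):
    """Nodes whose idle CPU count alone would cover a single-node request."""
    return sorted(node for node, free in idle.items() if free >= cpus_needed)


idle = idle_cpus(SAMPLE)
# CPU count only: memory and partition limits are NOT considered here.
print(nodes_that_fit(idle, 16))
```

A node appearing in the result only shows that the CPU count is not the blocker; the reason a job stays pending (for example the `Resources` or `Priority` reason in the squeue output) has to be investigated separately.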
Comment 1 Jacob Jenson 2019-03-13 10:23:39 MDT
Martin,

These types of requests are typically handled by the SchedMD support engineers. However, before the engineers can engage, we need to match this request to an existing Slurm support contract. Can you please tell me which site/company/university this request pertains to?

Thanks,
Jacob
Comment 2 Martin Forde 2019-03-13 18:20:03 MDT
Sorry, I don't have an account. What's the cost for a rogue engineer like
myself? I'm just a poor boy from a poor family. I have a few questions here
and there about implementation of features; I don't need anything like 24/7
immediate response from an architect. Mostly "hey, Torque does this, how do
I do something similar with Slurm?" type stuff. Christ, I'll
personally even pay out of pocket for live support on an hourly basis, but
it's gotta be at a reasonable market value for the time.


Figured out the issue, so you can close the ticket either way.


Thanks
M