| Summary: | Scheduling / backfill isn't working properly | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Martin Forde <mforde84> |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 18.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | ||
Martin,

These types of requests are typically handled by the SchedMD support engineers. However, before the engineers can engage, we need to match this request to an existing Slurm support contract. Can you please tell me which site/company/university this request pertains to?

Thanks,
Jacob

Sorry, don't have an account. What's the cost for a rogue engineer like myself? I'm just a poor boy from a poor family. I have a few questions here and there about implementation of features; I don't need anything like 24/7 immediate response from an architect. Mostly stuff like, "hey, Torque does this, how do I do something similar with Slurm?" type stuff. Christ, I'll personally even pay out of pocket for live support on an hourly basis, but it's gotta be at a reasonable market value for the time.

Figured out the issue, so you can close the ticket either way.

Thanks
M
Created attachment 9559 slurm.conf

We have a handful of jobs which are listed as PENDING, though there are sufficient resources available to run them. I think scheduling is either acting exclusively for the nodes, or there is some issue with backfilling due to priority and time limits. Can you guys help me understand why these pending jobs aren't being scheduled?

[root@gosset ~]# utilization #wrapper to determine current running slurm allocations
node cpu load free_mem date
storage01 1/20 0.07 83436 2019-03-13-11:05
storage02 0/20 0.03 251103 2019-03-13-11:05
storage03 0/20 0.01 311109 2019-03-13-11:05
storage04 0/20 0.03 150431 2019-03-13-11:05
storage05 0/20 0.04 307888 2019-03-13-11:05
storage06 0/20 0.01 358999 2019-03-13-11:05
student01 1/8 0.01 409374 2019-03-13-11:05
student02 0/24 0.45 246936 2019-03-13-11:05
student03 0/32 0.01 403093 2019-03-13-11:05
student04 0/32 0.08 404811 2019-03-13-11:05
student05 0/32 0.10 124610 2019-03-13-11:05
student06 0/32 0.03 188806 2019-03-13-11:05
student07 32/32 0.01 471833 2019-03-13-11:05
student08 11/32 0.01 352936 2019-03-13-11:05
student09 0/20 0.01 15061 2019-03-13-11:05
student10 2/20 1.99 22161 2019-03-13-11:05
student11 2/20 2.58 47538 2019-03-13-11:05
student12 2/20 2.59 90510 2019-03-13-11:05
student13 1/20 1.59 157183 2019-03-13-11:05
student14 1/20 1.53 170368 2019-03-13-11:05
student15 9/20 2.66 101032 2019-03-13-11:05
student16 4/20 4.49 7566 2019-03-13-11:05
student17 1/20 1.62 172203 2019-03-13-11:05
student18 1/20 1.55 17158 2019-03-13-11:05
student19 1/20 1.66 126596 2019-03-13-11:05
student20 1/20 1.48 112036 2019-03-13-11:05
student21 1/20 1.52 185606 2019-03-13-11:05
student22 1/20 1.53 122280 2019-03-13-11:05
student23 1/20 1.52 133716 2019-03-13-11:05
student24 1/20 1.40 122611 2019-03-13-11:05
student25 1/20 1.49 180984 2019-03-13-11:05
student26 1/20 1.71 124888 2019-03-13-11:05
student27 1/20 1.58 134455 2019-03-13-11:05
student28 1/20 1.55 140490 2019-03-13-11:05
student29 1/20 1.51 153481 2019-03-13-11:05
student30 1/20 1.62 155868 2019-03-13-11:05
student31 1/20 1.52 142458 2019-03-13-11:05
student32 1/20 1.51 167212 2019-03-13-11:05
student33 1/20 1.68 175362 2019-03-13-11:05
student34 1/20 1.56 17857 2019-03-13-11:05
student35 1/20 1.41 126248 2019-03-13-11:05
student36 0/20 0.33 247941 2019-03-13-11:05
student37 4/20 2.46 22364 2019-03-13-11:05
student38 9/20 10.74 62300 2019-03-13-11:05
student39 0/20 0.01 4677 2019-03-13-11:05
student40 11/20 10.90 16355 2019-03-13-11:05
student41 4/20 4.66 97747 2019-03-13-11:05
student42 4/20 4.63 147238 2019-03-13-11:05
student43 0/20 0.01 11593 2019-03-13-11:05
student44 0/20 3.27 22301 2019-03-13-11:05

[root@gosset ~]# squeue --format "%.18i %.9P %10S %.7Q %.10l %.2t %.10M %.6D %.4C %R"
JOBID PARTITION START_TIME PRIORIT TIME_LIMIT ST TIME NODES CPUS NODELIST(REASON)
285843 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Resources)
285943 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285844 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285944 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285845 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285846 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285946 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285847 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285947 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285848 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285948 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285849 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285850 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285851 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285852 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285853 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285854 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285855 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285856 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285857 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285858 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285859 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285860 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285861 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285862 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285863 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285864 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285865 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285965 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285866 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285966 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285867 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285967 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285868 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285968 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285869 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285870 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285970 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285871 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285872 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285972 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285873 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285973 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285874 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285875 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285876 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285976 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285877 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285977 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285878 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285978 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285879 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285979 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285880 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285881 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285981 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285882 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285982 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285883 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285983 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285884 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285984 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
285885 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285886 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285887 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285888 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285889 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285890 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285891 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285892 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285893 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285894 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285895 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285896 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
285897 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Priority)
286066 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286067 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286068 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286071 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286072 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286073 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286075 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286079 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286080 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286081 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286085 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286094 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286095 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286096 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286659 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286660 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286688 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 20 (Priority)
286711 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 20 (Priority)
286712 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 20 (Priority)
285942 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286030 nodes 2019-03-17 4294719 5:00:00 PD 0:00 1 20 (Priority)
188147 bigmem 2019-02-12 4294805 UNLIMITED R 29-03:17:30 1 32 student07
252611 bigmem 2019-03-04 4294751 365-00:00:00 R 8-19:12:12 1 1 storage01
271883 bigmem 2019-03-07 4294732 365-00:00:00 R 5-22:50:13 1 10 student08
275085 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:48:39 1 1 student37
275088 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student40
275089 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student40
275090 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student41
275091 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student41
275092 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student41
275093 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student42
275094 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:46:09 1 1 student42
275095 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student42
275096 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student16
275097 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student16
275098 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student16
275101 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student11
275102 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student10
275103 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student10
275105 nodes 2019-03-08 4294729 10-00:00:00 R 4-22:45:39 1 1 student12
285802 bigmem 2019-03-12 4294720 UNLIMITED R 1-00:25:10 1 1 student08
285813 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student37
285814 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student16
285815 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student18
285816 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student34
285817 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student11
285818 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student12
285819 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student41
285820 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student42
285821 nodes 2019-03-12 4294719 5-00:00:00 R 1-00:09:02 1 1 student40
285822 nodes 2019-03-12 4294719 5-00:00:00 R 20:36:29 1 1 student13
285823 nodes 2019-03-12 4294719 5-00:00:00 R 20:36:05 1 1 student31
285824 nodes 2019-03-12 4294719 5-00:00:00 R 20:36:03 1 1 student20
285825 nodes 2019-03-12 4294719 5-00:00:00 R 20:34:55 1 1 student15
285826 nodes 2019-03-12 4294719 5-00:00:00 R 20:33:44 1 1 student28
285827 nodes 2019-03-12 4294719 5-00:00:00 R 20:32:50 1 1 student19
285828 nodes 2019-03-12 4294719 5-00:00:00 R 20:32:16 1 1 student29
285829 nodes 2019-03-12 4294719 5-00:00:00 R 20:29:08 1 1 student35
285830 nodes 2019-03-12 4294719 5-00:00:00 R 20:28:06 1 1 student38
285831 nodes 2019-03-12 4294719 5-00:00:00 R 20:15:44 1 1 student26
285832 nodes 2019-03-12 4294719 5-00:00:00 R 20:13:34 1 1 student24
285833 nodes 2019-03-12 4294719 5-00:00:00 R 20:11:04 1 1 student22
285834 nodes 2019-03-12 4294719 5-00:00:00 R 20:04:28 1 1 student23
285835 nodes 2019-03-13 4294719 5-00:00:00 R 10:20:14 1 1 student30
285836 nodes 2019-03-13 4294719 5-00:00:00 R 10:16:38 1 1 student27
285837 nodes 2019-03-13 4294719 5-00:00:00 R 9:34:04 1 1 student17
285838 nodes 2019-03-13 4294719 5-00:00:00 R 8:26:26 1 1 student14
285839 nodes 2019-03-13 4294719 5-00:00:00 R 8:06:57 1 1 student33
285840 nodes 2019-03-13 4294719 5-00:00:00 R 7:54:44 1 1 student21
285841 nodes 2019-03-13 4294719 5-00:00:00 R 7:50:39 1 1 student32
285842 nodes 2019-03-13 4294719 5-00:00:00 R 7:47:37 1 1 student25
286097 nodes 2019-03-12 4294719 5-00:00:00 R 19:43:10 1 8 student15
286663 bigmem 2019-03-13 4294719 UNLIMITED R 2:23:23 1 1 student01
286665 nodes 2019-03-13 4294719 5-00:00:00 R 1:43:33 1 1 student37
286762 nodes 2019-03-13 4294719 5-00:00:00 R 41:58 1 8 student38
286763 nodes 2019-03-13 4294719 5-00:00:00 R 41:58 1 1 student37
286764 nodes 2019-03-13 4294719 5-00:00:00 R 12:56 1 8 student40
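
The claim that "there are sufficient resources available" can be checked mechanically by cross-referencing the two listings. The following is a minimal Python sketch, not part of the original report: it tallies pending jobs by requested CPU count and reason, then lists nodes whose idle CPU count would cover each request. The function names (`idle_cpus`, `pending_requests`, `nodes_that_fit`) are hypothetical, the sample rows are a small subset copied from the output above, and the check looks at CPU counts only. It ignores memory, partition membership, time limits, and any settings in slurm.conf, any of which could be the real reason a job stays pending.

```python
from collections import Counter

# A few rows copied from the two listings above; the full output parses the same way.
UTILIZATION = """\
node cpu load free_mem date
student02 0/24 0.45 246936 2019-03-13-11:05
student07 32/32 0.01 471833 2019-03-13-11:05
student15 9/20 2.66 101032 2019-03-13-11:05
student36 0/20 0.33 247941 2019-03-13-11:05
"""

SQUEUE = """\
JOBID PARTITION START_TIME PRIORIT TIME_LIMIT ST TIME NODES CPUS NODELIST(REASON)
285843 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 1 (Resources)
285943 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 16 (Priority)
286688 nodes 2019-03-17 4294719 5-00:00:00 PD 0:00 1 20 (Priority)
188147 bigmem 2019-02-12 4294805 UNLIMITED R 29-03:17:30 1 32 student07
"""


def idle_cpus(text):
    """Map node name -> idle CPU count from 'node used/total load free_mem date' rows."""
    idle = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 5 or "/" not in fields[1]:
            continue  # skip the header and anything malformed
        used, total = fields[1].split("/")
        idle[fields[0]] = int(total) - int(used)
    return idle


def pending_requests(text):
    """Count pending (ST == PD) jobs keyed by (requested CPUs, reason)."""
    counts = Counter()
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) == 10 and fields[5] == "PD":
            counts[(int(fields[8]), fields[9])] += 1
    return counts


def nodes_that_fit(idle, cpus):
    """Nodes with at least `cpus` idle CPUs (CPU count only; memory is ignored)."""
    return sorted(node for node, free in idle.items() if free >= cpus)


idle = idle_cpus(UTILIZATION)
for (cpus, reason), count in sorted(pending_requests(SQUEUE).items()):
    print(f"{count} pending job(s) wanting {cpus} CPU(s) {reason}: "
          f"candidate nodes by CPU count = {nodes_that_fit(idle, cpus)}")
```

If a node shows up as a candidate here but the job still waits, the constraint is something this sketch does not model, which narrows the search to memory requests, partition definitions, or scheduler parameters.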