| Summary: | Jobs in a lower priority queue get onto nodes before jobs in a higher priority queue | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | Scheduling | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian, da, hajek, jbjohnston |
| Version: | 14.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 14.11.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmctld log, config and some squeue output attached; scontrol show jobs 14121153; slurmctld.log | ||
Can you attach "scontrol show jobs" output as well? I see that a lot of the pending jobs are being held back because of dependencies. It's possible that the dependencies created a hole for the jobs to be backfilled into. I was able to recreate a similar situation.

Created attachment 2005 [details]
scontrol show jobs 14121153
This is scontrol for jobs 14121153 (user1).
All his jobs are waiting for Resources.
Another thing I notice is that the jobs that 'steal' the node always have lower JobIDs, meaning that they were submitted earlier? Maybe they were already scheduled for that node before the job in the higher priority queue was submitted?

(In reply to Akmal Madzlan from comment #3)
> Another thing I notice is the job that 'steal' the node always have a lower
> JobID, meaning that they are submitted earlier?
>
> Maybe they are already scheduled for that node before the job on the higher
> priority queue is submitted?

You have "SchedulerParameters=bf_continue" configured. What that means is the backfill scheduler builds a priority-ordered queue of pending jobs, then proceeds through the full queue (up to 50000 jobs in your case due to "bf_max_job_test=50000"), periodically releasing locks so that other events can take place, such as submitting new jobs. Depending upon your configuration, the backfill scheduler could take minutes to work its way through the full queue. If new higher-priority jobs become eligible during that interval, they will not be considered by the backfill scheduler, which will continue to work its way through the queue and start jobs.

Could you run the "sdiag" command and append its output? That will provide some information about how long the backfill scheduler runs for.

Here is the sdiag output. I didn't include the RPC stats because they are too long and I think you don't need them?
[root@pud22 ~]# sdiag
*******************************************************
sdiag output at Fri Jul 3 05:54:51 2015
Data since      Thu Jul 2 08:00:02 2015
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 2699
Jobs started:   16959
Jobs completed: 16870
Jobs canceled:  718
Jobs failed:    0

Main schedule statistics (microseconds):
 Last cycle:   99384
 Max cycle:    318869
 Total cycles: 7650
 Mean cycle:   80156
 Mean depth cycle:  1170
 Cycles per minute: 5
 Last queue length: 2425

Backfilling stats
 Total backfilled jobs (since last slurm start): 48341
 Total backfilled jobs (since last stats cycle start): 6962
 Total cycles: 1338
 Last cycle when: Fri Jul 3 05:53:55 2015
 Last cycle: 3247904657
 Max cycle:  3980170579
 Mean cycle: 1384795
 Last depth cycle: 2428
 Last depth cycle (try sched): 2305
 Depth Mean: 2396
 Depth Mean (try depth): 2287
 Last queue length: 2419
 Queue length mean: 2392

(In reply to Akmal Madzlan from comment #5)
> Here is sdiag output.
> I didnt include RPC stats because it is too long and I think you dont need
> it?

What you sent is exactly what I wanted to see. Here's the interesting bit; the cycle times are in microseconds:

> Backfilling stats
>  Last cycle: 3247904657
>  Max cycle:  3980170579
>  Mean cycle: 1384795

So you're averaging about 1.4 seconds, which is pretty reasonable, but the worst case was 66 minutes. That isn't 66 minutes of run time, but 2 seconds of running alternating with at least 0.5 seconds of sleeping so that other operations can happen. 66 minutes is certainly sufficient time for a lot of other events (jobs coming and going, etc.). In any case, I'm used to seeing times well under 1 minute in the worst case. You've only got a couple of thousand jobs, so it's difficult for me to imagine where all of that time is going. Does anything come to your mind?

(In reply to Moe Jette from comment #6)
> Does anything come to your mind?

I honestly have no idea.

This is another occurrence.
[root@pud22 ~]# squeue -aw clus[001-030]
JOBID         PARTITION NAME     USER   ST TIME    NODES NODELIST(REASON)
14099126_869  idle      pt2_030_ user99 R  1:28:02 1     clus025
14099126_868  idle      pt2_030_ user99 R  1:37:16 1     clus028
14099126_867  idle      pt2_030_ user99 R  1:46:17 1     clus022
14099126_866  idle      pt2_030_ user99 R  2:04:58 1     clus006
14099126_864  idle      pt2_030_ user99 R  2:18:55 1     clus003
14099126_863  idle      pt2_030_ user99 R  2:31:44 1     clus026
14099126_860  idle      pt2_030_ user99 R  2:49:28 1     clus030
14121153_1368 teamcott  221_Asim user1  R  8:15    1     clus007
14121153_1339 teamcott  221_Asim user1  R  30:21   1     clus008
14121153_1333 teamcott  221_Asim user1  R  33:36   1     clus029
14121153_1332 teamcott  221_Asim user1  R  33:44   1     clus015
14121153_1329 teamcott  221_Asim user1  R  34:44   1     clus004
14121153_1318 teamcott  221_Asim user1  R  41:05   1     clus010
14121153_1317 teamcott  221_Asim user1  R  44:05   1     clus027
14121153_1306 teamcott  221_Asim user1  R  56:48   1     clus002
14121153_1298 teamcott  221_Asim user1  R  1:11:53 1     clus005
14121153_1299 teamcott  221_Asim user1  R  1:11:53 1     clus024
14121153_1300 teamcott  221_Asim user1  R  1:11:53 1     clus001
14121153_1297 teamcott  221_Asim user1  R  1:12:27 1     clus019
14121153_1284 teamcott  221_Asim user1  R  1:18:33 1     clus018
14121153_1280 teamcott  221_Asim user1  R  1:21:33 1     clus023
14121153_1231 teamcott  221_Asim user1  R  1:39:16 1     clus016
14121153_1228 teamcott  221_Asim user1  R  1:41:37 1     clus012
14121153_1227 teamcott  221_Asim user1  R  1:43:37 1     clus013
14121153_1226 teamcott  221_Asim user1  R  1:44:37 1     clus014
14121153_1220 teamcott  221_Asim user1  R  1:46:37 1     clus020
14121153_1208 teamcott  221_Asim user1  R  1:51:56 1     clus009
14121153_1166 teamcott  221_Asim user1  R  2:04:10 1     clus011
14121153_990  teamcott  221_Asim user1  R  3:26:50 1     clus017

Upon further inspection, not all jobs that steal the node are allocated by backfill:

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_869
[2015-07-03T04:32:09.122] backfill: Started JobId=14099126_869 (14138523) on clus025
[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_868
[2015-07-03T04:22:55.672] sched: Allocate JobId=14099126_868 (14138406) NodeList=clus028 #CPUs=64
[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_867
[2015-07-03T04:13:54.698] sched: Allocate JobId=14099126_867 (14138333) NodeList=clus022 #CPUs=64
[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_866
[2015-07-03T03:55:13.466] backfill: Started JobId=14099126_866 (14138171) on clus006
[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_864
[2015-07-03T03:41:16.084] backfill: Started JobId=14099126_864 (14138067) on clus003
[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_863
[2015-07-03T03:28:27.864] backfill: Started JobId=14099126_863 (14137965) on clus026
[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_860
[2015-07-03T03:10:43.169] sched: Allocate JobId=14099126_860 (14137853) NodeList=clus030 #CPUs=64

What I would like to do is concentrate on getting the worst-case backfill scheduling cycle time down to something reasonable. Having it run with hour-old data is definitely going to cause scheduling anomalies. I'd like to add a new parameter that will exit the backfill scheduler loop after reaching some timeout, say 10 seconds, even if it hasn't tested all of the jobs, but that will not help you right now. Here are a couple of suggested additions/changes to the SchedulerParameters configuration parameter to help now:

bf_resolution=600 (increase from 300)
bf_max_job_part=100 (see slurm.conf for details; perhaps a different value would be better for your workload)
bf_max_job_test=1000 (see slurm.conf; perhaps decrease from 50000)

I would periodically run the sdiag command and look specifically at these items:

Backfilling stats
 Last cycle: 3247904657
 Max cycle:  3980170579
 Mean cycle: 1384795

You can use "sdiag --reset" as root to reset the counters.
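Applied to slurm.conf, those suggestions would amount to something like the line below. This is a sketch, not the site's actual configuration: it keeps bf_continue (which the thread says is already set) and shows only the backfill-related options discussed here; any other flags already present in SchedulerParameters should be merged in, not dropped.

```
# Suggested backfill tuning (merge with existing SchedulerParameters options)
SchedulerParameters=bf_continue,bf_resolution=600,bf_max_job_part=100,bf_max_job_test=1000
```

After changing slurm.conf, the new values take effect on an "scontrol reconfigure" or a slurmctld restart.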
I've made those scheduler parameter changes and this is our current sdiag output:

[root@pud22 ~]# sdiag
*******************************************************
sdiag output at Fri Jul 3 11:14:00 2015
Data since      Fri Jul 3 09:23:12 2015
*******************************************************
Server thread count: 4
Agent queue size:    0

Jobs submitted: 152
Jobs started:   1671
Jobs completed: 1642
Jobs canceled:  41
Jobs failed:    0

Main schedule statistics (microseconds):
 Last cycle:   41422
 Max cycle:    278514
 Total cycles: 651
 Mean cycle:   69400
 Mean depth cycle:  1095
 Cycles per minute: 5
 Last queue length: 2222

Backfilling stats
 Total backfilled jobs (since last slurm start): 662
 Total backfilled jobs (since last stats cycle start): 662
 Total cycles: 111
 Last cycle when: Fri Jul 3 11:13:49 2015
 Last cycle: 1701825
 Max cycle:  2513933656
 Mean cycle: 30731377
 Last depth cycle: 2222
 Last depth cycle (try sched): 617
 Depth Mean: 2131
 Depth Mean (try depth): 566
 Last queue length: 2222
 Queue length mean: 2128

Here is another occurrence, but this one is a bit different because it is within the same queue. Job 14143084 should not get onto the nodes until jobs 14121153 and 14144213 are done, am I right?
Both of them are waiting for resources:

[root@pud22 ~]# squeue -p teamcott -o "%20i %20P %10u %5t %10M %10Q"
JOBID                PARTITION            USER   ST TIME    PRIORITY
14144213_[265-288]   idle,teamcott        user   PD 0:00    500
14121153_[2149-2182, teamcott,teamcottGpu user1  PD 0:00    500
14143084_[379,381,38 teamcott,idle        user99 PD 0:00    100
14121153_2145        teamcott             user1  R  1:32    500
14121153_2138        teamcott             user1  R  10:32   500
14121153_2119        teamcott             user1  R  21:32   500
14121153_2115        teamcott             user1  R  22:19   500
14121153_2081        teamcott             user1  R  36:32   500
14121153_2082        teamcott             user1  R  36:32   500
14121153_2036        teamcott             user1  R  48:57   500
14121153_2035        teamcott             user1  R  49:12   500
14121153_1930        teamcott             user1  R  1:10:33 500
14121153_1926        teamcott             user1  R  1:12:53 500
14121153_1924        teamcott             user1  R  1:13:41 500
14121153_1894        teamcott             user1  R  1:32:13 500
14121153_1577        teamcott             user1  R  1:32:53 500
14121153_1887        teamcott             user1  R  1:35:33 500
14121153_1882        teamcott             user1  R  1:37:33 500
14121153_1876        teamcott             user1  R  1:40:33 500
14121153_1868        teamcott             user1  R  1:44:54 500
14121153_1390        teamcott             user1  R  1:55:14 500
14121153_1846        teamcott             user1  R  1:56:54 500
14121153_1834        teamcott             user1  R  2:02:55 500
14121153_1819        teamcott             user1  R  2:11:36 500
14121153_1818        teamcott             user1  R  2:12:16 500
14121153_1812        teamcott             user1  R  2:13:33 500
14121153_1792        teamcott             user1  R  2:20:17 500
14121153_1793        teamcott             user1  R  2:20:17 500
14143084_377         teamcott             user99 R  24:32   100
14143084_375         teamcott             user99 R  38:52   100
14143084_371         teamcott             user99 R  48:32   100
14121153_1783        teamcott             user1  R  49:32   100
14129605_42          teamcott,idle        user3  SE 0:00    0
14121153_2058        teamcott,teamcottGpu user1  SE 0:00    0
14121153_2065        teamcott,teamcottGpu user1  SE 0:00    0

I think those changes do improve things. I don't see any jobs from the lower priority queue getting onto nodes before jobs in the higher priority queue today.

I concentrated on getting the worst-case backfill scheduling cycle time down to something reasonable.
Having it run with hour-old data is definitely going to cause scheduling anomalies. What I've done is modify the code to treat bf_interval as a maximum run time for the backfill scheduler. In your case, that's 50 seconds, which includes the run time plus the sleep time (2 seconds of run time and 0.5 seconds of sleeping for other operations). The downside is that the backfill scheduler will go through many, but perhaps not all, of the pending jobs. The commit/patch with the change is here:
https://github.com/SchedMD/slurm/commit/7e9442207a61b66f7428f91ea97dbc68b53c881d.patch

We will almost certainly tag a new version of Slurm this week with bug fixes, so you might want to use that rather than the patch if you can wait an extra day or two.

I suspect that the really long times are related to your configuration parameter of "OverTimeLimit=UNLIMITED". I need to study the ramifications of that some more.

(In reply to Moe Jette from comment #12)
> I suspect that the really long times are related to your configuration
> parameter of "OverTimeLimit=UNLIMITED". I need to study the ramifications of
> that some more.

This is almost certainly the reason for the really long run times of the backfill scheduler and other scheduling anomalies. I've added logic so the backfill scheduler considers the OverTimeLimit and KillWait configuration parameters, but only for jobs that have exceeded their time limit. The commit is here:
https://github.com/SchedMD/slurm/commit/edfbabe6125336bbb526e3bd6fa12155e0b1bf9d.patch

This change should also improve the responsiveness of Slurm for you, at least when there are jobs which have exceeded their time limits.

Thanks Moe,
If I understand correctly, the first patch will make bf_interval be treated as OverTimeLimit? So if we use OverTimeLimit=UNLIMITED, will the second patch still work?

Akmal

(In reply to Akmal Madzlan from comment #14)
> Thanks Moe,
> If I understand correctly, the first patch will make bf_interval be treated
> as OverTimeLimit? So if we use OverTimeLimit=UNLIMITED, the second patch
> will still works?
>
> Akmal

No. The first patch will cause the backfill scheduler to stop its current cycle after running for bf_interval (50 seconds in your case). It will then sleep for bf_interval, build a new job queue, and start over.

The second patch fixes the real problem that caused the backfill scheduler to run for so long. The second patch addresses jobs that have exceeded their time limit. What happens with the current code (i.e. without the patch) is Slurm says the job will end in the next second, creates a reservation for it, and goes on to the next job. Since the backfill scheduler might run for longer than that, it messes up the planned schedule for pending jobs. What that second patch does is, for jobs that exceed their time limit, set the new estimated end time to the scheduled end time (based upon start time + time limit) plus OverTimeLimit plus KillWait. With OverTimeLimit=UNLIMITED, that will push the expected end time out beyond anything the backfill scheduler will reserve resources for. It would probably help if you have a finite value for OverTimeLimit, but this seems like the best option to me and will prevent the backfill scheduler running for so long without making real progress.

Slurm version 14.11.8 has been released with this bug fix.

I think we are seeing this issue again.
Jobs seem to be scheduled based on the job priority alone while ignoring the partition priority:

user A submits job1 with priority=100 to partitionA with priority=21
user B submits job2 with priority=1000 to partitionB with priority=20
User B's job goes first

Both jobs are submitted to multiple partitions with mixed priorities.

- Lower priority queue:

[akmalm@pud22 ~]$ scontrol show partition idleGpu
PartitionName=idleGpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=clus[081-160,241-320,353-362,626-665],pnod[0666-0867]
   Priority=11 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=12608 TotalNodes=412 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

- Higher priority queue:

[akmalm@pud22 ~]$ scontrol show partition teamrottoGpu
PartitionName=teamrottoGpu
   AllowGroups=teamrotto,geodev AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=pnod[0748-0772],clus[136-154,268-270]
   Priority=21 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=1456 TotalNodes=47 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

- Jobs running in teamrottoGpu:

[akmalm@pud22 ~]$ squeue -aw pnod[0748-0772],clus[136-154,268-270]
PARTITION   PRIORITY NAME                 USER     ST TIME    NODES NODELIST(REASON JOBID
idleGpu     1000     rt_tomo4a_TTI_Panaus zayn     R  1:13:04 1     pnod0763 14795781_272
idleGpu     1000     rt_tomo4a_TTI_Panaus zayn     R  1:03:42 1     pnod0767 14795781_325
idle        800      pt2_000_reg01_3x5_pr tomj     R  3:35:52 1     pnod0750 14751438_134
teamrottoGp 400      5804_SRME_South_8136 joshuaz  R  16:07   1     pnod0771 14799464_2284
teamrottoGp 400      5804_SRME_South_8136 joshuaz  R  16:07   1     pnod0766 14799464_2285
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  42:33   1     pnod0761 14786431_405
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  42:33   1     pnod0769 14786431_406
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  40:00   1     pnod0764 14786431_415
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  34:54   1     pnod0757 14786431_424
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  34:54   1     pnod0749 14786431_425
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  31:31   1     pnod0765 14786431_435
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  23:53   1     pnod0753 14786431_479
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  21:20   1     pnod0759 14786431_500
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  21:20   1     pnod0758 14786431_501
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  14:26   1     pnod0752 14786431_543
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  14:26   1     pnod0754 14786431_544
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  13:35   1     pnod0770 14786431_553
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  12:44   1     pnod0762 14786431_558
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  12:44   1     pnod0755 14786431_559
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  11:03   1     pnod0751 14786431_571
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  11:03   1     pnod0760 14786431_572
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  8:30    1     pnod0768 14786431_586
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  8:30    1     pnod0772 14786431_587
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  5:08    1     pnod0748 14786431_604
idleGpu     200      180_SWAMP_dcepticon_ katnisss R  0:53    1     pnod0756 14786431_624
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  6:18:07 1     clus268 14746919_238
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  6:14:47 1     clus270 14746919_240
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:39:08 1     clus146 14746919_242
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:38:22 1     clus140 14746919_243
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:36:46 1     clus143 14746919_244
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:32:28 1     clus149 14746919_245
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:28:31 1     clus152 14746919_246
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:26:41 1     clus138 14746919_247
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:26:30 1     clus153 14746919_248
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:25:03 1     clus147 14746919_249
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:07:28 1     clus151 14746919_250
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:02:49 1     clus137 14746919_251
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  4:00:27 1     clus139 14746919_252
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  3:33:39 1     clus145 14746919_253
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  3:16:55 1     clus141 14746919_254
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  3:13:58 1     clus142 14746919_255
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  2:35:54 1     clus148 14746919_257
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  2:28:33 1     clus136 14746919_258
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  2:17:41 1     clus150 14746919_259
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  1:50:36 1     clus144 14746919_260
teamrottoGp 100      dp_SWaMP_Nania       mishka   R  37:53   1     clus269 14746919_261

- There are jobs from idleGpu getting onto the nodes while there are still pending jobs in teamrottoGpu:

[akmalm@pud22 ~]$ squeue -p teamrottoGpu -tPD
PARTITION   PRIORITY NAME                 USER     ST TIME    NODES NODELIST(REASON JOBID
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_162
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_165
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_166
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_167
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_168
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_169
teamrottoGp 100      dp_SWaMP_Nania       mishka   PD 0:00    1     (Resources) 14746919_[262-990]

teamrottoGpu and idleGpu have overlapping nodes.

Let me investigate and get back to you on this.

David

Thanks David

Have you applied the patch for the 1892 problem?

David, do you mean bug 1829?
Yes, we've applied that patch yesterday.

Please attach output from "scontrol show job" for some of the affected jobs (both pending and running), plus the slurmctld log file for the period when jobs were started which should not have been.

Created attachment 2102 [details]
slurmctld.log
- scontrol show output for the jobs that should have priority:
JobId=14811805 ArrayJobId=14746919 ArrayTaskId=290 JobName=dp_SWaMP_asd
UserId=mashak(1376) GroupId=teamrotto(2116)
Priority=100 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=07:01:12 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2015-08-04T13:02:08 EligibleTime=2015-08-04T13:02:09
StartTime=2015-08-07T02:34:31 EndTime=2015-08-07T03:34:31
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=teamrottoGpu AllocNode:Sid=pud153:9997
ReqNodeList=(null) ExcNodeList=(null)
NodeList=clus147
BatchHost=clus147
NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=27244M MinTmpDiskNode=16000M
Features=gpu&localdisk Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/rj.dp_SWaMP_asd.NsIHW3
WorkDir=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST
Comment=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/production.job
StdErr=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/dp_SWaMP_asd/dp_SWaMP_asd.o14746919.290
StdIn=/dev/null
StdOut=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/dp_SWaMP_asd/dp_SWaMP_asd.o14746919.290
JobId=14774568 ArrayJobId=14746919 ArrayTaskId=166 JobName=dp_SWaMP_asd
UserId=mishka(1376) GroupId=teamrotto(2116)
Priority=100 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2015-08-05T14:42:57 EligibleTime=2015-08-05T14:43:07
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=teamrottoGpu,idleGpu AllocNode:Sid=pud153:9997
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
BatchHost=clus270
NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=27244M MinTmpDiskNode=16000M
Features=gpu&localdisk Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/rj.dp_SWaMP_asd.NsIHW3
WorkDir=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST
Comment=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/production.job
StdErr=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/dp_SWaMP_asd/dp_SWaMP_asd.o14746919.166
StdIn=/dev/null
StdOut=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/asd/dp_SWaMP_asd.o14746919.166
- I don't have scontrol output for the jobs that stole the nodes. They have all finished running.
- These are the times when the lower priority jobs got scheduled, taken from the logs:
[root@pque0001 ~]# squeue -aw clus[136-153,268-270],pnod[0748-0772] | grep ^idle | awk '{print $9}' | xargs -Ijobid grep jobid /var/log/slurm/slurmctld.log | grep -E "Allocate|Started"
[2015-08-06T12:19:30.429] sched: Allocate JobId=14795781_272 (14796635) NodeList=clus263 #CPUs=32
[2015-08-06T15:20:27.460] backfill: Started JobId=14795781_272 (14796635) on pnod0763
[2015-08-06T12:21:23.547] sched: Allocate JobId=14795781_325 (14796711) NodeList=pnod0727 #CPUs=24
[2015-08-06T15:29:49.682] backfill: Started JobId=14795781_325 (14796711) on pnod0767
[2015-08-06T12:22:14.478] sched: Allocate JobId=14795781_349 (14796755) NodeList=pnod0723 #CPUs=24
[2015-08-06T15:34:55.806] backfill: Started JobId=14795781_349 (14796755) on pnod0756
[2015-08-06T12:57:39.983] backfill: Started JobId=14751438_134 (14764471) on pnod0750
[2015-08-06T15:50:58.286] backfill: Started JobId=14786431_405 (14799194) on pnod0761
[2015-08-06T15:50:58.293] backfill: Started JobId=14786431_406 (14799195) on pnod0769
[2015-08-06T15:53:31.380] backfill: Started JobId=14786431_415 (14799215) on pnod0764
[2015-08-06T15:58:37.829] backfill: Started JobId=14786431_424 (14799258) on pnod0757
[2015-08-06T15:58:37.834] backfill: Started JobId=14786431_425 (14799259) on pnod0749
[2015-08-06T16:02:00.939] backfill: Started JobId=14786431_435 (14799292) on pnod0765
[2015-08-06T16:09:38.038] backfill: Started JobId=14786431_479 (14799358) on pnod0753
[2015-08-06T16:12:11.106] backfill: Started JobId=14786431_500 (14799414) on pnod0759
[2015-08-06T16:12:11.112] backfill: Started JobId=14786431_501 (14799415) on pnod0758
[2015-08-06T16:19:05.525] backfill: Started JobId=14786431_543 (14799480) on pnod0752
[2015-08-06T16:19:05.531] backfill: Started JobId=14786431_544 (14799481) on pnod0754
[2015-08-06T16:19:56.507] backfill: Started JobId=14786431_553 (14799491) on pnod0770
[2015-08-06T16:20:47.624] backfill: Started JobId=14786431_558 (14799496) on pnod0762
[2015-08-06T16:20:47.628] backfill: Started JobId=14786431_559 (14799497) on pnod0755
[2015-08-06T16:22:28.653] backfill: Started JobId=14786431_571 (14799509) on pnod0751
[2015-08-06T16:22:28.658] backfill: Started JobId=14786431_572 (14799510) on pnod0760
[2015-08-06T11:40:05.268] sched: Allocate JobId=14758925_6816 (14758925) NodeList=clus129 #CPUs=40
[2015-08-06T16:24:10.650] backfill: Started JobId=14758925_6816 (14758925) on pnod0748
- I've attached the full log
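The grep pipeline above mixes "sched: Allocate" and "backfill: Started" events, and telling them apart is what distinguishes main-scheduler placements from backfill placements. A small, hypothetical helper (not part of Slurm) that groups those start events per job could look like this; the log-line format is taken from the lines quoted above:

```python
import re
from collections import defaultdict

# Match slurmctld start events such as:
#   [2015-08-06T12:19:30.429] sched: Allocate JobId=14795781_272 (14796635) ...
#   [2015-08-06T15:20:27.460] backfill: Started JobId=14795781_272 (14796635) ...
LINE_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] "
    r"(?P<kind>sched: Allocate|backfill: Started) "
    r"JobId=(?P<jobid>\S+) \((?P<rawid>\d+)\)"
)

def start_events(lines):
    """Map raw job id -> list of (scheduler, timestamp) start events."""
    events = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            sched = "backfill" if m.group("kind").startswith("backfill") else "sched"
            events[m.group("rawid")].append((sched, m.group("when")))
    return events

# Sample lines copied from the log excerpt above.
sample = [
    "[2015-08-06T12:19:30.429] sched: Allocate JobId=14795781_272 (14796635) NodeList=clus263 #CPUs=32",
    "[2015-08-06T15:20:27.460] backfill: Started JobId=14795781_272 (14796635) on pnod0763",
    "[2015-08-06T12:57:39.983] backfill: Started JobId=14751438_134 (14764471) on pnod0750",
]

ev = start_events(sample)
print(ev["14796635"])  # the sched allocation followed by the later backfill start
```

Jobs whose only event is "backfill: Started" were placed by the backfill scheduler, which is the population of interest in this ticket.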
This will not impact scheduling, but it could cause other problems: the slurm.conf files are different on quite a few of your nodes. Run "grep HASH" on your slurmctld log file to identify the nodes:

$ grep HASH slurmctld.log | head -3
[2015-08-06T03:47:26.196] error: Node bud34 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2015-08-06T04:20:47.079] error: Node bud34 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2015-08-06T04:54:11.535] error: Node bud34 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
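Slurm raises these errors by comparing a hash of each node's slurm.conf against the controller's copy. An out-of-band check along the same lines can be sketched as follows; this is a hypothetical illustration (assuming you have collected each node's slurm.conf contents locally, e.g. via pdsh or scp), not Slurm's internal mechanism:

```python
import hashlib

def divergent_nodes(ctld_conf: str, node_confs: dict) -> list:
    """Return node names whose slurm.conf contents differ from the slurmctld's copy."""
    ref = hashlib.md5(ctld_conf.encode()).hexdigest()
    return sorted(
        node for node, text in node_confs.items()
        if hashlib.md5(text.encode()).hexdigest() != ref
    )

# Illustrative contents only; real files would be the full slurm.conf text.
ctld = "SchedulerType=sched/backfill\n"
nodes = {
    "bud34": "SchedulerType=sched/builtin\n",  # stale copy, like the node in the log
    "clus001": "SchedulerType=sched/backfill\n",
}
print(divergent_nodes(ctld, nodes))  # ['bud34']
```

Distributing one identical slurm.conf to every node (or using a shared filesystem for it) makes the HASH errors go away.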
(In reply to Akmal Madzlan from comment #24)
> - scontrol show for jobs that should have priority
>
> JobId=14811805 ArrayJobId=14746919 ArrayTaskId=290 JobName=dp_SWaMP_asd
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Features=gpu&localdisk Gres=(null) Reservation=(null)
>
> JobId=14774568 ArrayJobId=14746919 ArrayTaskId=166 JobName=dp_SWaMP_asd
>    Partition=teamrottoGpu,idleGpu AllocNode:Sid=pud153:9997
>    Features=gpu&localdisk Gres=(null) Reservation=(null)

Note that these two jobs have "Features=gpu&localdisk" while the nodes pnod[0666-0867] don't have those features:

> NodeName=pnod[0666-0867] CPUs=24 RealMemory=128000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,4phi,4gpu

So while they might have a higher priority than other jobs, that's not going to stop other jobs from using those nodes:

> [akmalm@pud22 ~]$ squeue -aw pnod[0748-0772],clus[136-154,268-270]
> PARTITION PRIORITY NAME USER ST TIME NODES NODELIST(REASON JOBID
> idleGpu 1000 rt_tomo4a_TTI_Panaus zayn R 1:13:04 1 pnod0763 14795781_272
> idleGpu 1000 rt_tomo4a_TTI_Panaus zayn R 1:03:42 1 pnod0767 14795781_325
> idleGpu 200 180_SWAMP_dcepticon_ katnisss R 42:33 1 pnod0761 14786431_405
> idleGpu 200 180_SWAMP_dcepticon_ katnisss R 42:33 1 pnod0769 14786431_406

There definitely was a bug with respect to scheduling job arrays that you now have the fix for. I don't have a complete picture of what is happening with respect to the jobs/priorities/queues/etc., but I'm not seeing any evidence that anything is currently wrong. Does the job's "Feature" specification explain the anomaly? If not, could you provide matching "scontrol show job" output and slurmctld log files demonstrating the anomaly?

I think you are right. Those jobs don't get onto those nodes because they request localdisk, which doesn't exist on pnod*.

Thanks Moe!

(In reply to Akmal Madzlan from comment #27)
> I think you are right.
> That job doesnt get on those node because it request
> localdisk which doesnt exist in pnod*.
>
> Thanks Moe!

Closing this ticket.
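The resolution comes down to feature matching: a job's Features expression must be satisfied by a node's Feature list before the node is even a candidate, regardless of priority. A simplified model of the AND-only case discussed here (Slurm's real parser also handles OR, parentheses, and more) can be sketched as:

```python
def node_satisfies(job_features: str, node_features: str) -> bool:
    """AND-only feature match: every feature in 'a&b&c' must appear in the
    node's comma-separated feature list. A simplified model, not Slurm's
    actual expression parser."""
    required = set(job_features.split("&"))
    available = set(node_features.split(","))
    return required <= available

# The pending jobs above require gpu&localdisk, while the pnod* nodes
# advertise Feature=gpu,4phi,4gpu.
print(node_satisfies("gpu&localdisk", "gpu,4phi,4gpu"))  # False: no localdisk
print(node_satisfies("gpu&localdisk", "gpu,localdisk"))  # True
```

This is why the higher-priority jobs could not block other jobs from the pnod* nodes: they were never eligible to run there in the first place.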
Created attachment 2004 [details]
slurmctld log, config and some squeue output attached

[root@pud22 ~]# squeue -aw clus[081-084,086-103,109-116,241-254],pnod[0666-0707]
JOBID         PARTITION NAME     USER  ST
14125227_157  teamcottG 230_SI_R user  R
14125227_45   teamcottG 230_SI_R user  R
14121150_1665 teamcottG 221_Asim user1 R
14121153_72   teamcottG 221_Asim user1 R
14121153_71   teamcottG 221_Asim user1 R
14121153_70   teamcottG 221_Asim user1 R
14121153_68   teamcottG 221_Asim user1 R
14121153_67   teamcottG 221_Asim user1 R
14121153_66   teamcottG 221_Asim user1 R
14112808_229  idleGpu   rt_sten0 user2 R
14121153_65   teamcottG 221_Asim user1 R
14121153_64   teamcottG 221_Asim user1 R
14121153_63   teamcottG 221_Asim user1 R
14121153_62   teamcottG 221_Asim user1 R

user2's job (14112808_229) should not get onto the node because a lot of user1's jobs are still in the PD state. They should get higher priority due to the higher priority queue. One thing I noticed is that the job was scheduled by backfill. Probably a backfill bug?