| Summary: | Jobs won't schedule | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Broadley <bill.broadley> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NREL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | one hour of logs showing the lack of jobs running. |
| | requested slurm.conf |
| | requested 20 minutes with extra debugging |
| | Logs with Backfill,BackfillMap enabled. |
Description
Bill Broadley
2021-09-30 11:55:34 MDT
I put all the long jobs (that wouldn't run anyway because of the reservation) on hold. Suddenly all the short jobs started.

Hi Bill,

I have a short answer and a long answer. Let me know if you have any additional questions, or if you are satisfied with the answer and we can close this ticket.

Here's my summary of the long answer: I think backfill was timing out. My recommendations for your SchedulerParameters:

* Increase bf_max_time so that the backfill scheduler can evaluate more jobs. You may want to run with the backfill debug flag turned on so you can see what backfill is doing and when it is ending; this can help you tune bf_max_time.
* Greatly reduce or entirely remove default_queue_depth. (If you remove it, it is set to the default value of 100.)

Here's my long answer, which turned out to be a lot longer than I originally intended, but I tried to be thorough.

It can be hard to diagnose exactly what is happening here. Based on your description of what happened (holding the large jobs and then smaller jobs started running), I can make some guesses, but I don't know without seeing logs:

* Because the larger jobs were higher priority, the main scheduler couldn't start the smaller jobs.
* Backfill was certainly planning nodes for those larger jobs. So why didn't smaller jobs backfill onto the idle nodes?
* Maybe backfill wasn't reaching the smaller jobs. It's possible that backfill stopped before it got to any small jobs that could actually be backfilled. This could definitely happen if the system was busy and reached max_rpc_cnt, which would cause backfill to exit early. Or, more likely, I think backfill timed out before it evaluated the smaller jobs. I recommend increasing bf_max_time.
* It's possible that the slightly longer jobs (the few hundred jobs that could run in short or amo) were getting backfill-planned and were blocking the smaller jobs from running. But I have no way to tell.
I think that backfill timing out is the more likely culprit.

* Do your users submit jobs to multiple partitions? This can also cause surprising results. In fact, it is possible for a job that is submitted to multiple partitions to block itself or other jobs from backfilling in a lower priority partition. I can give you more info on this if you want; we are working on improving the documentation around backfill, in particular how this can happen.

Anyway, are you still seeing problems, or are jobs getting scheduled normally?

In the future, if you are ever wondering what the backfill scheduler is doing, I recommend turning on the backfill-related debug flags:

scontrol setdebugflags +backfill,backfillmap

This will be very verbose, so only keep it on for 5 or 10 minutes, then turn it off with scontrol again:

scontrol setdebugflags -backfill,backfillmap

That will give you a complete picture of what backfill is doing with each job. When you submit a ticket like this, we'd love the slurmctld log file for those 5-10 minutes with the backfill,backfillmap debug flags turned on. In addition, I recommend you run sdiag once per minute during this time period and upload the results. If you're running into longer-term scheduling issues, then running sdiag periodically (say, once per hour) for a few days and uploading the results can also be really helpful. We'd also like the whole slurm.conf, especially the partition definitions, so we can see the PriorityTier of each partition.

Here are my thoughts about your SchedulerParameters:

defer
* This means the "quick cycle" of the main scheduler won't run on job submission or job completion. This can be really helpful on systems with lots of RPCs, since it means the slurmctld can spend more time handling those RPCs instead of constantly running the quick scheduling cycle.
But it does have the downside of potentially delaying the start of some jobs until the quick cycle runs again, which happens every batch_sched_delay seconds.
* As long as you know what this parameter does, it's fine to have it.

default_queue_depth=10750
* This parameter affects how many jobs the quick cycle of the main scheduler will look at. It does not affect backfill, nor does it affect the full cycle of the main scheduler. With this parameter as high as you have it, you are making the quick cycle of the main scheduler take far longer (up to max_sched_time seconds). This could leave the slurmctld with less time to process other RPCs. The full cycle of the main scheduler runs once per sched_interval seconds (60 seconds by default), and the full cycle goes through all the jobs in the queue anyway.
* I personally recommend you don't set this parameter very high, since the main scheduler's full cycle and the backfill scheduler will evaluate all the jobs in the queue (unless they time out before evaluating all the jobs).

max_rpc_cnt=125
* Scheduling won't happen if there are this many pending RPCs, which can cause backfill to exit early. Obviously this is a balancing act: you don't want slurmctld to be overwhelmed by RPCs, but you still want slurmctld to be able to schedule jobs.
* I think this value is fine.

max_sched_time=6
* This extends the time of the main scheduler (both the quick and the full cycles).
* I think this is fine - I'm guessing you increased this so that the main scheduler could evaluate all the jobs. Setting this parameter high enough for the full cycle of the main scheduler to evaluate all the jobs means you really don't need default_queue_depth very high.

partition_job_depth=10500
* The main scheduler will evaluate up to this many jobs in each partition. If it spends all its time on the higher priority partitions, it's possible that it never reaches the lower priority partitions.
sched_max_job_start=200
* The maximum number of jobs that the main scheduling logic will start in any single execution. The default value is zero, which imposes no limit.
* I'm very confused about why you have this value set to 200 while default_queue_depth is set to that ginormous number. Why don't you allow the main scheduler to start more than 200 jobs per cycle?

sched_min_interval=2000000
* This means the quick cycle of the main scheduler won't run more often than every 2 seconds. I think that's fine, although this mostly overlaps with defer. It doesn't entirely overlap with defer, so having them both set is fine, although in practice with defer and batch_sched_delay at 20 seconds, you probably won't be able to tell the difference if you change this parameter.
* If you remove defer, definitely still keep this parameter.
* With this parameter set to 2 seconds, you may not need defer, but it's probably fine either way.

batch_sched_delay=20
* With defer set, this means that jobs can wait up to 20 seconds after being submitted before they are evaluated by the quick cycle of the main scheduler.
* This is fine as long as you know what it's doing.

bf_max_job_test=13000
* I assume you set this high enough to evaluate as many jobs in the queue as possible. That's fine; just know that backfill will bail when it has run for bf_max_time seconds, which by default is the same value as bf_interval. So I suspect that the backfill scheduler probably isn't evaluating anywhere near 13000 jobs, but is timing out well before that.
* I recommend you set bf_max_time.

bf_interval=30
* After a backfill cycle completes, slurmctld will wait this many seconds before starting another backfill cycle.
* This is the default, and I think it's fine.

bf_continue
* This is good to see. Personally, I don't know why anyone would *not* have bf_continue.
bf_window=15840,bf_resolution=250
* Make sure that job time limits are less than or equal to bf_window, or else those jobs can be starved forever.
* bf_resolution should go up if bf_window goes up, but you don't want bf_resolution too high, or else the backfill scheduler may not be able to backfill as many jobs when there are jobs with very short time limits. I think your current value is fine, but it probably wouldn't hurt to raise it as high as 600.

max_switch_wait=172800
* I assume you really don't want jobs to run on more switches than they requested. I think this is fine.

> One possible contributing factor is:
> [2021-09-30T11:45:20.026] error: _het_job_configuring_test: Hetjob leader
> JobId=3060751+1(3060751) job list is NULL
>
> It's pretty weird, since this job is not in the slurm database:
> select min(id_job) from eagle_job_table;
> +-------------+
> | min(id_job) |
> +-------------+
> | 6913483 |
>
> Could that error potentially halt the scheduling decisions as it proceeds
> through the list of eligible jobs?
>
> Any way to get rid of that job?

Do you still see this error in the slurmctld? Are you able to query this job with scontrol?

scontrol show job 3060751

Are you able to cancel the job with scancel?

scancel 3060751

I looked at this error and I don't see how it could affect the scheduling of other jobs, so I think it is unrelated to the scheduling issues.

Hi Bill, have you had a chance to review my response?

Due to a lack of response I'm closing this as infogiven. If you have more questions relating to this ticket, feel free to respond and re-open the ticket.

Sorry for the delay. We followed your suggestions; our current SchedulerParameters:
SchedulerParameters=defer,default_queue_depth=10750,max_rpc_cnt=125,max_sched_time=6,partition_job_depth=10500,sched_max_job_start=200,sched_min_interval=2000000,batch_sched_delay=20,bf_max_time=75,bf_interval=30,bf_continue,bf_window=15840,bf_resolution=250,max_switch_wait=172800
Setting bf_max_time=75 seems to have helped, but we are still having issues, in particular our GPU partition is failing to schedule.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 2-00:00:00 20 alloc r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,35],r105u[09,17,21,23,29,31]
gpu up 2-00:00:00 25 idle r103u[01,03,05,07,09,11,13,15,17,19],r104u[27,29,31,37],r105u[01,03,05,07,11,13,15,19,25,27,35]
So we have 25 nodes idle, but:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7893443+2 gpu rad dbiagion PD 0:00 1 (None)
7893464+2 gpu rad dbiagion PD 0:00 1 (None)
7895859 gpu ga0 ctripp PD 0:00 1 (Priority)
7895860 gpu ga0 ctripp PD 0:00 1 (Priority)
7895861 gpu ga0 ctripp PD 0:00 1 (Priority)
7895862 gpu ga0 ctripp PD 0:00 1 (Priority)
7895863 gpu ga0 ctripp PD 0:00 1 (Priority)
7895865 gpu ga1 ctripp PD 0:00 1 (Priority)
7895866 gpu ga1 ctripp PD 0:00 1 (Priority)
7895867 gpu ga1 ctripp PD 0:00 1 (Priority)
7895868 gpu ga1 ctripp PD 0:00 1 (Priority)
7895869 gpu ga1 ctripp PD 0:00 1 (Priority)
7895873 gpu ga3 ctripp PD 0:00 1 (Priority)
7895874 gpu ga3 ctripp PD 0:00 1 (Priority)
7895875 gpu ga3 ctripp PD 0:00 1 (Priority)
7895876 gpu ga3 ctripp PD 0:00 1 (Priority)
7895877 gpu ga3 ctripp PD 0:00 1 (Priority)
The two "None" jobs have been in that state for at least 24 hours. I bounced the slurmctld with no impact.

I'll upload the last hour of logs. I have SlurmctldDebug=debug set, so hopefully there's enough info to help.

Currently, despite hundreds of jobs queued for the gpu partition, no jobs are running.
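As an aside for anyone triaging a similar backlog: the pending reasons in a squeue listing like the one above can be tallied directly with `squeue -p gpu -t PD -h -o "%r"`, which prints only the Reason column. A minimal sketch; the here-doc sample stands in for live squeue output so the pipeline can be run anywhere:

```shell
# Tally pending jobs by their Reason field, as in the squeue listing above.
# On the cluster the live equivalent would be:
#   squeue -p gpu -t PD -h -o "%r" | sort | uniq -c | sort -rn
tally_reasons() {
    sort | uniq -c | sort -rn
}

# Sample reasons mirroring the listing above (2 "None", 3 "Priority"):
tally_reasons <<'EOF'
None
None
Priority
Priority
Priority
EOF
```

Jobs stuck with reason "None" for a day, as here, deserve separate attention from ordinary "Priority" waits.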
Created attachment 22283 [details]
one hour of logs showing the lack of jobs running.
Let me know if there's any other helpful info.
bf_max_time=75 is still pretty short for a queue as large as 12000 jobs. Try increasing this to at least 5 minutes (bf_max_time=300). Some sites have this set to 10 minutes or longer. Try running sdiag before and after a backfill scheduling cycle and upload the output. You'll be able to see how many jobs the backfill and main schedulers are getting through on average and on the last cycle with sdiag.

Also, try removing "defer". This will enable the quick scheduling cycle to run more often, which might get more jobs started. Unless your slurmctld is constantly overloaded by RPCs, I recommend not using defer. But you can add defer back in when slurmctld is overloaded by RPCs.

Removed defer, and changed bf_max_time=300:

SchedulerParameters=default_queue_depth=10750,max_rpc_cnt=125,max_sched_time=6,partition_job_depth=10500,sched_max_job_start=200,sched_min_interval=2000000,batch_sched_delay=20,bf_max_time=300,bf_interval=30,bf_continue,bf_window=15840,bf_resolution=250,max_switch_wait=172800

I did a scontrol reconfig and then did a service slurmctld stop and start. I haven't seen any change; I'm not sure how often the scheduler cycles are, but I do have two sdiag's from after the restart, 20 minutes apart:

*******************************************************
sdiag output at Tue Nov 16 12:13:08 2021 (1637089988)
Data since      Tue Nov 16 12:06:21 2021 (1637089581)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   7
Jobs completed: 10
Jobs canceled:  0
Jobs failed:    0

Job states ts: Tue Nov 16 12:13:07 2021 (1637089987)
Jobs pending:  12859
Jobs running:  1124

Main schedule statistics (microseconds):
    Last cycle:   5210602
    Max cycle:    5995719
    Total cycles: 20
    Mean cycle:   4438886
    Mean depth cycle:  4413
    Cycles per minute: 3
    Last queue length: 15908

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
    Total backfilled jobs (since last slurm start): 7
    Total backfilled jobs (since last stats cycle start): 7
    Total backfilled heterogeneous job components: 0
    Total cycles: 7
    Last cycle when: Tue Nov 16 12:12:55 2021 (1637089975)
    Last cycle: 23671701
    Max cycle:  32448738
    Mean cycle: 18938441
    Last depth cycle: 2196
    Last depth cycle (try sched): 2196
    Depth Mean: 3382
    Depth Mean (try depth): 3382
    Last queue length: 15908
    Queue length mean: 18184
    Last table size: 100
    Mean table size: 86

Latency for 1000 calls to gettimeofday(): 23 microseconds

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5331 ave_time:72726   total_time:387706843
    REQUEST_PARTITION_INFO           ( 2009) count:1364 ave_time:2220    total_time:3028628
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:1035 ave_time:991780  total_time:1026492758
    REQUEST_FED_INFO                 ( 2049) count:239  ave_time:129     total_time:30955
    REQUEST_JOB_USER_INFO            ( 2039) count:224  ave_time:1065989 total_time:238781704
    REQUEST_JOB_INFO                 ( 2003) count:70   ave_time:556365  total_time:38945575
    REQUEST_STEP_COMPLETE            ( 5016) count:53   ave_time:673844  total_time:35713769
    REQUEST_NODE_INFO                ( 2007) count:51   ave_time:448274  total_time:22862018
    REQUEST_JOB_STEP_CREATE          ( 5001) count:41   ave_time:25571   total_time:1048450
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:41   ave_time:1094391 total_time:44870045
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:20   ave_time:3390545 total_time:67810901
    REQUEST_COMPLETE_PROLOG          ( 6018) count:15   ave_time:121     total_time:1824
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:15   ave_time:137     total_time:2055
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:10   ave_time:1123552 total_time:11235521
    REQUEST_STATS_INFO               ( 2035) count:9    ave_time:142     total_time:1286
    REQUEST_SHARE_INFO               ( 2022) count:9    ave_time:15211   total_time:136907
    REQUEST_UPDATE_NODE              ( 3002) count:3    ave_time:13824   total_time:41474
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:2    ave_time:1985    total_time:3971
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:1    ave_time:1782560 total_time:1782560
    REQUEST_RECONFIGURE              ( 1003) count:1    ave_time:3436794 total_time:3436794

Remote Procedure Call statistics by user
    root ( 0) count:7603 ave_time:204829 total_time:1557318196

*******************************************************
sdiag output at Tue Nov 16 12:28:03 2021 (1637090883)
Data since      Tue Nov 16 12:06:21 2021 (1637089581)
*******************************************************
Server thread count:  4
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 5
Jobs started:   25
Jobs completed: 25
Jobs canceled:  0
Jobs failed:    0

Job states ts: Tue Nov 16 12:27:35 2021 (1637090855)
Jobs pending:  12848
Jobs running:  1124

Main schedule statistics (microseconds):
    Last cycle:   2498447
    Max cycle:    5995719
    Total cycles: 62
    Mean cycle:   4625937
    Mean depth cycle:  4890
    Cycles per minute: 2
    Last queue length: 15892

Backfilling stats
    Total backfilled jobs (since last slurm start): 19
    Total backfilled jobs (since last stats cycle start): 19
    Total backfilled heterogeneous job components: 0
    Total cycles: 24
    Last cycle when: Tue Nov 16 12:27:36 2021 (1637090856)
    Last cycle: 17526348
    Max cycle:  32448738
    Mean cycle: 19814947
    Last depth cycle: 3412
    Last depth cycle (try sched): 3412
    Depth Mean: 3366
    Depth Mean (try depth): 3366
    Last queue length: 15894
    Queue length mean: 15903
    Last table size: 100
    Mean table size: 96

Latency for 1000 calls to gettimeofday(): 23 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO           ( 2009) count:6394 ave_time:770     total_time:4926461
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5581 ave_time:70044   total_time:390920269
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:5328 ave_time:996617  total_time:5309978964
    REQUEST_FED_INFO                 ( 2049) count:809  ave_time:134     total_time:108950
    REQUEST_JOB_USER_INFO            ( 2039) count:761  ave_time:858364  total_time:653215595
    REQUEST_JOB_INFO                 ( 2003) count:202  ave_time:694375  total_time:140263862
    REQUEST_STEP_COMPLETE            ( 5016) count:157  ave_time:701874  total_time:110194292
    REQUEST_NODE_INFO                ( 2007) count:154  ave_time:550409  total_time:84763090
    REQUEST_JOB_STEP_CREATE          ( 5001) count:133  ave_time:25808   total_time:3432504
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:133  ave_time:1114015 total_time:148164044
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:51   ave_time:2489295 total_time:126954083
    REQUEST_COMPLETE_PROLOG          ( 6018) count:48   ave_time:867411  total_time:41635767
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:48   ave_time:135     total_time:6505
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:46   ave_time:7939    total_time:365212
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:26   ave_time:844082  total_time:21946152
    REQUEST_SHARE_INFO               ( 2022) count:26   ave_time:16123   total_time:419198
    REQUEST_STATS_INFO               ( 2035) count:13   ave_time:141     total_time:1835
    REQUEST_UPDATE_NODE              ( 3002) count:13   ave_time:579190  total_time:7529475
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:4    ave_time:2309448 total_time:9237795
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:2    ave_time:915233  total_time:1830467
    REQUEST_RECONFIGURE              ( 1003) count:1    ave_time:3436794 total_time:3436794
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:1    ave_time:27574   total_time:27574
    REQUEST_JOB_READY                ( 4019) count:1    ave_time:51544   total_time:51544

Remote Procedure Call statistics by user
    root ( 0) count:16790 ave_time:362576 total_time:6087658407

> I haven't seen any change, not sure how often the scheduler cycles are,

This is determined by bf_interval (which you have set to 30 seconds).

> but I do have two sdiag's from after the restart and 20 minutes apart:

I'm taking a look at the sdiag.

* Queue length is around 16000-18000.
* Backfill mean time is 19.8 seconds. Max time is 32.4 seconds. That's not even close to the 300 seconds of bf_max_time. Why is it finishing so soon instead of going through all the jobs? You have bf_continue. Maybe the RPC count goes too high?
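As an aside, the RPC tables in sdiag dumps like the ones above can be ranked by total_time to surface the message types consuming the most slurmctld time. A sketch using a few sample lines copied from the first dump; on a live system you would pipe `sdiag` output through the same filter:

```shell
# Rank sdiag RPC entries by total_time (microseconds of slurmctld time spent).
# The sample lines below are copied from the sdiag output in this ticket.
rank_rpcs() {
    awk '/count:.*total_time:/ {
            split($0, f, "total_time:")      # f[2] holds the total_time value
            printf "%12d  %s\n", f[2], $1    # $1 is the message type name
         }' | sort -rn
}

rank_rpcs <<'EOF'
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5331 ave_time:72726 total_time:387706843
REQUEST_PARTITION_INFO ( 2009) count:1364 ave_time:2220 total_time:3028628
REQUEST_NODE_INFO_SINGLE ( 2040) count:1035 ave_time:991780 total_time:1026492758
REQUEST_FED_INFO ( 2049) count:239 ave_time:129 total_time:30955
EOF
```

Sorting by total_time rather than count is what makes REQUEST_NODE_INFO_SINGLE stand out here despite its modest count: its per-call average is nearly a second.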
It might be the case - I'm concerned about these logs:

debug: Note large processing time from _slurm_rpc_dump_node_single: usec=1287017 began=10:06:24.340

Is a script calling squeue or scontrol show job (or another Slurm client command) in a loop? That would be bad. See these lines in sdiag:

MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5331 ave_time:72726 total_time:387706843
REQUEST_PARTITION_INFO ( 2009) count:1364 ave_time:2220 total_time:3028628
REQUEST_NODE_INFO_SINGLE ( 2040) count:1035 ave_time:991780 total_time:1026492758
REQUEST_FED_INFO ( 2049) count:239 ave_time:129 total_time:30955
REQUEST_JOB_USER_INFO ( 2039) count:224 ave_time:1065989 total_time:238781704

REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE are really high compared to the next ones below them. Can you investigate this and stop any script that is doing this?

Can you run one more test for me?

scontrol setdebugflags +backfill,backfillmap
(wait for 20 minutes)
scontrol setdebugflags -backfill,backfillmap

Then can you upload the slurmctld log file? This will show us exactly what the backfill scheduler is doing. Could you also upload your whole slurm.conf?

Created attachment 22287 [details]
requested slurm.conf
slurm.conf uploaded; 20 minute sample being recorded.

Created attachment 22288 [details]
requested 20 minutes with extra debugging
$ date; scontrol setdebugflags +backfill,backfillmap; sleep 1200; scontrol setdebugflags -backfill,backfillmap;date
Tue Nov 16 14:28:27 MST 2021
Tue Nov 16 14:48:28 MST 2021
$
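For the periodic sdiag sampling requested earlier in the ticket, a small POSIX-shell helper can automate the collection. This is a sketch, not part of the ticket: the output directory, cadence, and function name are illustrative:

```shell
#!/bin/sh
# collect_sdiag CMD OUTDIR COUNT INTERVAL
# Runs CMD (normally just "sdiag") COUNT times, INTERVAL seconds apart,
# writing each snapshot to a timestamped file under OUTDIR for upload.
collect_sdiag() {
    cmd=$1; outdir=$2; count=$3; interval=$4
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$count" ]; do
        # $cmd is intentionally unquoted so a multi-word command works
        $cmd > "$outdir/sdiag-$(date +%s)-$i.txt" 2>&1
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
}

# On the cluster you might run, e.g.:
#   collect_sdiag sdiag /tmp/sdiag-snapshots 60 60   # once a minute for an hour
```

Collecting snapshots this way while the backfill debug flags are on gives the before/after views of each backfill cycle that were asked for above.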
I'll investigate the high number of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE. I checked the logs, even with debug on, and they don't seem to log the IPs for those calls. Any suggestions on tracking down who/what is causing those lookups?

I don't see the backfill debugging logs. If setting the debug flags worked, then I should see logs like this:

BACKFILL: _dump_node_space_table:

but I don't see those. Can you double check that the log you uploaded was from that test? Also, when you run that scontrol command, you should see this in the slurmctld log:

[2021-11-17T10:54:06.169] Set DebugFlags to Backfill,BackfillMap

Perhaps the scontrol command timed out?

(In reply to Bill Broadley from comment #13)
> I'll investigate the high number of REQUEST_PARTITION_INFO and
> REQUEST_NODE_INFO_SINGLE. I checked the logs, even with debug on, and it
> doesn't seem to log the IPs for those calls.
>
> Any suggestions on tracking down who/what is causing those lookups?

Look at sdiag. You should see this section:

Remote Procedure Call statistics by user
root ( 0) count:7603 ave_time:204829 total_time:1557318196

The sdiag that you uploaded only had root; I would expect all the other users to be there. Did you remove the other users from the output of sdiag? If not, does that mean that everyone on this cluster runs as root?

I think this issue is really important - the slurmctld log you uploaded is still filled with the "very large time processing" messages.

Created attachment 22302 [details]
Logs with Backfill,BackfillMap enabled.
Not sure what happened, but I think I got it right this time. The first line should show debugging enabled and the last line debugging disabled.

$ zcat log.gz | head -1
[2021-11-17T11:49:16.591] Set DebugFlags to Backfill,BackfillMap
$ zcat log.gz | tail -1
[2021-11-17T12:17:21.829] Set DebugFlags to none

> The sdiag that you uploaded only had root. I would expect all the other users to be there. Did you remove all the other users from the output of sdiag? If not, does that mean that everyone on this cluster runs as root?

Sorry, I trimmed the non-root users; all were substantially less than root. I wasn't sure if this ticket was private and didn't want to leak usernames.

> I think this issue is really important - the slurmctld log you uploaded is still filled with the "very large time processing" messages.

We collect partition and node metrics; I'm guessing that's behind the high RPC counts. Is there a way to track the socket or IP address of those requests so I can verify them? I'm seeing something like 10-20 per second, and if there's one call per node per minute for 2600 nodes, that would be around 40/sec, so I think that could be related.

(In reply to Bill Broadley from comment #16)
> Not sure what happened, but I think I got it right this time. The first
> line should show debugging enabled and the last line debugging disabled.
>
> $ zcat log.gz | head -1
> [2021-11-17T11:49:16.591] Set DebugFlags to Backfill,BackfillMap
> $ zcat log.gz | tail -1
> [2021-11-17T12:17:21.829] Set DebugFlags to none

Thanks, it worked. I'll take a look.

> > The sdiag that you uploaded only had root. I would expect all the other users to be there. Did you remove all the other users from the output of sdiag? If not, does that mean that everyone on this cluster runs as root?
>
> Sorry, I trimmed the non-root users; all were substantially less than root.
> I wasn't sure if this ticket was private and didn't want to leak usernames.

Not a problem, I suspected that was the case.
This bug is public (bugs are public by default), but if you ever want to mark it private you can, by checking the "NREL Bugs" checkbox on this bugzilla page.

Generally, root should have the most RPCs because slurmd/slurmstepd runs as root and they make a lot of RPCs to the slurmctld (such as registration messages, job/step complete messages, and epilog complete messages). So if you ever see a user who has more RPCs than root, that's usually a bad sign.

> > I think this issue is really important - the slurmctld log you uploaded is still filled with the "very large time processing" messages.
>
> We collect partition and node metrics, I'm guessing that's behind the high
> API issues. Is there a way to track the socket or IP address of those
> requests so I can verify them. I'm seeing something like 10-20 per second,
> and if there's one call per 2600 nodes per minute that would be 40/sec, so I
> think that could be related.

Here's how to track it:

scontrol setdebugflags +protocol

WARNING: it will be very verbose. For example, a single squeue sends 3 RPCs and looks like this:

[2021-11-17T12:40:19.889] PROTOCOL: _slurmctld_rpc_mgr: accept() connection from 127.0.0.1:45896
[2021-11-17T12:40:19.890] slurmctld_req: received opcode REQUEST_FED_INFO from 127.0.0.1:45896 uid 1000
[2021-11-17T12:40:19.892] PROTOCOL: _slurmctld_rpc_mgr: accept() connection from 127.0.0.1:45898
[2021-11-17T12:40:19.893] slurmctld_req: received opcode REQUEST_JOB_INFO from 127.0.0.1:45898 uid 1000
[2021-11-17T12:40:19.894] PROTOCOL: _slurmctld_rpc_mgr: accept() connection from 127.0.0.1:45900
[2021-11-17T12:40:19.894] slurmctld_req: received opcode REQUEST_PARTITION_INFO from 127.0.0.1:45900 uid 1000

You can see the IP address is from my own machine, from UID 1000 (which is me). Just make sure to turn the debug flag off as soon as you're done collecting data.

> We collect partition and node metrics, I'm guessing that's behind the high
> API issues.
That does match up with the RPCs that I see in the logs. How do you collect partition and node metrics? And how often?

Alright, I think I found the main culprits for the scheduling issues.

Main scheduling:

[2021-11-17T11:52:26.727] debug: sched: Running job scheduler
[2021-11-17T11:52:32.002] debug: sched: loop taking too long, breaking out

This means that the main scheduler isn't getting through all the jobs. It is in fact timing out after max_sched_time seconds (which you have set to 6). I'm wary about increasing this further because, unlike the backfill scheduler, the main scheduler does not periodically release locks to allow slurmctld to process other RPCs. So instead I recommend reducing partition_job_depth. Right now you have it set to 10500, but it looks likely that the main scheduler is timing out before it can get to all the partitions. Try setting partition_job_depth to something much lower (maybe 1000) and see if that makes a difference. If you still see "sched: loop taking too long, breaking out", try reducing this parameter a bit more. You may need to experiment to see what value works best.

Backfill scheduling:

I found why backfill was exiting early - bf_max_job_test is not set, so it is using the default value of 100. This is an example of the log I'm talking about:

[2021-11-17T11:56:37.757] backfill: table size limit of 100 reached
[2021-11-17T11:56:37.765] backfill: completed testing 3194(2) jobs, usec=191409

So I recommend increasing bf_max_job_test to something as large as the queue. Once that is done, you should see the backfill scheduler take a lot longer to run and evaluate many more jobs.

I see that in your original post on this bug, bf_max_job_test=13000. But in your most recent post with SchedulerParameters, bf_max_job_test was not set anymore. Was this intentionally removed? You should have both bf_max_job_test and bf_max_time high enough so that all the jobs in the queue (or as many as you want) will be evaluated.
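Pulling together the recommendations in this thread (raise bf_max_time, restore bf_max_job_test, drop defer, and shrink default_queue_depth and partition_job_depth), a revised SchedulerParameters line might look like the following. This is an illustrative sketch, not a prescription: every value needs site-specific tuning, and the bf_max_job_test value here is simply "larger than the ~13000-job queue":

```conf
# Illustrative only - combines the tuning suggestions above; tune per site.
SchedulerParameters=max_rpc_cnt=125,max_sched_time=6,partition_job_depth=1000,sched_min_interval=2000000,batch_sched_delay=20,bf_max_job_test=16000,bf_max_time=300,bf_interval=30,bf_continue,bf_window=15840,bf_resolution=250,max_switch_wait=172800
```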
You can also consider setting bf_max_job_part to limit the backfill scheduler to a certain number of jobs in each partition. This can help ensure the backfill scheduler will evaluate jobs in the lower priority partitions.

Also, prior to Slurm 21.08, the documentation for bf_max_job_test was incorrect (or the parameter was implemented incorrectly):

* Prior to 21.08, this parameter actually reflects the max number of entries in the backfill map.
* In 21.08, we fixed it so this parameter reflects the max number of jobs tested by the backfill scheduler. In addition, we added a new parameter called "bf_node_space_size" which can be set to limit the number of entries in the backfill map.

Hi Bill, I'm just following up. Have you been able to make any changes per my last comment (comment 18)?

I'm just following up one more time before closing this bug to see how things are going.

I'm closing this bug again as infogiven. Feel free to re-open it if you have more questions.