| Summary: | Jobs won't schedule | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Broadley <bill.broadley> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NREL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | one hour of logs showing the lack of jobs running. |
| | requested slurm.conf |
| | requested 20 minutes with extra debugging |
| | Logs with Backfill,BackfillMap enabled. |
Description
Bill Broadley
2021-09-30 11:55:34 MDT
I put all the long jobs (that wouldn't run anyway because of the reservation) on hold. Suddenly all the short jobs started.

Hi Bill,

I have a short answer and a long answer. Let me know if you have any additional questions, or if you are satisfied with the answer and we can close this ticket.

Here's my summary of the long answer: I think backfill was timing out. My recommendations for your SchedulerParameters:

* Increase bf_max_time so that the backfill scheduler can evaluate more jobs. You may want to run with the backfill debug flag turned on so you can see what backfill is doing and when it is ending; this can help you tune bf_max_time.
* Greatly reduce or entirely remove default_queue_depth. (If you remove it, it is set to the default value of 100.)

Here's my long answer, which turned out to be a lot longer than I originally intended, but I tried to be thorough.

It can be hard to diagnose exactly what is happening here. Based on your description of what happened (holding the large jobs and then smaller jobs started running), I can make some guesses, but I don't know without seeing logs:

* Because the larger jobs were higher priority, the main scheduler couldn't start the smaller jobs.
* Backfill was certainly planning nodes for those larger jobs. So why didn't smaller jobs backfill onto the idle nodes?
* Maybe backfill wasn't reaching the smaller jobs. It's possible that backfill stopped before it got to any small jobs that could actually be backfilled. This could definitely happen if the system was busy and reached max_rpc_cnt, which would cause backfill to exit early. Or, more likely, I think backfill timed out before it evaluated the smaller jobs. I recommend increasing bf_max_time.
* It's possible that the slightly longer jobs (the few hundred jobs that could run in short or amo) were getting backfill-planned and were blocking the smaller jobs from running. But I have no way to tell.
I think that backfill timing out is the more likely culprit.

* Do your users submit jobs to multiple partitions? This can also cause surprising results. In fact, it is possible for a job that is submitted to multiple partitions to block itself or other jobs from backfilling in a lower priority partition. I can give you more info on this if you want; we are working on improving the documentation around backfill, in particular how this can happen.

Anyway, are you still seeing problems, or are jobs getting scheduled normally?

In the future, if you are ever wondering what the backfill scheduler is doing, I recommend turning on the backfill-related debug flags:

scontrol setdebugflags +backfill,backfillmap

This will be very verbose, so only keep it on for 5 or 10 minutes, then turn it off with scontrol again:

scontrol setdebugflags -backfill,backfillmap

That will give you a complete picture of what backfill is doing with each job. When you submit a ticket like this, we'd love the slurmctld log file for those 5-10 minutes with the backfill,backfillmap debug flags turned on. In addition, I recommend you run sdiag once per minute during this time period and upload the results. If you're running into longer-term scheduling issues, then running sdiag periodically (say, once per hour) for a few days and uploading the results can also be really helpful. We'd also like the whole slurm.conf, especially the partition definitions, so we can see the PriorityTier of each partition.

Here are my thoughts about your SchedulerParameters:

defer
* This means the "quick cycle" of the main scheduler won't run on job submission or job completion. This can be really helpful on systems with lots of RPCs, since it means the slurmctld can spend more time handling those RPCs instead of constantly running the quick scheduling cycle.
But it does have the downside of potentially delaying the start of some jobs until the quick cycle runs again, which happens every batch_sched_delay seconds.
* As long as you know what this parameter does, it's fine to have it.

default_queue_depth=10750
* This parameter affects how many jobs the quick cycle of the main scheduler will look at. It does not affect backfill, nor does it affect the full cycle of the main scheduler. With this parameter as high as you have it, you are making the quick cycle of the main scheduler take far longer (up to max_sched_time seconds). This could leave the slurmctld with less time to process other RPCs. The full cycle of the main scheduler runs once per sched_interval seconds (60 seconds by default), and the full cycle goes through all the jobs in the queue anyway.
* I personally recommend you don't set this parameter very high, since the main scheduler's full cycle and the backfill scheduler will evaluate all the jobs in the queue (unless they time out before evaluating all the jobs).

max_rpc_cnt=125
* Scheduling won't happen if there are this many pending RPCs, which can cause backfill to exit early. Obviously this is a balancing act: you don't want slurmctld to be overwhelmed by RPCs, but you still want slurmctld to be able to schedule jobs.
* I think this value is fine.

max_sched_time=6
* This extends the time of the main scheduler (both the quick and the full cycles).
* I think this is fine - I'm guessing you increased this so that the main scheduler could evaluate all the jobs. Setting this parameter high enough for the full cycle of the main scheduler to evaluate all the jobs means you really don't need default_queue_depth very high.

partition_job_depth=10500
* The main scheduler will evaluate up to this many jobs in each partition. If it spends all its time on the higher priority partitions, it's possible that it never reaches the lower priority partitions.
sched_max_job_start=200
* The maximum number of jobs that the main scheduling logic will start in any single execution. The default value is zero, which imposes no limit.
* I'm very confused about why you have this value set to 200 while default_queue_depth is set to that ginormous number. Why don't you allow the main scheduler to start more than 200 jobs per cycle?

sched_min_interval=2000000
* This means the quick cycle of the main scheduler won't run more often than every 2 seconds. I think that's fine, although this mostly overlaps with defer. It doesn't entirely overlap with defer, so having them both set is fine, although in practice with defer and batch_sched_delay at 20 seconds, you probably won't be able to tell the difference if you change this parameter.
* If you remove defer, definitely still keep this parameter.
* With this parameter set to 2 seconds, you may not need defer, but it's probably fine either way.

batch_sched_delay=20
* With defer set, this means that jobs can wait up to 20 seconds after being submitted before they are evaluated by the quick cycle of the main scheduler.
* This is fine as long as you know what it's doing.

bf_max_job_test=13000
* I assume you set this high enough to evaluate as many jobs in the queue as possible. That's fine; just know that backfill will bail when it has run for bf_max_time seconds, which by default is the same value as bf_interval. So I suspect that the backfill scheduler probably isn't evaluating anywhere near 13000 jobs, but is timing out well before that.
* I recommend you set bf_max_time.

bf_interval=30
* After a backfill cycle completes, slurmctld will wait this many seconds before starting another backfill cycle.
* This is the default, and I think it's fine.

bf_continue
* This is good to see. Personally, I don't know why anyone would *not* have bf_continue.
bf_window=15840,bf_resolution=250
* Make sure that job time limits are less than or equal to bf_window, or else those jobs can be starved forever.
* bf_resolution should go up if bf_window goes up, but you don't want bf_resolution too high, or else the backfill scheduler may not be able to backfill as many jobs when there are jobs with very short time limits. I think your current value is fine, but it probably wouldn't hurt to raise it as high as 600.

max_switch_wait=172800
* I assume you really don't want jobs to run on more switches than they requested. I think this is fine.

> One possible contributing factor is:
> [2021-09-30T11:45:20.026] error: _het_job_configuring_test: Hetjob leader
> JobId=3060751+1(3060751) job list is NULL
>
> It's pretty weird, since this job is not in the slurm database:
> select min(id_job) from eagle_job_table;
> +-------------+
> | min(id_job) |
> +-------------+
> | 6913483 |
>
> Could that error potentially halt the scheduling decisions as it proceeds
> through the list of eligible jobs?
>
> Any way to get rid of that job?

Do you still see this error in the slurmctld? Are you able to query this job with scontrol?

scontrol show job 3060751

Are you able to cancel the job with scancel?

scancel 3060751

I looked at this error and I don't see how it could affect the scheduling of other jobs, so I think it is unrelated to the scheduling issues.

Hi Bill, have you had a chance to review my response?

Due to a lack of response I'm closing this as infogiven. If you have more questions relating to this ticket, feel free to respond and re-open the ticket.

Sorry for the delay. We followed your suggestions; our current SchedulerParameters:
SchedulerParameters=defer,default_queue_depth=10750,max_rpc_cnt=125,max_sched_time=6,partition_job_depth=10500,sched_max_job_start=200,sched_min_interval=2000000,batch_sched_delay=20,bf_max_time=75,bf_interval=30,bf_continue,bf_window=15840,bf_resolution=250,max_switch_wait=172800
Setting bf_max_time=75 seems to have helped, but we are still having issues, in particular our GPU partition is failing to schedule.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 2-00:00:00 20 alloc r104u[01,03,05,07,09,11,13,15,17,19,21,23,25,35],r105u[09,17,21,23,29,31]
gpu up 2-00:00:00 25 idle r103u[01,03,05,07,09,11,13,15,17,19],r104u[27,29,31,37],r105u[01,03,05,07,11,13,15,19,25,27,35]
So we have 25 nodes idle, but:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7893443+2 gpu rad dbiagion PD 0:00 1 (None)
7893464+2 gpu rad dbiagion PD 0:00 1 (None)
7895859 gpu ga0 ctripp PD 0:00 1 (Priority)
7895860 gpu ga0 ctripp PD 0:00 1 (Priority)
7895861 gpu ga0 ctripp PD 0:00 1 (Priority)
7895862 gpu ga0 ctripp PD 0:00 1 (Priority)
7895863 gpu ga0 ctripp PD 0:00 1 (Priority)
7895865 gpu ga1 ctripp PD 0:00 1 (Priority)
7895866 gpu ga1 ctripp PD 0:00 1 (Priority)
7895867 gpu ga1 ctripp PD 0:00 1 (Priority)
7895868 gpu ga1 ctripp PD 0:00 1 (Priority)
7895869 gpu ga1 ctripp PD 0:00 1 (Priority)
7895873 gpu ga3 ctripp PD 0:00 1 (Priority)
7895874 gpu ga3 ctripp PD 0:00 1 (Priority)
7895875 gpu ga3 ctripp PD 0:00 1 (Priority)
7895876 gpu ga3 ctripp PD 0:00 1 (Priority)
7895877 gpu ga3 ctripp PD 0:00 1 (Priority)
The two "None" jobs have been in that state for at least 24 hours. I bounced the slurmctld with no impact.

I'll upload the last hour of logs. I have SlurmctldDebug=debug set, so hopefully there's enough info to help.

Currently, despite hundreds of jobs queued for the gpu partition, no jobs are running.
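As an aside for anyone triaging a similar backlog: the pending reasons in a squeue listing like the one above can be tallied directly with `squeue -p gpu -t PD -h -o "%r"`, which prints only the Reason column. A minimal sketch; the here-doc sample stands in for live squeue output so the pipeline can be run anywhere:

```shell
# Tally pending jobs by their Reason field, as in the squeue listing above.
# On the cluster the live equivalent would be:
#   squeue -p gpu -t PD -h -o "%r" | sort | uniq -c | sort -rn
tally_reasons() {
    sort | uniq -c | sort -rn
}

# Sample reasons mirroring the listing above (2 "None", 3 "Priority"):
tally_reasons <<'EOF'
None
None
Priority
Priority
Priority
EOF
```

Jobs stuck with reason "None" for a day, as here, deserve separate attention from ordinary "Priority" waits.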
Created attachment 22283 [details]
one hour of logs showing the lack of jobs running.
Let me know if there's any other helpful info.
bf_max_time=75 is still pretty short for a queue as large as 12000 jobs. Try increasing this to at least 5 minutes (bf_max_time=300). Some sites have this set to 10 minutes or longer. Try running sdiag before and after a backfill scheduling cycle and upload the output. You'll be able to see how many jobs the backfill and main schedulers are getting through on average and on the last cycle with sdiag.

Also, try removing "defer". This will enable the quick scheduling cycle to run more often, which might get more jobs started. Unless your slurmctld is constantly overloaded by RPCs, I recommend not using defer. But you can add defer back in when slurmctld is overloaded by RPCs.

Removed defer, and changed bf_max_time=300:

SchedulerParameters=default_queue_depth=10750,max_rpc_cnt=125,max_sched_time=6,partition_job_depth=10500,sched_max_job_start=200,sched_min_interval=2000000,batch_sched_delay=20,bf_max_time=300,bf_interval=30,bf_continue,bf_window=15840,bf_resolution=250,max_switch_wait=172800

I did a scontrol reconfig and then did a service slurmctld stop and start. I haven't seen any change; I'm not sure how often the scheduler cycles are, but I do have two sdiag's from after the restart, 20 minutes apart:

*******************************************************
sdiag output at Tue Nov 16 12:13:08 2021 (1637089988)
Data since      Tue Nov 16 12:06:21 2021 (1637089581)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   7
Jobs completed: 10
Jobs canceled:  0
Jobs failed:    0

Job states ts: Tue Nov 16 12:13:07 2021 (1637089987)
Jobs pending:  12859
Jobs running:  1124

Main schedule statistics (microseconds):
    Last cycle:   5210602
    Max cycle:    5995719
    Total cycles: 20
    Mean cycle:   4438886
    Mean depth cycle:  4413
    Cycles per minute: 3
    Last queue length: 15908

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
    Total backfilled jobs (since last slurm start): 7
    Total backfilled jobs (since last stats cycle start): 7
    Total backfilled heterogeneous job components: 0
    Total cycles: 7
    Last cycle when: Tue Nov 16 12:12:55 2021 (1637089975)
    Last cycle: 23671701
    Max cycle:  32448738
    Mean cycle: 18938441
    Last depth cycle: 2196
    Last depth cycle (try sched): 2196
    Depth Mean: 3382
    Depth Mean (try depth): 3382
    Last queue length: 15908
    Queue length mean: 18184
    Last table size: 100
    Mean table size: 86

Latency for 1000 calls to gettimeofday(): 23 microseconds

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5331 ave_time:72726   total_time:387706843
    REQUEST_PARTITION_INFO           ( 2009) count:1364 ave_time:2220    total_time:3028628
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:1035 ave_time:991780  total_time:1026492758
    REQUEST_FED_INFO                 ( 2049) count:239  ave_time:129     total_time:30955
    REQUEST_JOB_USER_INFO            ( 2039) count:224  ave_time:1065989 total_time:238781704
    REQUEST_JOB_INFO                 ( 2003) count:70   ave_time:556365  total_time:38945575
    REQUEST_STEP_COMPLETE            ( 5016) count:53   ave_time:673844  total_time:35713769
    REQUEST_NODE_INFO                ( 2007) count:51   ave_time:448274  total_time:22862018
    REQUEST_JOB_STEP_CREATE          ( 5001) count:41   ave_time:25571   total_time:1048450
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:41   ave_time:1094391 total_time:44870045
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:20   ave_time:3390545 total_time:67810901
    REQUEST_COMPLETE_PROLOG          ( 6018) count:15   ave_time:121     total_time:1824
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:15   ave_time:137     total_time:2055
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:10   ave_time:1123552 total_time:11235521
    REQUEST_STATS_INFO               ( 2035) count:9    ave_time:142     total_time:1286
    REQUEST_SHARE_INFO               ( 2022) count:9    ave_time:15211   total_time:136907
    REQUEST_UPDATE_NODE              ( 3002) count:3    ave_time:13824   total_time:41474
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:2    ave_time:1985    total_time:3971
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:1    ave_time:1782560 total_time:1782560
    REQUEST_RECONFIGURE              ( 1003) count:1    ave_time:3436794 total_time:3436794

Remote Procedure Call statistics by user
    root ( 0) count:7603 ave_time:204829 total_time:1557318196

*******************************************************
sdiag output at Tue Nov 16 12:28:03 2021 (1637090883)
Data since      Tue Nov 16 12:06:21 2021 (1637089581)
*******************************************************
Server thread count:  4
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 5
Jobs started:   25
Jobs completed: 25
Jobs canceled:  0
Jobs failed:    0

Job states ts: Tue Nov 16 12:27:35 2021 (1637090855)
Jobs pending:  12848
Jobs running:  1124

Main schedule statistics (microseconds):
    Last cycle:   2498447
    Max cycle:    5995719
    Total cycles: 62
    Mean cycle:   4625937
    Mean depth cycle:  4890
    Cycles per minute: 2
    Last queue length: 15892

Backfilling stats
    Total backfilled jobs (since last slurm start): 19
    Total backfilled jobs (since last stats cycle start): 19
    Total backfilled heterogeneous job components: 0
    Total cycles: 24
    Last cycle when: Tue Nov 16 12:27:36 2021 (1637090856)
    Last cycle: 17526348
    Max cycle:  32448738
    Mean cycle: 19814947
    Last depth cycle: 3412
    Last depth cycle (try sched): 3412
    Depth Mean: 3366
    Depth Mean (try depth): 3366
    Last queue length: 15894
    Queue length mean: 15903
    Last table size: 100
    Mean table size: 96

Latency for 1000 calls to gettimeofday(): 23 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO           ( 2009) count:6394 ave_time:770     total_time:4926461
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5581 ave_time:70044   total_time:390920269
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:5328 ave_time:996617  total_time:5309978964
    REQUEST_FED_INFO                 ( 2049) count:809  ave_time:134     total_time:108950
    REQUEST_JOB_USER_INFO            ( 2039) count:761  ave_time:858364  total_time:653215595
    REQUEST_JOB_INFO                 ( 2003) count:202  ave_time:694375  total_time:140263862
    REQUEST_STEP_COMPLETE            ( 5016) count:157  ave_time:701874  total_time:110194292
    REQUEST_NODE_INFO                ( 2007) count:154  ave_time:550409  total_time:84763090
    REQUEST_JOB_STEP_CREATE          ( 5001) count:133  ave_time:25808   total_time:3432504
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:133  ave_time:1114015 total_time:148164044
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:51   ave_time:2489295 total_time:126954083
    REQUEST_COMPLETE_PROLOG          ( 6018) count:48   ave_time:867411  total_time:41635767
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:48   ave_time:135     total_time:6505
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:46   ave_time:7939    total_time:365212
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:26   ave_time:844082  total_time:21946152
    REQUEST_SHARE_INFO               ( 2022) count:26   ave_time:16123   total_time:419198
    REQUEST_STATS_INFO               ( 2035) count:13   ave_time:141     total_time:1835
    REQUEST_UPDATE_NODE              ( 3002) count:13   ave_time:579190  total_time:7529475
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:4    ave_time:2309448 total_time:9237795
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:2    ave_time:915233  total_time:1830467
    REQUEST_RECONFIGURE              ( 1003) count:1    ave_time:3436794 total_time:3436794
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:1    ave_time:27574   total_time:27574
    REQUEST_JOB_READY                ( 4019) count:1    ave_time:51544   total_time:51544

Remote Procedure Call statistics by user
    root ( 0) count:16790 ave_time:362576 total_time:6087658407

> I haven't seen any change, not sure how often the scheduler cycles are,

This is determined by bf_interval (which you have set to 30 seconds).

> but I do have two sdiag's from after the restart and 20 minutes apart:

I'm taking a look at the sdiag.

* Queue length is around 16000-18000.
* Backfill mean time is 19.8 seconds. Max time is 32.4 seconds. That's not even close to the 300 seconds of bf_max_time. Why is it finishing so soon instead of going through all the jobs? You have bf_continue. Maybe the RPC count goes too high?
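As an aside, the RPC tables in sdiag dumps like the ones above can be ranked by total_time to surface the message types consuming the most slurmctld time. A sketch using a few sample lines copied from the first dump; on a live system you would pipe `sdiag` output through the same filter:

```shell
# Rank sdiag RPC entries by total_time (microseconds of slurmctld time spent).
# The sample lines below are copied from the sdiag output in this ticket.
rank_rpcs() {
    awk '/count:.*total_time:/ {
            split($0, f, "total_time:")      # f[2] holds the total_time value
            printf "%12d  %s\n", f[2], $1    # $1 is the message type name
         }' | sort -rn
}

rank_rpcs <<'EOF'
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5331 ave_time:72726 total_time:387706843
REQUEST_PARTITION_INFO ( 2009) count:1364 ave_time:2220 total_time:3028628
REQUEST_NODE_INFO_SINGLE ( 2040) count:1035 ave_time:991780 total_time:1026492758
REQUEST_FED_INFO ( 2049) count:239 ave_time:129 total_time:30955
EOF
```

Sorting by total_time rather than count is what makes REQUEST_NODE_INFO_SINGLE stand out here despite its modest count: its per-call average is nearly a second.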
It might be the case - I'm concerned about these logs:

debug: Note large processing time from _slurm_rpc_dump_node_single: usec=1287017 began=10:06:24.340

Is a script calling squeue or scontrol show job (or another Slurm client command) in a loop? That would be bad. See these lines in sdiag:

MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5331 ave_time:72726 total_time:387706843
REQUEST_PARTITION_INFO ( 2009) count:1364 ave_time:2220 total_time:3028628
REQUEST_NODE_INFO_SINGLE ( 2040) count:1035 ave_time:991780 total_time:1026492758
REQUEST_FED_INFO ( 2049) count:239 ave_time:129 total_time:30955
REQUEST_JOB_USER_INFO ( 2039) count:224 ave_time:1065989 total_time:238781704

REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE are really high compared to the next ones below them. Can you investigate this and stop any script that is doing this?

Can you run one more test for me?

scontrol setdebugflags +backfill,backfillmap
(wait for 20 minutes)
scontrol setdebugflags -backfill,backfillmap

Then can you upload the slurmctld log file? This will show us exactly what the backfill scheduler is doing. Could you also upload your whole slurm.conf?

Created attachment 22287 [details]
requested slurm.conf
slurm.conf uploaded; 20 minute sample being recorded.

Created attachment 22288 [details]
requested 20 minutes with extra debugging
$ date; scontrol setdebugflags +backfill,backfillmap; sleep 1200; scontrol setdebugflags -backfill,backfillmap;date
Tue Nov 16 14:28:27 MST 2021
Tue Nov 16 14:48:28 MST 2021
$
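For the periodic sdiag sampling requested earlier in the ticket, a small POSIX-shell helper can automate the collection. This is a sketch, not part of the ticket: the output directory, cadence, and function name are illustrative:

```shell
#!/bin/sh
# collect_sdiag CMD OUTDIR COUNT INTERVAL
# Runs CMD (normally just "sdiag") COUNT times, INTERVAL seconds apart,
# writing each snapshot to a timestamped file under OUTDIR for upload.
collect_sdiag() {
    cmd=$1; outdir=$2; count=$3; interval=$4
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$count" ]; do
        # $cmd is intentionally unquoted so a multi-word command works
        $cmd > "$outdir/sdiag-$(date +%s)-$i.txt" 2>&1
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
}

# On the cluster you might run, e.g.:
#   collect_sdiag sdiag /tmp/sdiag-snapshots 60 60   # once a minute for an hour
```

Collecting snapshots this way while the backfill debug flags are on gives the before/after views of each backfill cycle that were asked for above.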
I'll investigate the high number of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE. I checked the logs, even with debug on, and they don't seem to log the IPs for those calls. Any suggestions on tracking down who/what is causing those lookups?

I don't see the backfill debugging logs. If setting the debug flags worked, then I should see logs like this:

BACKFILL: _dump_node_space_table:

but I don't see those. Can you double check that the log you uploaded was from that test? Also, when you run that scontrol command, you should see this in the slurmctld log:

[2021-11-17T10:54:06.169] Set DebugFlags to Backfill,BackfillMap

Perhaps the scontrol command timed out?

(In reply to Bill Broadley from comment #13)
> I'll investigate the high number of REQUEST_PARTITION_INFO and
> REQUEST_NODE_INFO_SINGLE. I checked the logs, even with debug on, and it
> doesn't seem to log the IPs for those calls.
>
> Any suggestions on tracking down who/what is causing those lookups?

Look at sdiag. You should see this section:

Remote Procedure Call statistics by user
root ( 0) count:7603 ave_time:204829 total_time:1557318196

The sdiag that you uploaded only had root; I would expect all the other users to be there. Did you remove the other users from the output of sdiag? If not, does that mean that everyone on this cluster runs as root?

I think this issue is really important - the slurmctld log you uploaded is still filled with the "very large time processing" messages.

Created attachment 22302 [details]
Logs with Backfill,BackfillMap enabled.
Not sure what happened, but I think I got it right this time. The first line should show debugging enabled and the last line debugging disabled.

$ zcat log.gz | head -1
[2021-11-17T11:49:16.591] Set DebugFlags to Backfill,BackfillMap
$ zcat log.gz | tail -1
[2021-11-17T12:17:21.829] Set DebugFlags to none

> The sdiag that you uploaded only had root. I would expect all the other users to be there. Did you remove all the other users from the output of sdiag? If not, does that mean that everyone on this cluster runs as root?

Sorry, I trimmed the non-root users; all were substantially less than root. I wasn't sure if this ticket was private and didn't want to leak usernames.

> I think this issue is really important - the slurmctld log you uploaded is still filled with the "very large time processing" messages.

We collect partition and node metrics; I'm guessing that's behind the high RPC counts. Is there a way to track the socket or IP address of those requests so I can verify them? I'm seeing something like 10-20 per second, and if there's one call per node per minute for 2600 nodes, that would be around 40/sec, so I think that could be related.

(In reply to Bill Broadley from comment #16)
> Not sure what happened, but I think I got it right this time. The first
> line should show debugging enabled and the last line debugging disabled.
>
> $ zcat log.gz | head -1
> [2021-11-17T11:49:16.591] Set DebugFlags to Backfill,BackfillMap
> $ zcat log.gz | tail -1
> [2021-11-17T12:17:21.829] Set DebugFlags to none

Thanks, it worked. I'll take a look.

> > The sdiag that you uploaded only had root. I would expect all the other users to be there. Did you remove all the other users from the output of sdiag? If not, does that mean that everyone on this cluster runs as root?
>
> Sorry, I trimmed the non-root users; all were substantially less than root.
> I wasn't sure if this ticket was private and didn't want to leak usernames.

Not a problem, I suspected that was the case.
This bug is public (bugs are public by default), but if you ever want to mark it private you can, by checking the "NREL Bugs" checkbox on this bugzilla page.

Generally, root should have the most RPCs because slurmd/slurmstepd runs as root and they make a lot of RPCs to the slurmctld (such as registration messages, job/step complete messages, and epilog complete messages). So if you ever see a user who has more RPCs than root, that's usually a bad sign.

> > I think this issue is really important - the slurmctld log you uploaded is still filled with the "very large time processing" messages.
>
> We collect partition and node metrics, I'm guessing that's behind the high
> API issues. Is there a way to track the socket or IP address of those
> requests so I can verify them. I'm seeing something like 10-20 per second,
> and if there's one call per 2600 nodes per minute that would be 40/sec, so I
> think that could be related.

Here's how to track it:

scontrol setdebugflags +protocol

WARNING: it will be very verbose. For example, a single squeue sends 3 RPCs and looks like this:

[2021-11-17T12:40:19.889] PROTOCOL: _slurmctld_rpc_mgr: accept() connection from 127.0.0.1:45896
[2021-11-17T12:40:19.890] slurmctld_req: received opcode REQUEST_FED_INFO from 127.0.0.1:45896 uid 1000
[2021-11-17T12:40:19.892] PROTOCOL: _slurmctld_rpc_mgr: accept() connection from 127.0.0.1:45898
[2021-11-17T12:40:19.893] slurmctld_req: received opcode REQUEST_JOB_INFO from 127.0.0.1:45898 uid 1000
[2021-11-17T12:40:19.894] PROTOCOL: _slurmctld_rpc_mgr: accept() connection from 127.0.0.1:45900
[2021-11-17T12:40:19.894] slurmctld_req: received opcode REQUEST_PARTITION_INFO from 127.0.0.1:45900 uid 1000

You can see the IP address is from my own machine, from UID 1000 (which is me). Just make sure to turn the debug flag off as soon as you're done collecting data.

> We collect partition and node metrics, I'm guessing that's behind the high
> API issues.
That does match up with the RPCs that I see in the logs. How do you collect partition and node metrics? And how often?

Alright, I think I found the main culprits for the scheduling issues.

Main scheduling:

[2021-11-17T11:52:26.727] debug: sched: Running job scheduler
[2021-11-17T11:52:32.002] debug: sched: loop taking too long, breaking out

This means that the main scheduler isn't getting through all the jobs. It is in fact timing out after max_sched_time seconds (which you have set to 6). I'm wary about increasing this further because, unlike the backfill scheduler, the main scheduler does not periodically release locks to allow slurmctld to process other RPCs. So instead I recommend reducing partition_job_depth. Right now you have it set to 10500, but it looks likely that the main scheduler is timing out before it can get to all the partitions. Try setting partition_job_depth to something much lower (maybe 1000) and see if that makes a difference. If you still see "sched: loop taking too long, breaking out", try reducing this parameter a bit more. You may need to experiment to see what value works best.

Backfill scheduling:

I found why backfill was exiting early - bf_max_job_test is not set, so it is using the default value of 100. This is an example of the log I'm talking about:

[2021-11-17T11:56:37.757] backfill: table size limit of 100 reached
[2021-11-17T11:56:37.765] backfill: completed testing 3194(2) jobs, usec=191409

So I recommend increasing bf_max_job_test to something as large as the queue. Once that is done, you should see the backfill scheduler take a lot longer to run and evaluate many more jobs.

I see that in your original post on this bug, bf_max_job_test=13000. But in your most recent post with SchedulerParameters, bf_max_job_test was not set anymore. Was this intentionally removed? You should have both bf_max_job_test and bf_max_time high enough so that all the jobs in the queue (or as many as you want) will be evaluated.
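Pulling together the recommendations in this thread (raise bf_max_time, restore bf_max_job_test, drop defer, and shrink default_queue_depth and partition_job_depth), a revised SchedulerParameters line might look like the following. This is an illustrative sketch, not a prescription: every value needs site-specific tuning, and the bf_max_job_test value here is simply "larger than the ~13000-job queue":

```conf
# Illustrative only - combines the tuning suggestions above; tune per site.
SchedulerParameters=max_rpc_cnt=125,max_sched_time=6,partition_job_depth=1000,sched_min_interval=2000000,batch_sched_delay=20,bf_max_job_test=16000,bf_max_time=300,bf_interval=30,bf_continue,bf_window=15840,bf_resolution=250,max_switch_wait=172800
```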
You can also consider setting bf_max_job_part to limit the backfill scheduler to a certain number of jobs in each partition. This can help ensure the backfill scheduler will evaluate jobs in the lower priority partitions.

Also, prior to Slurm 21.08, the documentation for bf_max_job_test was incorrect (or the parameter was implemented incorrectly):

* Prior to 21.08, this parameter actually reflects the max number of entries in the backfill map.
* In 21.08, we fixed it so this parameter reflects the max number of jobs tested by the backfill scheduler. In addition, we added a new parameter called "bf_node_space_size" which can be set to limit the number of entries in the backfill map.

Hi Bill, I'm just following up. Have you been able to make any changes per my last comment (comment 18)?

I'm just following up one more time before closing this bug to see how things are going.

I'm closing this bug again as infogiven. Feel free to re-open it if you have more questions.