Ticket 17961 - sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
Summary: sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 22.05.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-10-19 14:27 MDT by RAMYA ERANNA
Modified: 2023-11-16 09:47 MST
CC List: 1 user

See Also:
Site: SLAC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
sdiag (8.15 KB, text/plain)
2023-10-26 12:13 MDT, RAMYA ERANNA
Details
slurm.conf (6.74 KB, text/plain)
2023-10-26 12:13 MDT, RAMYA ERANNA
Details
slurmctld logs (47.32 MB, text/plain)
2023-10-26 12:14 MDT, RAMYA ERANNA
Details

Description RAMYA ERANNA 2023-10-19 14:27:40 MDT
Hi Team,

We see these messages in the slurmctld logs. Could you please help us understand what max_rpc_cnt is and how to configure it?

Could you also explain the SchedulerParameters we set in our slurm.conf?

SchedulerParameters=Ignore_NUMA,batch_sched_delay=10,bf_interval=60,bf_continue,bf_max_job_test=1000,bf_max_job_user=60,bf_resolution=600,pack_serial_at_end,sched_min_interval=500000,sched_max_job_start=1500 

[2023-10-19T12:05:35.515] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:36.205] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:36.852] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:37.645] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:38.225] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:38.945] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:39.658] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:40.321] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:40.963] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:41.559] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:42.126] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:42.677] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:43.252] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:43.819] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:05:44.559] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-10-19T12:57:58.273] sched: 153 pending RPCs at cycle end, consider configuring max_rpc_cnt


Thank you
Ramya
Comment 1 Jason Booth 2023-10-20 15:15:02 MDT
Please attach the output of "sdiag" and "sinfo". Are these messages seen more when you have large amounts of job submissions, or at other times such as job completions?

Please also attach your current slurm.conf and slurmctld.log file.
Comment 3 RAMYA ERANNA 2023-10-26 12:13:28 MDT
Hi,

A large number of jobs were submitted today, and we noticed these messages again:

[reranna@sdfslurmctld ~]$ sudo cat /var/log/slurm/slurmctld.log | grep -E "2023-10-26T10" | grep -i max_rpc | wc -l
480
[reranna@sdfslurmctld ~]$



Along with the above message, we noticed these error messages. Kindly help us analyze them.


[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES cpu grp_used_tres_run_secs underflow, tried to remove 211680 seconds when only 191520 remained.
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES mem grp_used_tres_run_secs underflow, tried to remove 867041280 seconds when only 784465920 remained.
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES node grp_used_tres_run_secs underflow, tried to remove 1764 seconds when only 1596 remained.
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES billing grp_used_tres_run_secs underflow, tried to remove 211680 seconds when only 191520 remained.

Thanks
Ramya
Comment 4 RAMYA ERANNA 2023-10-26 12:13:41 MDT
Created attachment 32966 [details]
sdiag
Comment 5 RAMYA ERANNA 2023-10-26 12:13:53 MDT
Created attachment 32967 [details]
slurm.conf
Comment 6 RAMYA ERANNA 2023-10-26 12:14:37 MDT
Created attachment 32968 [details]
slurmctld logs
Comment 7 RAMYA ERANNA 2023-10-26 12:15:50 MDT
[reranna@sdfslurmctld ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
roma*        up 10-00:00:0      1   mix- sdfrome033
roma*        up 10-00:00:0      1  down$ sdfrome003
roma*        up 10-00:00:0      2 drain$ sdfrome[037,041]
roma*        up 10-00:00:0      1  down* sdfrome004
roma*        up 10-00:00:0     13   comp sdfrome[010-011,015-018,023-024,026,034-036,038]
roma*        up 10-00:00:0      1  drain sdfrome005
roma*        up 10-00:00:0     16    mix sdfrome[007,012-014,019-021,027-029,031-032,039-040,042-043]
roma*        up 10-00:00:0      6   idle sdfrome[006,008-009,022,025,030]
milano       up 10-00:00:0     22   resv sdfmilan[007-008,011,014-019,021-022,033,047,051,059,062,067,071,101,103,106,128]
milano       up 10-00:00:0     18   plnd sdfmilan[034,036-038,043-046,053-055,057-058,060-061,063-065]
milano       up 10-00:00:0      4 drain$ sdfmilan[009,041,049,112]
milano       up 10-00:00:0      1   comp sdfmilan220
milano       up 10-00:00:0      1  drain sdfmilan031
milano       up 10-00:00:0     89    mix sdfmilan[001-006,010,012-013,020,023-030,032,035,039-040,042,048,050,052,056,066,068-070,072,102,104-105,107-111,113-127,129-131,201-219,221-232]
ampere       up 10-00:00:0     23    mix sdfampere[001-023]
[reranna@sdfslurmctld ~]$
Comment 8 Felip Moll 2023-10-27 09:07:17 MDT
In this case it seems you received a burst of job submissions and the RPCs simply accumulated. The sdiag output from the moment you captured it shows that the backfill scheduler takes a mean of 3.99 seconds to complete a cycle (Mean cycle), while at one point in time, probably when you got the warning, a cycle took a total of 58 seconds (Max cycle). That is a long time if, at that moment, the controller is receiving lots of job-completion RPCs, for example. The main scheduler's max cycle is 0.4 seconds, so we will not discuss it here.
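For reference, you can pull out the cycle statistics mentioned above directly from sdiag output. This is just a sketch: on the controller you would pipe sdiag itself (`sdiag | grep -E 'cycle'`); here we demonstrate on a captured snippet, using the values from your attachment expressed in microseconds, which is the unit sdiag reports cycle times in:

```shell
# Write a captured fragment of sdiag output (illustrative values: 58 s max, 3.99 s mean)
cat > sdiag.txt <<'EOF'
Backfilling stats:
	Max cycle:    58000000
	Mean cycle:   3990000
EOF
# Extract the backfill cycle statistics relevant to this ticket
grep -E 'Max cycle|Mean cycle' sdiag.txt
```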

Your bf_interval is 60 seconds, and during those 60 seconds the backfill scheduler can run without releasing any locks, letting RPCs accumulate. If these submission bursts happen too often, you may want to reduce bf_interval a little.

But a better option is to set max_rpc_cnt. It instructs the backfill scheduler to defer work when there are too many RPCs pending. How long the scheduler defers depends on this parameter: it will "sleep" at least until the pending RPC count has dropped below max_rpc_cnt divided by 10. For example, with a value of 100 it will defer work until the count has dropped to 10 RPCs. As you know, there can be at most 256 pending operations to be served at a time (equivalent to MAX_SERVER_THREADS). This parameter should help the responsiveness of the controller:

max_rpc_cnt=100

We then have bf_yield_interval and bf_yield_sleep. They mean, respectively: every X seconds release the locks to permit RPCs and other operations to be processed, and yield those locks for Y seconds. The defaults are 2 seconds and 0.5 seconds. If you continue to see responsiveness issues, you can make the scheduler "stop" more frequently by decreasing bf_yield_interval, or let it sleep for more time by increasing bf_yield_sleep. Don't touch them for now; if max_rpc_cnt is not enough (I think it will be ok), use these parameters.

bf_yield_interval=
bf_yield_sleep=


Please start by adding max_rpc_cnt=150, for example, and see what happens over some days. If the issue keeps repeating, decrease it. If you are still not able to fix the problem, use the bf_yield* parameters. Come back if you still have questions.
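To make this concrete, here is a sketch of the resulting slurm.conf line, based on the SchedulerParameters you posted with only max_rpc_cnt appended. The bf_yield_* values shown in the comment are illustrative examples only (a shorter interval and longer sleep than the defaults), not tested recommendations:

```
# Existing SchedulerParameters with max_rpc_cnt added (sketch):
SchedulerParameters=Ignore_NUMA,batch_sched_delay=10,bf_interval=60,bf_continue,bf_max_job_test=1000,bf_max_job_user=60,bf_resolution=600,pack_serial_at_end,sched_min_interval=500000,sched_max_job_start=1500,max_rpc_cnt=150
# Fallback knobs if max_rpc_cnt alone is not enough (values in microseconds; example values):
#   ...,bf_yield_interval=1000000,bf_yield_sleep=1000000
```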

Does it sound good?


> Along with the above message, we noticed these error messages. Kindly help
> to analyze. 
> 
> 
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702:
> assoc 6399 TRES cpu grp_used_tres_run_secs underflow, tried to remove 211680
> seconds when only 191520 remained.
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702:
> assoc 6399 TRES mem grp_used_tres_run_secs underflow, tried to remove
> 867041280 seconds when only 784465920 remained.
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702:
> assoc 6399 TRES node grp_used_tres_run_secs underflow, tried to remove 1764
> seconds when only 1596 remained.
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702:
> assoc 6399 TRES billing grp_used_tres_run_secs underflow, tried to remove
> 211680 seconds when only 191520 remained.

About this: it is a mostly harmless error. We have bug 7390 open, and some others, but we still don't have a solution. It usually happens when there have been crashes or similar. Is this the first time you see it? If it happens often, please open a new bug or put yourself in CC on bug 7390 and related ones (check the See Also field in Bugzilla).
Comment 9 Felip Moll 2023-11-07 08:48:37 MST
Hi Ramya,

Please let me know if my suggestions for tuning the scheduler have been applied and how things look now.

Thanks
Comment 10 Felip Moll 2023-11-16 09:47:18 MST
Ramya, I am marking this bug as timedout. Please reopen when you have more feedback.

Thanks