| Summary: | sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | RAMYA ERANNA <reranna> |
| Component: | Configuration | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7390 | ||
| Site: | SLAC | Slinky Site: | --- |
| Attachments: | sdiag, slurm.conf, slurmctld logs | | |
Description
RAMYA ERANNA 2023-10-19 14:27:40 MDT

---

Please attach the output of "sdiag" and "sinfo". Are these messages seen more when you have large amounts of job submissions, or at other times such as job completions? Please also attach your current slurm.conf and slurmctld.log file.

---

Hi,

A large number of jobs were submitted today. Again we noticed these messages:

```
[reranna@sdfslurmctld ~]$ sudo cat /var/log/slurm/slurmctld.log | grep -E "2023-10-26T10" | grep -i max_rpc | wc -l
480
[reranna@sdfslurmctld ~]$
```

Along with the above message, we noticed these error messages. Kindly help to analyze.

```
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES cpu grp_used_tres_run_secs underflow, tried to remove 211680 seconds when only 191520 remained.
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES mem grp_used_tres_run_secs underflow, tried to remove 867041280 seconds when only 784465920 remained.
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES node grp_used_tres_run_secs underflow, tried to remove 1764 seconds when only 1596 remained.
[2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES billing grp_used_tres_run_secs underflow, tried to remove 211680 seconds when only 191520 remained.
```

Thanks
Ramya

Created attachment 32966 [details]
sdiag
Created attachment 32967 [details]
slurm.conf
Created attachment 32968 [details]
slurmctld logs
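As a side note, the grep shown earlier can be bucketed by hour to see whether the warnings line up with submission bursts. A minimal sketch, assuming the timestamp format visible in the log excerpts (`[2023-10-26T10:45:50.306] …`); the helper name is ours:

```shell
# Count "max_rpc" warnings per hour in a slurmctld log.
# Relies on timestamps of the form [2023-10-26T10:45:50.306] at the
# start of each line: characters 2-14 give the date plus the hour.
count_rpc_warnings_per_hour() {
    grep -i 'max_rpc' "$1" | cut -c2-14 | sort | uniq -c
}
```

For example, `count_rpc_warnings_per_hour /var/log/slurm/slurmctld.log` prints one line per hour with the warning count, which makes bursts easy to spot.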
```
[reranna@sdfslurmctld ~]$ sinfo
PARTITION AVAIL TIMELIMIT  NODES STATE  NODELIST
roma*     up    10-00:00:0     1 mix-   sdfrome033
roma*     up    10-00:00:0     1 down$  sdfrome003
roma*     up    10-00:00:0     2 drain$ sdfrome[037,041]
roma*     up    10-00:00:0     1 down*  sdfrome004
roma*     up    10-00:00:0    13 comp   sdfrome[010-011,015-018,023-024,026,034-036,038]
roma*     up    10-00:00:0     1 drain  sdfrome005
roma*     up    10-00:00:0    16 mix    sdfrome[007,012-014,019-021,027-029,031-032,039-040,042-043]
roma*     up    10-00:00:0     6 idle   sdfrome[006,008-009,022,025,030]
milano    up    10-00:00:0    22 resv   sdfmilan[007-008,011,014-019,021-022,033,047,051,059,062,067,071,101,103,106,128]
milano    up    10-00:00:0    18 plnd   sdfmilan[034,036-038,043-046,053-055,057-058,060-061,063-065]
milano    up    10-00:00:0     4 drain$ sdfmilan[009,041,049,112]
milano    up    10-00:00:0     1 comp   sdfmilan220
milano    up    10-00:00:0     1 drain  sdfmilan031
milano    up    10-00:00:0    89 mix    sdfmilan[001-006,010,012-013,020,023-030,032,035,039-040,042,048,050,052,056,066,068-070,072,102,104-105,107-111,113-127,129-131,201-219,221-232]
ampere    up    10-00:00:0    23 mix    sdfampere[001-023]
[reranna@sdfslurmctld ~]$
```

---

In this case it seems you received a burst of job submissions and the RPCs simply accumulated. The sdiag output from the moment you captured it shows that the backfill scheduler takes a mean of 3.99 seconds to complete a cycle (Mean cycle), while at one point in time, probably when you got the warning, a cycle took a total of 58 seconds (Max cycle). That is a long time if, at that moment, you are also receiving lots of job-completion RPCs, for example. The main scheduler's max cycle is 0.4 seconds, so we will not discuss it here.

Your bf_interval is 60 seconds, and during those 60 seconds the backfill scheduler can hold its locks without releasing them, letting RPCs accumulate. If these submission bursts happen too often, you might want to reduce bf_interval a little. But a better option is to set max_rpc_cnt, which instructs backfill to defer work when too many RPCs are pending.
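The cycle figures quoted here come from sdiag's scheduler sections. A small sketch for pulling just those lines out of saved sdiag output (the helper name is ours; the field labels follow sdiag's usual `Last cycle:` / `Mean cycle:` / `Max cycle:` lines):

```shell
# Extract the scheduler cycle-time lines from a file of `sdiag` output,
# so the main and backfill cycle figures are easy to compare side by side.
backfill_cycle_stats() {
    grep -E 'Last cycle:|Mean cycle:|Max cycle:' "$1"
}
```

Something like `sdiag > /tmp/sdiag.out && backfill_cycle_stats /tmp/sdiag.out` shows both the main-scheduler and backfill cycle times; note sdiag reports these in microseconds, so a 58-second cycle appears as 58000000.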
How long the scheduler defers work depends on this parameter: it will "sleep" until the number of pending RPCs drops below max_rpc_cnt divided by 10. For example, with a value of 100 it defers work until the pending count has fallen to 10 RPCs. As you know, at most 256 pending operations can be served at a time (equivalent to MAX_SERVER_THREADS). This parameter should help the responsiveness of the controller:

max_rpc_cnt=100

Then we have bf_yield_interval and bf_yield_sleep. They mean: every X seconds, release the locks so RPCs and other operations can be processed, and yield those locks for Y seconds, respectively. The defaults are 2 seconds and 0.5 seconds. If you continue seeing responsiveness issues, you can make the scheduler "stop" more frequently by decreasing bf_yield_interval, or let it sleep longer by increasing bf_yield_sleep. Don't touch them for now; but if max_rpc_cnt is not enough (I think it will be OK), use these parameters:

bf_yield_interval=
bf_yield_sleep=

Please start by setting max_rpc_cnt to, for example, 150, and see what happens over some days. If the issue keeps repeating often, decrease it. If you are still not able to fix the problem, use the bf_yield* parameters. Come back if you still have questions. Does that sound good?

> Along with the above message, we noticed these error messages. Kindly help
> to analyze.
>
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES cpu grp_used_tres_run_secs underflow, tried to remove 211680 seconds when only 191520 remained.
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES mem grp_used_tres_run_secs underflow, tried to remove 867041280 seconds when only 784465920 remained.
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES node grp_used_tres_run_secs underflow, tried to remove 1764 seconds when only 1596 remained.
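Put together, the recommendation above would look something like this in slurm.conf (a sketch; keep any SchedulerParameters values you already have, and note that the bf_yield_* values are expressed in microseconds):

```
# slurm.conf sketch: defer backfill work once more than 150 RPCs are
# pending; backfill then sleeps until the count drops to ~15 (150/10).
SchedulerParameters=max_rpc_cnt=150

# Only if max_rpc_cnt alone is not enough. The defaults are
# bf_yield_interval=2000000 (2 s) and bf_yield_sleep=500000 (0.5 s);
# e.g. halve the interval or double the sleep:
#SchedulerParameters=max_rpc_cnt=150,bf_yield_interval=1000000,bf_yield_sleep=1000000
```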
> [2023-10-26T10:45:50.306] error: _handle_assoc_tres_run_secs: job 31947702: assoc 6399 TRES billing grp_used_tres_run_secs underflow, tried to remove 211680 seconds when only 191520 remained.

About this: it is a mostly harmless error. We have bug 7390 open, and some others, but we still don't have a solution. It tends to appear when there have been crashes or similar events. Is this the first time you see it? If it happens often, please open a new bug or put yourself in CC on bug 7390 and related ones (check the See Also field in bugzilla).

---

Hi Ramya,

Please let me know whether my suggestions for tuning the scheduler have been applied and how things look now.

---

Thanks Ramya, I am marking this bug as timedout. Please reopen when you have more feedback.

Thanks