Ticket 10287

Summary: High cpu load after 20.11.0 upgrade
Product: Slurm Reporter: lhuang
Component: slurmdbd Assignee: Marshall Garey <marshall>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: cblack
Version: 20.11.0   
Hardware: Linux   
OS: Linux   
Site: NY Genome
Attachments: slurmdbd
Updated slurmdbd log
slurmctld log

Description lhuang 2020-11-24 15:33:48 MST
Created attachment 16812 [details]
slurmdbd

Noticed these entries in the logs, and the mysql process has been running at 100% CPU for the past few hours.

[2020-11-23T17:32:51.048] Unable to remove pidfile '/var/run/slurmdbd.pid': Permission denied
[2020-11-23T17:35:16.448] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.65-MariaDB
[2020-11-23T17:35:17.229] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded
[2020-11-23T17:35:18.042] slurmdbd version 20.11.0 started
[2020-11-23T18:00:08.340] error: We have more allocated time than is possible (2168 > 1464) for cluster pe2(1) from 2020-11-23T17:00:00 - 2020-11-23T18:00:00 tres 1002
[2020-11-23T18:00:08.347] Warning: Note very large processing time from hourly_rollup for pe2: usec=8271877 began=18:00:00.075
[2020-11-23T19:00:08.364] error: We have more allocated time than is possible (2168 > 1464) for cluster pe2(1) from 2020-11-23T17:00:00 - 2020-11-23T18:00:00 tres 1002
[2020-11-23T19:00:08.388] Warning: Note very large processing time from hourly_rollup for pe2: usec=7488833 began=19:00:00.899
[2020-11-23T20:00:07.857] error: We have more allocated time than is possible (2168 > 1464) for cluster pe2(1) from 2020-11-23T17:00:00 - 2020-11-23T18:00:00 tres 1002
[2020-11-23T20:00:07.913] Warning: Note very large processing time from hourly_rollup for pe2: usec=7782274 began=20:00:00.131
[2020-11-23T21:00:07.906] error: We have more allocated time than is possible (2168 > 1464) for cluster pe2(1) from 2020-11-23T17:00:00 - 2020-11-23T18:00:00 tres 1002
[2020-11-23T21:00:07.969] Warning: Note very large processing time from hourly_rollup for pe2: usec=7304279 began=21:00:00.665
[2020-11-24T03:49:01.254] Logrotate signal (SIGUSR2) received
[2020-11-24T13:00:04.065] error: We have more allocated time than is possible (2168 > 1464) for cluster pe2(1) from 2020-11-23T17:00:00 - 2020-11-23T18:00:00 tres 1002
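
The repeated rollup warnings above can be triaged quickly by pulling the per-rollup processing times out of the log. A minimal sketch (the regex and sample lines are illustrative, not part of Slurm itself):

```python
import re

# Sample lines copied from the slurmdbd log excerpt above.
LOG = """\
[2020-11-23T18:00:08.347] Warning: Note very large processing time from hourly_rollup for pe2: usec=8271877 began=18:00:00.075
[2020-11-23T19:00:08.388] Warning: Note very large processing time from hourly_rollup for pe2: usec=7488833 began=19:00:00.899
[2020-11-23T20:00:07.913] Warning: Note very large processing time from hourly_rollup for pe2: usec=7782274 began=20:00:00.131
"""

PATTERN = re.compile(r"hourly_rollup for (\S+): usec=(\d+)")

def rollup_seconds(log_text):
    """Return (cluster, seconds) for each hourly_rollup warning found."""
    return [(m.group(1), int(m.group(2)) / 1e6)
            for m in PATTERN.finditer(log_text)]

for cluster, secs in rollup_seconds(LOG):
    print(f"{cluster}: rollup took {secs:.1f}s")
```

Every rollup in the excerpt is taking 7-8 seconds, which is what the "very large processing time" warning is flagging.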
Comment 2 Marshall Garey 2020-11-25 09:47:28 MST
Looking at slurm.conf on site_info, slurmdbd and slurmctld run on the same machine. Is that correct? Not that this is a problem; I'm just checking.

Questions that come to mind:

* Is the mysql process running at 100% all the time? Or only during rollups (once per hour for a few minutes)?

* Which mysql queries are actually running? Can you upload the output of this command from the mysql client?

mysql> show full processlist;

* I'd like to find out why we see the "more time than is possible" errors. I'll ask for more detailed logs once we figure out the first issue.
Comment 3 Chris Black 2020-11-25 13:09:28 MST
Created attachment 16834 [details]
Updated slurmdbd log

Updated slurmdbd log attached.
Lingering jobs that were not passing their status correctly to slurmctld have been reset; load on pe2-slurm01 has stayed low (<1.0) since 11pm US/Eastern on 11/24.
The load may have been related to the now-corrected https://bugs.schedmd.com/show_bug.cgi?id=10275
Comment 4 Chris Black 2020-11-25 13:10:03 MST
Created attachment 16835 [details]
slurmctld log
Comment 5 Marshall Garey 2020-11-25 13:42:10 MST
Thanks for the update. I think the "more allocated time than is possible" errors were also due to bug 10275: job completions were not being recognized, so jobs ran longer than they were supposed to. Any jobs started after you upgraded slurmctld to 20.11 but before you applied the fix from bug 10275 will probably also trigger these errors, but once those jobs are all done, I expect the errors to go away.

Let's give it another week and see if these errors go away once the buggy jobs from 10275 are done.

You can also check if there are runaway jobs with this command:

sacctmgr show runaway
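
Conceptually, a runaway job is one the accounting database still shows as running (no end time recorded) while slurmctld no longer tracks it. A toy sketch of that check (the data layout below is invented for illustration and is not sacctmgr's actual implementation):

```python
# Hypothetical DB records: job_id -> end_time, where None means
# the database still thinks the job is running.
db_jobs = {101: None, 102: 1606200000, 103: None}

# Job IDs slurmctld currently knows about (hypothetical).
active_in_ctld = {101}

def find_runaway(db_jobs, active):
    """Jobs with no recorded end time that the controller no longer tracks."""
    return sorted(job_id for job_id, end in db_jobs.items()
                  if end is None and job_id not in active)

print(find_runaway(db_jobs, active_in_ctld))  # with this toy data: [103]
```

With the sample data, job 101 is still legitimately running, job 102 finished cleanly, and job 103 is the runaway that sacctmgr would offer to fix.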
Comment 6 lhuang 2020-11-30 12:51:36 MST
It looks like the error has cleared and we no longer see the high load.