Ticket 12263

Summary: sreport cluster AccountUtilizationByUser over long date range puts slurmdbd/MariaDB daemon into a bad state
Product: Slurm Reporter: Troy Baer <troy>
Component: slurmdbd Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: tdockendorf
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Ohio State OSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: RHEL
Machine Name: Owens, Pitzer CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmdbd.conf with DB password removed
Output of "sacctmgr show stats"

Description Troy Baer 2021-08-12 09:09:12 MDT
We have an internal workload analysis workflow that we would like to have do the following:

CLUSTER=owens
START=2020-12-15T00:00:00
END=2021-03-15T23:59:59
sreport cluster AccountUtilizationByUser -M $CLUSTER -t percent start=$START end=$END format=cluster,account%45,login,used --noheader --parsable2

As our numbers of accounts/users/associations/jobs in the database go up, we're finding that running this sreport command over date ranges much longer than a month can result in slurmdbd (or more accurately its backing MariaDB database server) going into a state where it's overwhelmed and unusable.  In at least one situation, this resulted in a period where users couldn't submit jobs because the backing database was so wedged that slurmdbd couldn't check that the users' job account requests were valid.

Is this just an inevitable consequence of having a large number of accounts, users, associations, and jobs, or is it possible that we've misconfigured something on either the slurmdbd or mariadb sides?
Comment 1 Albert Gil 2021-08-18 09:24:03 MDT
Hi Troy,

> We have an internal workload analysis workflow that we would like to have do
> the following:
> 
> CLUSTER=owens
> START=2020-12-15T00:00:00
> END=2021-03-15T23:59:59
> sreport cluster AccountUtilizationByUser -M $CLUSTER -t percent start=$START
> end=$END format=cluster,account%45,login,used --noheader --parsable2

This makes perfect sense; that's what sreport is made for.

> As our numbers of accounts/users/associations/jobs in the database goes up,
> we're finding that running this sreport command over longer date ranges much
> longer than a month can result in slurmdbd (or more accurately its backing
> MariaDB database server) going into a state where it's overwhelmed and
> unusable.

This shouldn't happen, at least not if you run such sreport commands at a reasonable frequency. If you ask for a whole year of info and do that more than once per second, then yes, the DB will suffer performance degradation.
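As a hypothetical illustration of query size vs. frequency (a sketch of my own, not something from this ticket): if a single multi-month query ever did prove too heavy, the range could be split into monthly chunks. This assumes GNU date and only echoes the sreport commands rather than running them:

```shell
# Split a long sreport range into month-sized chunks (illustrative only).
CLUSTER=owens
START=2020-12-15
END=2021-03-15
cur=$START
n=0
first=""
while [ "$(date -d "$cur" +%s)" -lt "$(date -d "$END" +%s)" ]; do
  next=$(date -d "$cur +1 month" +%Y-%m-%d)
  # Clamp the final chunk so it never runs past END.
  if [ "$(date -d "$next" +%s)" -gt "$(date -d "$END" +%s)" ]; then
    next=$END
  fi
  echo "sreport cluster AccountUtilizationByUser -M $CLUSTER -t percent start=${cur}T00:00:00 end=${next}T23:59:59 format=cluster,account%45,login,used --noheader --parsable2"
  first=${first:-$cur}
  cur=$next
  n=$((n+1))
done
```

Each chunk's usage could then be aggregated afterwards; whether that is ever necessary depends on the site, and normally a single sreport over the full range should be fine.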

In our experience, the usual cause of performance degradation on the DB is not that the queries become too big (Slurm is designed to handle *a lot* of users, jobs, nodes...), but that too many queries (maybe not even that big) are sent from some uncontrolled source/script/webpage or similar.

You can run "sacctmgr show stats" to see some metrics that can help you identify whether there is some uncontrolled source of queries.

> In at least one situation, this has resulted in a period where
> users couldn't submit jobs because the backing database was so wedged that
> slurmdbd couldn't check that the users' job account requests were valid.

This shouldn't happen either.
Note that Slurm is designed so that slurmctld can keep running just fine even if slurmdbd is down.
The controller (slurmctld) always keeps a cache of all the information it needs to operate (users, accounts, QOS, etc.), and its synchronization with slurmdbd is asynchronous.

If you face such a problem again, please report it so we can try to identify the root cause of the issue.

> Is this just an inevitable consequence of having a large number of accounts,
> users, associations, and jobs, or is it possible that we've misconfigured
> something on either the slurmdbd or mariadb sides?

As mentioned above, regarding the DB backend, I would say that a large number of accounts, users, jobs, etc. shouldn't be a problem, but too large a number of queries could be.
And regarding slurmctld not allowing job submissions, I would say that degraded DB performance shouldn't be the root cause either.
I don't see what kind of misconfiguration could lead to those scenarios either, but if you want, please provide your .conf files so we can double-check them. You can send the logs too, to see if we detect anything wrong going on.

That said, if the slurmctld and slurmdbd hosts don't have enough RAM and start swapping, then yes, that could lead to what you described.

In summary, the good news is that you should be able to run the type of queries you want, no matter how big your cluster becomes, and we'll help you if you run into problems! ;-)

Regards,
Albert
Comment 2 Troy Baer 2021-08-18 14:27:01 MDT
I'm going to add a task for our next scheduled downtime (9/28) to run "sreport cluster AccountUtilizationByUser" over a long date range under controlled circumstances outside regular production to verify that this is repeatable.

Sanitized slurmdbd.conf file forthcoming.
Comment 3 Troy Baer 2021-08-18 14:35:20 MDT
Created attachment 20898 [details]
slurmdbd.conf with DB password removed
Comment 4 Albert Gil 2021-08-19 04:17:08 MDT
Hi Troy,

> Sanitized slurmdbd.conf file forthcoming.

Configuration looks fine.
I don't see any kind of misconfiguration.

Just to mention the options that could have some relation to slow sreports:
- PrivateData=usage
- The SQL backend on a different host (slurmdbd01.infra.osc.edu and dbsys02.infra.osc.ed).

Please note that both options are totally fine; I'm just mentioning them because they may be relevant to reproducing the issue.

If you want, please also send the rest of your .conf files.
Also, please make sure that your SQL backend has the recommended configuration for innodb_buffer_pool_size, innodb_log_file_size and innodb_lock_wait_timeout:

https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
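For reference, a my.cnf fragment along the lines of what that page recommends might look like the following; the values here are illustrative, the linked page takes precedence, and the buffer pool in particular should be sized to the DB host's available RAM:

```ini
# Illustrative [mysqld] settings for a slurmdbd backend (check the
# linked accounting page for the authoritative recommendations).
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```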

> I'm going to add a task for our next scheduled downtime (9/28) to run
> "sreport cluster AccountUtilizationByUser" over a long date range under
> controlled circumstances outside regular production to verify that this is
> repeatable.

If you prefer to wait until the 9/28 downtime, then I would suggest temporarily closing this ticket as infogiven and, if you reproduce the issue, just reopening it.
But it shouldn't be a problem to run some sreport queries now (maybe not that big, if you prefer), and then share the logs and metrics you gather from them so we can check that they all look good.

What do you prefer?
Have you run "sacctmgr show stats"?

Regards,
Albert
Comment 5 Troy Baer 2021-08-19 06:43:27 MDT
Created attachment 20911 [details]
Output of "sacctmgr show stats"
Comment 6 Albert Gil 2021-08-19 08:40:10 MDT
Hi Troy,

The attached stats look quite fine.
The only thing worth mentioning is the user wxops.
As you can see, these are the top users consuming query time on slurmdbd:

	root                (         0) count:1142072  ave_time:78903    total_time:90112931860
	wxops               (     30211) count:86911734 ave_time:559      total_time:48586713791
	slurm               (        93) count:14826359 ave_time:2200     total_time:32621788850
	prometheus          (        94) count:207024   ave_time:100664   total_time:20839965654
	troy                (      6624) count:3752     ave_time:3528476  total_time:13238845490

The "slurm" user is expected to be there, along with some admins, like I guess "troy" is.. ;-)
Based on the names, I guess the others are also admin/service accounts.
What seems suspicious to me is the *count* value for the user wxops; note that it's issuing a lot of (small) queries.
I don't know how often it does that, but I can't rule out that it may be triggering some sort of "query bomb" that leads to degradation of slurmdbd/SQL over short periods.
Maybe some script/code run by wxops has an overly demanding loop or similar?
Maybe that triggered the issues that you mentioned initially?
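As an aside, this kind of heavy client can be spotted mechanically. The following is a hypothetical helper (not a Slurm tool) that ranks users by RPC count from saved "sacctmgr show stats"-style output; the field layout is assumed to match the excerpt above, and the sample lines are taken from it:

```shell
# Sample stats lines, as they appear in "sacctmgr show stats" output.
cat > stats.txt <<'EOF'
root                (         0) count:1142072  ave_time:78903    total_time:90112931860
wxops               (     30211) count:86911734 ave_time:559      total_time:48586713791
slurm               (        93) count:14826359 ave_time:2200     total_time:32621788850
EOF

# Pull out each user's count: field and sort numerically, busiest first.
awk '{for (i=1; i<=NF; i++) if ($i ~ /^count:/) {sub(/^count:/, "", $i); print $i, $1}}' stats.txt | sort -rn | head -3
```

Run against a real capture, the top line immediately points at the user issuing the most RPCs.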

Regards,
Albert
Comment 7 Troy Baer 2021-08-19 10:09:36 MDT
wxops is a customer's service account that runs their workflows.  I will ask them if any of their workflows include repeated invocations of sacct or sreport.
Comment 8 Troy Baer 2021-08-19 10:54:52 MDT
(In reply to Troy Baer from comment #7)
> wxops is a customer's service account that runs their workflows.  I will ask
> them if any of their workflows include repeated invocations of sacct or
> sreport.

It just hit me that most of this user's job scripts do something like "sacct -j $SLURM_JOB_ID.batch <args>" at the end so that they get usage information included in their job output, and they run a pretty large number of jobs.  That's likely where that count is coming from.
Comment 9 Albert Gil 2021-08-20 07:20:31 MDT
Hi Troy,

> > wxops is a customer's service account that runs their workflows.  I will ask
> > them if any of their workflows include repeated invocations of sacct or
> > sreport.
> 
> It just hit me that most of this user's job scripts do something like "sacct
> -j $SLURM_JOB_ID.batch <args>" at the end so that they get usage information
> included in their job output, and they run a pretty large number of jobs. 
> That's likely where that count is coming from.

If it's only 1 sacct per job script, that wouldn't lead to such big numbers.
Anyway, I wouldn't recommend using sacct inside a job script; sstat is probably a better option.
Note that there can be a delay between a job's activity and its accounting in the DB/sacct, so running sacct inside a job may even return empty info if the delay is big enough. sstat, on the other hand, doesn't interact with the DB but directly with the daemons, so it gets the latest data gathered from the job.
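A minimal sketch of that suggestion (a hypothetical batch script, not taken from the ticket): the wxops-style sacct call at the end of the job is replaced with sstat, which queries the step daemons instead of slurmdbd. Here the script is only written to a file for illustration:

```shell
# Write out an example batch script that ends with sstat instead of sacct.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example

srun ./workload

# sstat reads live step data from the daemons, so it adds no load on
# slurmdbd and does not race against delayed accounting-record flushes.
sstat -j "${SLURM_JOB_ID}.batch" --format=JobID,MaxRSS,AveCPU
EOF
```

The exact --format fields are just an example; sites would pick whatever usage fields they want in the job output.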

Regards,
Albert
Comment 10 Troy Baer 2021-08-26 14:46:34 MDT
> If you prefer to wait until 28-Nov, then I would suggest to temporarily close this ticket as infogiven and, in case you reproduce the issue, just reopen it.

I will do that.
Comment 11 Troy Baer 2021-08-26 14:48:25 MDT
[*comment to appease Bugzilla gods*]