| Summary: | Size limit when using AccountingStoreFlags=job_script | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Juergen Salk <juergen.salk> |
| Component: | Configuration | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ulm University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Juergen Salk
2022-03-21 08:52:19 MDT

Is there any internal limit for the maximum size of a job script that will be stored in the accounting database?

Jason Booth

This was mentioned in the field notes during SLUG. It is set via SchedulerParameters:

https://slurm.schedmd.com/slurm.conf.html#OPT_max_script_size=#

> Is there any internal limit for the maximum size of a job script that will be
> stored in the accounting database?

No, however, we recommend you consider limiting what goes in, as this can significantly increase the database size. We do have improvements in 22.05 that will help with the size of the database and job scripts:

https://github.com/SchedMD/slurm/commit/fd6fef3e14a0c6d1484230744289749c0e4b19d0

Juergen Salk

Hi Jason, thank you. That was the right pointer. I had indeed seen that on your slides, and it somehow got stuck in my head that there is a separate size limit for what goes into the database and what is rejected for storage in the database. Now I am trying to get a rough estimate of the impact of activating `AccountingStoreFlags=job_script` on the growth of the database.

Given that we have around 10,000 jobs per day and an average size of 4 kB per job script, does that mean we have to expect an *additional* growth of about 14 GB (365 x 10,000 x 4 kB = 14,600,000 kB) per year, i.e. on top of what would be stored without the job scripts? Or is this too naive?

After having stored around 5,400,000 jobs, our production slurm_acct_db is currently 100 GB in size (without having stored job scripts up to now). Assuming that size grows linearly with job count, this translates to an average growth of around 20 kB (100 GB / 5,400,000) per job, or a growth of about 70 GB (365 x 10,000 x 20 kB) per year without storage of job scripts, but an expected growth of about 84 GB (70 GB + 14 GB) per year with job script storage enabled. Does that make sense, or is that too naive even for a rough estimate?

Now I am also wondering if a database size of 100 GB for 5,400,000 jobs represents a reasonable or insane value.
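As a sanity check, the back-of-envelope arithmetic above can be reproduced in a few lines. This is only a sketch: the per-job figures are the rough site averages quoted in this ticket, not measured constants, and decimal (1 GB = 1,000,000 kB) rather than binary units are assumed throughout.

```python
# Back-of-envelope estimate of slurm_acct_db growth, using the rough
# averages quoted above (assumptions, not measured constants).

JOBS_PER_DAY = 10_000
SCRIPT_KB = 4             # assumed average job script size, in kB
DB_GB = 100               # current database size
JOBS_STORED = 5_400_000   # jobs accounted so far

# Extra yearly growth from storing job scripts (kB -> GB, decimal units)
script_growth_gb = 365 * JOBS_PER_DAY * SCRIPT_KB / 1_000_000

# Average accounting footprint per job today (no scripts stored yet)
kb_per_job = DB_GB * 1_000_000 / JOBS_STORED

# Yearly growth without and with job script storage
base_growth_gb = 365 * JOBS_PER_DAY * kb_per_job / 1_000_000
total_growth_gb = base_growth_gb + script_growth_gb

print(f"script growth:     ~{script_growth_gb:.1f} GB/year")
print(f"per-job footprint: ~{kb_per_job:.1f} kB")
print(f"total with scripts: ~{total_growth_gb:.1f} GB/year")
```

With these inputs the script growth comes out to about 14.6 GB/year and the per-job footprint to about 18.5 kB, which matches the rounded 14 GB and 20 kB figures used above.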
I am asking this because I have noticed that most of the total size can probably be attributed to the job step table, but I am unsure if this is to be expected (to that extent) or not.

```
$ ls -l /var/lib/mysql/slurm_acct_db/ | sort -rn -k4 | head -10
total 92537175
-rw-rw---- 1 mysql mysql 102374572032 Mar 22 18:41 justus2_step_table.ibd
-rw-rw---- 1 mysql mysql   4894752768 Mar 22 18:40 justus2_job_table.ibd
-rw-rw---- 1 mysql mysql    494927872 Mar 22 18:00 justus2_assoc_usage_hour_table.ibd
-rw-rw---- 1 mysql mysql     50331648 Mar 22 00:00 justus2_assoc_usage_day_table.ibd
-rw-rw---- 1 mysql mysql     33554432 Mar 22 14:38 justus2_event_table.ibd
-rw-rw---- 1 mysql mysql     27262976 Mar 22 18:00 justus2_usage_hour_table.ibd
-rw-rw---- 1 mysql mysql     10485760 Mar  1 00:00 justus2_assoc_usage_month_table.ibd
-rw-rw---- 1 mysql mysql      9437184 Mar 22 18:00 txn_table.ibd
-rw-rw---- 1 mysql mysql      9437184 Mar 22 00:00 justus2_usage_day_table.ibd
$
```

Any comment on that would also be highly appreciated. Thank you in advance.

Best regards
Jürgen

Jason Booth

Jürgen - your calculation is mostly correct for 21.08. For array jobs you might expect one job script to be stored for all indices within an array; however, in 21.08 each array job and its children store their own copy of the job script. This greatly increases the footprint in the database for array jobs. As of 22.05 this is no longer the case: we compute a hash of the job script, and each job simply points to that hash value as its job script. This has several advantages:

1. If a user submits most or all of their jobs with the same script, only one copy of that script is stored in the database for all of those jobs.
2. Job arrays no longer duplicate the job script, because of this feature.
3. The user's environment is also hashed and only stored when that environment/hash changes.

So my suggestion would be to move to 22.05 after its release. I generally recommend sites wait for the .1 or .2 release before making the jump.
These maintenance releases generally follow within the first few weeks. Regarding your other question about database size:
> Now I am also wondering if a database size of 100 GB for 5,400,000 jobs
> represents a reasonable or insane value. I am asking this because I have
> noticed that most of the total size can probably be attributed to the job step
> table, but I am unsure if this is to be expected (to that extent) or not.
We have sites that fall in the larger range of 100-300 GB. Depending on your needs as a site, this may be acceptable to you; however, caution should be taken when running large queries for jobs, as these can take some time depending on the underlying storage (NVMe/SSD/hard drive), CPU, and RAM. 100 GB does not look at all unreasonable for a database; however, I would suggest archiving and purging if you are able to do so.
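For reference, archiving and purging are configured in slurmdbd.conf. The fragment below is only an illustrative sketch: the retention periods and archive path are placeholders to be adapted to local policy, not recommendations from this ticket.

```
# slurmdbd.conf (fragment) -- illustrative retention values only
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12months
PurgeStepAfter=6months
PurgeEventAfter=12months
```

Since the step table dominates the database size here, a shorter PurgeStepAfter interval would have the largest effect on disk usage.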
Please let me know if you have any further questions. For now, I am resolving this ticket.