| Summary: | Size limit when using AccountingStoreFlags=job_script | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Juergen Salk <juergen.salk> |
| Component: | Configuration | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ulm University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Juergen Salk
2022-03-21 08:52:19 MDT

Is there any internal limit for the maximum size of a job script that will be stored in the accounting database?

Jason Booth

This was mentioned in the field notes during SLUG. It is set via SchedulerParameters:

https://slurm.schedmd.com/slurm.conf.html#OPT_max_script_size=#

> Is there any internal limit for the maximum size of a job script that will be
> stored in the accounting database?

No, however, we recommend you consider limiting what goes in, as this can significantly increase the database size. We do have improvements in 22.05 that will help with the size of the database and job scripts:

https://github.com/SchedMD/slurm/commit/fd6fef3e14a0c6d1484230744289749c0e4b19d0

Juergen Salk

Hi Jason, thank you. That was the right pointer. I had indeed seen that on your slides, and it somehow got stuck in my head that there is a separate size limit for what goes into the database and what is rejected for storage in the database. Now I am trying to get a rough estimate of the impact of activating `AccountingStoreFlags=job_script` on the growth of the database.

Given that we have around 10,000 jobs per day and an average size of 4 kB per job script, does that mean we have to expect an *additional* growth of about 14 GB (365 x 10,000 x 4 kB = 14,600,000 kB) per year, i.e. on top of what would be stored without the job scripts? Or is this too naive?

After having stored around 5,400,000 jobs, our production slurm_acct_db is currently 100 GB in size (without having stored job scripts up to now). Assuming that size grows linearly with job count, this translates to an average growth of around 20 kB (100 GB / 5,400,000) per job, or a growth of about 70 GB (365 x 10,000 x 20 kB) per year without storage of job scripts, but an expected growth of about 84 GB (70 GB + 14 GB) per year with job script storage enabled. Does that make sense, or is that too naive even for a rough estimate?

Now I am also wondering if a database size of 100 GB for 5,400,000 jobs represents a reasonable or insane value.
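As a sanity check, the back-of-envelope arithmetic above can be reproduced in a few lines. This is only a sketch: the per-job figures are the rough site averages quoted in this ticket, not measured constants, and decimal (1 GB = 1,000,000 kB) rather than binary units are assumed throughout.

```python
# Back-of-envelope estimate of slurm_acct_db growth, using the rough
# averages quoted above (assumptions, not measured constants).

JOBS_PER_DAY = 10_000
SCRIPT_KB = 4             # assumed average job script size, in kB
DB_GB = 100               # current database size
JOBS_STORED = 5_400_000   # jobs accounted so far

# Extra yearly growth from storing job scripts (kB -> GB, decimal units)
script_growth_gb = 365 * JOBS_PER_DAY * SCRIPT_KB / 1_000_000

# Average accounting footprint per job today (no scripts stored yet)
kb_per_job = DB_GB * 1_000_000 / JOBS_STORED

# Yearly growth without and with job script storage
base_growth_gb = 365 * JOBS_PER_DAY * kb_per_job / 1_000_000
total_growth_gb = base_growth_gb + script_growth_gb

print(f"script growth:     ~{script_growth_gb:.1f} GB/year")
print(f"per-job footprint: ~{kb_per_job:.1f} kB")
print(f"total with scripts: ~{total_growth_gb:.1f} GB/year")
```

With these inputs the script growth comes out to about 14.6 GB/year and the per-job footprint to about 18.5 kB, which matches the rounded 14 GB and 20 kB figures used above.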
I am asking this because I have noticed that most of the total size can probably be attributed to the job step table, but I am unsure if this is to be expected (to that extent) or not.

```
$ ls -l /var/lib/mysql/slurm_acct_db/ | sort -rn -k4 | head -10
total 92537175
-rw-rw---- 1 mysql mysql 102374572032 Mar 22 18:41 justus2_step_table.ibd
-rw-rw---- 1 mysql mysql   4894752768 Mar 22 18:40 justus2_job_table.ibd
-rw-rw---- 1 mysql mysql    494927872 Mar 22 18:00 justus2_assoc_usage_hour_table.ibd
-rw-rw---- 1 mysql mysql     50331648 Mar 22 00:00 justus2_assoc_usage_day_table.ibd
-rw-rw---- 1 mysql mysql     33554432 Mar 22 14:38 justus2_event_table.ibd
-rw-rw---- 1 mysql mysql     27262976 Mar 22 18:00 justus2_usage_hour_table.ibd
-rw-rw---- 1 mysql mysql     10485760 Mar  1 00:00 justus2_assoc_usage_month_table.ibd
-rw-rw---- 1 mysql mysql      9437184 Mar 22 18:00 txn_table.ibd
-rw-rw---- 1 mysql mysql      9437184 Mar 22 00:00 justus2_usage_day_table.ibd
$
```

Any comment on that would also be highly appreciated. Thank you in advance.

Best regards
Jürgen

Jason Booth

Jürgen - your calculation is mostly correct for 21.08. For array jobs you might expect one job script to be stored for all indices within an array; however, in 21.08 each array job and its children store their own copy of the job script. This greatly increases the footprint in the database for array jobs. As of 22.05 this is no longer the case: we compute a hash of the job script, and each job simply points to that hash value as its job script. This has several advantages:

1. If a user submits most or all of their jobs with the same script, only one copy of that script is stored in the database for all of those jobs.
2. Job arrays no longer duplicate the job script, because of this feature.
3. The user's environment is also hashed and only stored when that environment/hash changes.

So my suggestion would be to move to 22.05 after its release. I generally recommend sites wait for the .1 or .2 release before making the jump.
These maintenance releases generally follow within the first few weeks. Regarding your other question about database size:
> Now I am also wondering if a database size of 100 GB for 5,400,000 jobs
> represents a reasonable or insane value. I am asking this because I have
> noticed that most of the total size can probably be attributed to the job step
> table, but I am unsure if this is to be expected (to that extent) or not.
We have sites that fall in the larger range of 100-300 GB. Depending on your needs as a site, this may be acceptable to you; however, caution should be taken when running large queries for jobs, as these can take some time depending on the underlying storage (NVMe/SSD/hard drive), CPU, and RAM. 100 GB does not look at all unreasonable for a database; however, I would suggest archiving and purging if you are able to do so.
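For reference, archiving and purging are configured in slurmdbd.conf. The fragment below is only an illustrative sketch: the retention periods and archive path are placeholders to be adapted to local policy, not recommendations from this ticket.

```
# slurmdbd.conf (fragment) -- illustrative retention values only
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12months
PurgeStepAfter=6months
PurgeEventAfter=12months
```

Since the step table dominates the database size here, a shorter PurgeStepAfter interval would have the largest effect on disk usage.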
Please let me know if you have any further questions. For now, I am resolving this ticket.