We have the job_env and job_script flags set in slurm.conf. An unusually large number of jobs (300k in under a day), coupled with a bug in the user's job environment (a recursive sbatch submission with an LD_LIBRARY_PATH that was only ever appended to), caused our chicoma production database to grow from ~10 GB (typical for the past 6-8 months) to 82 GB, with 72 GB of that associated with the job_env_table. As per tickets 16603 and 14514, we will be testing the procedure described in their comments: MariaDB > update <clustername>_job_table set env_vars=NULL; Please confirm that this is the whole procedure; that is, if we want subsequent environment variables to be captured, confirm that nothing needs to be reset afterward. Also, please consider some threshold logic for a future resilience enhancement, such as a job_env_sizelimit, a job_script_sizelimit, or an appropriate throttle to catch this. Thank you.
The tables changed in 22.05. If you are indeed running 22.05.6, that fix won't work. However, I am surprised this issue happened in 22.05. That would mean that each job had a different env_vars string. Though perhaps an env variable was being incremented? -Scott
The environment variables are stored in a table called <clustername>_job_env_table. This is linked to each job in the <clustername>_job_table with env_hash_inx. I recommend you follow the procedure in bug 15383 comment 8 if you want to remove all the data from the <clustername>_job_env_table. -Scott
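For illustration, that linkage can be inspected with a query like this (a sketch only: the hash_inx, env_hash_inx, and env_vars column names match the commands used later in this ticket, so verify them against your own schema before relying on it):

MariaDB > SELECT j.id_job, e.env_vars -- sample a few jobs with their stored envs
    ->   FROM <clustername>_job_table j
    ->   JOIN <clustername>_job_env_table e ON e.hash_inx = j.env_hash_inx
    ->   LIMIT 10;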
The LD_LIBRARY_PATH variable was appended to with every job invocation, so it was changing and growing for each job. Multiply that by 300k+ jobs and it became a problem. We intend to increase our archiving and purging, within the oversight and accounting policies our systems are subject to. Are there parameters such as PurgeJobEnvAfter and PurgeJobScriptAfter? We cannot purge the job records themselves, but we could purge the job scripts and their environments reasonably aggressively, on the order of weeks. Since the tables have changed, please suggest the appropriate "trim" SQL command, perhaps resembling (based on the other comment): MariaDB > truncate <clustername>_job_env_table where user=<user-name>; so that only this particular user's data is removed.
I think you will want something like this: MariaDB > DELETE from <clustername>_job_env_table where hash_inx IN (SELECT env_hash_inx from <clustername>_job_table where id_user=<uid_of_user>); I did test this, but I recommend you double-check it first if some of the data in the job_env_table is important. We currently do not have a feature like PurgeJobEnvAfter or PurgeJobScriptAfter. -Scott
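Before running that DELETE, a cautious preview (a sketch; substitute your cluster name and the target user's numeric uid) can confirm how many rows would be removed:

MariaDB > SELECT COUNT(*) -- rows the DELETE above would remove
    ->   FROM <clustername>_job_env_table
    ->   WHERE hash_inx IN (SELECT env_hash_inx
    ->                      FROM <clustername>_job_table
    ->                      WHERE id_user=<uid_of_user>);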
I've done that and then ran optimize table <clustername>_job_env_table; but I don't see a reduction in the size of the table. Guidance appreciated.
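For reference, the table's on-disk footprint can be checked via information_schema (a sketch assuming the default InnoDB engine; sizes are in MB):

MariaDB > SELECT table_name,
    ->          ROUND((data_length + index_length)/1024/1024) AS size_mb, -- in use
    ->          ROUND(data_free/1024/1024) AS free_mb -- reclaimable by OPTIMIZE
    ->     FROM information_schema.tables
    ->     WHERE table_name = '<clustername>_job_env_table';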
I was impatient. The table size did get reduced after ~10+ minutes. Thank you.
Do you have any other questions about this? -Scott
No additional questions, thank you. As always, we appreciate the prompt, thorough & information-rich responses you & your team provide.
Glad we could help. -Scott
Please reconsider this as a recurring bug which needs to be fixed.
We now purge the script and env tables in lockstep with the job table as specified in PurgeJobAfter. See commit b2bc5ec5670f85. This is in 23.02 and later. Is this what you are asking about? -Scott
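For context, that retention setting lives in slurmdbd.conf (a minimal sketch; the value shown is illustrative only):

# With 23.02 and later, job scripts and environments are purged in
# lockstep with the job records this setting removes.
PurgeJobAfter=12month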
> We now purge the script and env tables in lockstep with the job table as
> specified in PurgeJobAfter.

Not exactly. We really do not want to purge jobs, but we do need to purge job scripts, job environments, and job steps. Purging at the job level is too coarse-grained, but it is our workaround.
We could probably add PurgeScriptAfter / PurgeEnvAfter options as an NRE project. Is this something you are interested in? -Scott
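Hypothetically, such options might sit alongside the existing purge settings in slurmdbd.conf (note: PurgeScriptAfter and PurgeEnvAfter do not exist in released Slurm at the time of this ticket; the names and values below are illustrative only):

# Hypothetical per-table retention, shorter than the job records themselves
PurgeScriptAfter=14days
PurgeEnvAfter=14days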
I see that ticket 16954 is already opened on this issue. Is there a specific reason you chose to reopen this ticket instead of discussing the development request in 16954? -Scott
(In reply to Scott Hilton from comment #18)
> I see that ticket 16954 is already opened on this issue.
>
> Is there a specific reason you chose to reopen this ticket instead of
> discussing the development request in 16954?
>
> -Scott

This one was fresher in my history and I found it first, so there is no particular rationale or technical reason to use this one vs. 16954. FYI, we have requested funding for the 16954 enhancement SOW.

*** This ticket has been marked as a duplicate of ticket 16954 ***