Ticket 16746 - CI/CD load (300K jobs in under a day) + job env. bug caused enormous job_env_table
Status: RESOLVED DUPLICATE of ticket 16954
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 22.05.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-05-16 10:55 MDT by S Senator
Modified: 2025-04-01 13:54 MDT
CC: 6 users

See Also:
Site: LANL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: RHEL
Machine Name: chicoma
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description S Senator 2023-05-16 10:55:07 MDT
We have the job_env and job_script flags set in slurm.conf.
An unusually large job load (300k jobs in under a day), coupled with a bug in the user's job environment (a recursive sbatch submission in which LD_LIBRARY_PATH was only ever appended to), caused our chicoma production database to grow from ~10 GB (typical for the past 6-8 months) to 82 GB, with 72 GB of that associated with the job_env_table.

As per tickets 16603 and 14514, we will be testing the procedure in the comments:
  MariaDB > update <clustername>_job_table set env_vars=NULL;

Please confirm that this is the whole procedure; that is, that nothing needs to be reset afterward if we want subsequent job environments to be captured.

Also, please consider some threshold logic for a future resilience enhancement, such as job_env_sizelimit, job_script_sizelimit, or an appropriate throttle to catch this.

Thank you.
Comment 2 Scott Hilton 2023-05-16 11:42:33 MDT
The tables changed in 22.05. If you are indeed running 22.05.6, that fix won't work. 

However, I am surprised this issue happened in 22.05. That would mean that each job had a different env_vars string. Though perhaps an env variable was being incremented?

-Scott
Comment 3 Scott Hilton 2023-05-16 11:56:21 MDT
The environment variables are stored in a table called <clustername>_job_env_table. This is linked to each job in the <clustername>_job_table with env_hash_inx.

I recommend you follow the procedure in bug 15383 comment 8 if you want to remove all the data from the <clustername>_job_env_table.

-Scott
Comment 4 S Senator 2023-05-16 12:03:26 MDT
The LD_LIBRARY_PATH variable was appended to with every job invocation, so it was changing and growing for each job. Multiply by 300k+ jobs and it became a problem.
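
The failure mode can be sketched in a few lines of shell (the library path is a hypothetical stand-in; this illustrates only the append-only pattern, not the actual job script):

```shell
# Sketch of the bug: each "job" appends to LD_LIBRARY_PATH instead of
# setting it, so every recursive submission inherits a strictly longer,
# and therefore unique, environment string.
LD_LIBRARY_PATH=""
for job in 1 2 3; do
  LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/mylib"
done
echo "${LD_LIBRARY_PATH}"   # prints :/opt/mylib:/opt/mylib:/opt/mylib
```

Because every job's env_vars string is unique, hash-based deduplication in the database buys nothing and each job stores a fresh, ever-larger copy.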

We intend to increase our archiving and purging, within the oversight and accounting policies that our systems are subject to. Are there parameters such as PurgeJobEnvAfter and PurgeJobScriptAfter?

We cannot purge the job records themselves, but we could purge the job scripts and their environments reasonably aggressively, approximately on the order of weeks.

Since the tables have changed, please suggest the appropriate "trim" SQL table update command, perhaps resembling (based on the other comment):

  MariaDB > truncate <clustername>_job_env_table where user=<user-name>;

so that only this particular user-relevant data is truncated.
Comment 5 Scott Hilton 2023-05-16 14:40:24 MDT
I think you will want something like this:
>DELETE from <clustername>_job_env_table where hash_inx IN (SELECT env_hash_inx from <clustername>_job_table where id_user=<uid_of_user>);
I did test this, but I recommend you double-check it first if any of the data in the job_env_table is important.
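
For anyone adapting this statement, its shape can be exercised on a throwaway database first. Below is a minimal sketch using Python's sqlite3 module; the table and column names follow this ticket, the schema is heavily simplified, and note that in a real Slurm database env rows are hash-deduplicated, so a row referenced by one user's jobs could in principle be shared with another user's identical environment:

```python
import sqlite3

# Toy versions of the two tables involved (schema simplified; names
# mirror the ticket's <clustername>_job_table / _job_env_table).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cluster_job_table "
            "(job_id INTEGER, id_user INTEGER, env_hash_inx INTEGER)")
cur.execute("CREATE TABLE cluster_job_env_table "
            "(hash_inx INTEGER, env_vars TEXT)")

# Two jobs for the offending user (uid 1000), one for another user.
cur.executemany("INSERT INTO cluster_job_table VALUES (?, ?, ?)",
                [(1, 1000, 10), (2, 1000, 11), (3, 2000, 12)])
cur.executemany("INSERT INTO cluster_job_env_table VALUES (?, ?)",
                [(10, "LD_LIBRARY_PATH=/a"),
                 (11, "LD_LIBRARY_PATH=/a:/a"),
                 (12, "PATH=/bin")])

# The DELETE ... WHERE hash_inx IN (SELECT env_hash_inx ...) pattern:
# remove only the env rows referenced by that user's jobs.
cur.execute("""DELETE FROM cluster_job_env_table
               WHERE hash_inx IN (SELECT env_hash_inx
                                  FROM cluster_job_table
                                  WHERE id_user = ?)""", (1000,))
conn.commit()

remaining = cur.execute(
    "SELECT hash_inx FROM cluster_job_env_table").fetchall()
print(remaining)  # only the other user's env row survives: [(12,)]
```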

We currently do not have a feature like PurgeJobEnvAfter and PurgeJobScriptAfter.

-Scott
Comment 8 S Senator 2023-05-17 10:11:52 MDT
I've done that and then ran optimize table <clustername>_job_env_table; but I don't see a reduction in the size of the table. Guidance appreciated.
Comment 9 S Senator 2023-05-17 10:31:44 MDT
I was impatient. The table size did get reduced after ~10+ minutes. Thank you.
Comment 10 Scott Hilton 2023-05-17 15:19:49 MDT
Do you have any other questions about this?

-Scott
Comment 11 S Senator 2023-05-18 09:28:25 MDT
No additional questions, thank you. As always, we appreciate the prompt, thorough, and information-rich responses you and your team provide.
Comment 12 Scott Hilton 2023-05-18 10:39:47 MDT
Glad we could help.

-Scott
Comment 13 S Senator 2025-03-31 08:48:55 MDT
Please reconsider this as a recurring bug which needs to be fixed.
Comment 14 Scott Hilton 2025-03-31 14:23:25 MDT
We now purge the script and env tables in lockstep with the job table as specified in PurgeJobAfter. 
See commit b2bc5ec5670f85. This is in 23.02 and later.
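
For reference, the relevant slurmdbd.conf stanza looks something like the sketch below (the retention value is a site-specific placeholder; PurgeJobAfter, ArchiveJobs, and ArchiveDir are the documented parameters):

```
# slurmdbd.conf sketch; 12month is a placeholder retention period.
# In 23.02+, purging job records also purges the matching script/env rows.
PurgeJobAfter=12month
# Archive before purging, to satisfy site accounting/oversight policy.
ArchiveJobs=yes
ArchiveDir=/var/spool/slurmdbd/archive
```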

Is this what you are asking about?

-Scott
Comment 15 S Senator 2025-03-31 14:37:49 MDT
> We now purge the script and env tables in lockstep with the job table as specified in PurgeJobAfter. 

Not exactly. We really do not want to purge jobs. But we do need to purge job scripts, job environments and job steps.  Purging at the job level is too coarse-grained, but is our workaround.
Comment 17 Scott Hilton 2025-04-01 13:14:50 MDT
We could probably add PurgeScriptAfter / PurgeEnvAfter options as an NRE project. Is this something you are interested in?

-Scott
Comment 18 Scott Hilton 2025-04-01 13:19:02 MDT
I see that ticket 16954 is already opened on this issue.

Is there a specific reason you chose to reopen this ticket instead of discussing the development request in 16954?

-Scott
Comment 19 S Senator 2025-04-01 13:54:12 MDT
(In reply to Scott Hilton from comment #18)
> I see that ticket 16954 is already opened on this issue.
> 
> Is there a specific reason you choose to reopen this ticket instead of
> discussing the development request in 16954?
> 
> -Scott

This one was fresher in my history and I found it first, so there was no particular rationale or technical reason to use this one vs. 16954.

FYI- We have requested funding for 16954 Enhance SOW.

*** This ticket has been marked as a duplicate of ticket 16954 ***