| Summary: | cannot disable WCkeys setting 'TrackWCKey=no' | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Quentin Neill <quentin.neill> |
| Component: | Accounting | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | michael.schoenfelder |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SiFive | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm controller log from time of experiment, slurmdbd log from time of experiment | ||
Created attachment 24632 [details]
slurm controller log from time of experiment
Created attachment 24633 [details]
slurmdbd log from time of experiment
Hi Quentin,
The behavior you are showing does look like it's expected. The test job you are submitting is requesting a wckey explicitly, so even though you don't have the tracking of wckeys enabled, jobs are still allowed to request them.
Here's an example where I have the tracking of wckeys disabled for slurmctld and slurmdbd.
$ sacctmgr show configuration | grep -i wckey
TrackWCKey = No
TrackWCKey = No
$ scontrol show config | grep -i wckey
TrackWCKey = No
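The check above can be scripted when it has to be repeated after each daemon restart. The following is only a sketch: `track_wckey_values` is a hypothetical helper name (not a Slurm command), and it assumes the `Name = Value` layout shown in the output above.

```shell
# Sketch: print the TrackWCKey value(s) found in config text read from stdin.
# track_wckey_values is a hypothetical helper, not part of Slurm.
track_wckey_values() {
    # Split on '=' and match the parameter name case-insensitively, so that
    # lines such as "TrackWCKey = No" print "no". Other parameters whose
    # value merely contains "wckeys" are ignored.
    awk -F'=' 'tolower($1) ~ /^[ \t]*trackwckey[ \t]*$/ {
        v = tolower($2)
        gsub(/[ \t]/, "", v)
        print v
    }'
}

# Intended use on a live cluster (not executed here):
#   scontrol show config        | track_wckey_values
#   sacctmgr show configuration | track_wckey_values
```

Reading from stdin keeps the helper independent of which command produced the configuration dump, so the same function covers both slurmctld and slurmdbd.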
Even though I have a default wckey enabled for my user, a key is not added to my job unless I explicitly request it.
$ sacctmgr show user ben format=user,defaultwckey
User Def WCKey
---------- ----------
ben test
$ sbatch -n1 --wrap='srun sleep 15'
Submitted batch job 4575
$ sacct -X -o jobid,state,node,elapsed,wckey -j 4575
JobID State NodeList Elapsed WCKey
------------ ---------- --------------- ---------- ----------
4575 RUNNING node13 00:00:08
If I add a wckey request to my job submission, then it does allow the wckey to be added to the job.
$ sbatch -n1 --wckey=test --wrap='srun sleep 15'
Submitted batch job 4576
$ sacct -X -o jobid,state,node,elapsed,wckey -j 4576
JobID State NodeList Elapsed WCKey
------------ ---------- --------------- ---------- ----------
4576 RUNNING node13 00:00:05 test
If you want to enforce a policy that jobs may not request a WCKey, you would need to add 'wckeys' to the AccountingStorageEnforce parameter. Enabling this turns the tracking of WCKeys back on, so you would want to make sure that users don't have a WCKey associated with them.
$ sacctmgr delete user ben wckey=test
Deleting user WCKeys...
C = knight W = test U = ben
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr show user ben format=user,defaultwckey
User Def WCKey
---------- ----------
ben
$ scontrol show config | grep -i wckey
AccountingStorageEnforce = associations,limits,qos,safe,wckeys
TrackWCKey = Yes
With this configuration, if I try to submit a job that requests a WCKey it will be rejected.
$ sbatch -n1 --wckey=test --wrap='srun sleep 15'
sbatch: error: Batch job submission failed: Invalid wckey specification
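To confirm that this enforcement is active before relying on submissions being rejected, the configuration dump can be checked for 'wckeys' in the AccountingStorageEnforce list. This is a sketch under the assumption that the output keeps the `Name = comma,separated,list` form shown above; `enforces_wckeys` is a hypothetical helper name.

```shell
# Sketch: exit 0 if 'wckeys' appears in AccountingStorageEnforce in the
# config text read from stdin, exit 1 otherwise.
# enforces_wckeys is a hypothetical helper, not part of Slurm.
enforces_wckeys() {
    awk -F'=' '
        tolower($1) ~ /^[ \t]*accountingstorageenforce[ \t]*$/ {
            # Split the value on commas and compare each option exactly,
            # so that a different option containing "wckeys" does not match.
            n = split(tolower($2), opts, ",")
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]/, "", opts[i])
                if (opts[i] == "wckeys") found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use on a live cluster (not executed here):
#   scontrol show config | enforces_wckeys && echo "wckeys are enforced"
```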
In your situation, do you have users who have the wckey added to their job scripts? I can see that it would be difficult to have all users update their scripts, removing a wckey request. Are you seeing similar behavior to what I described, where jobs that request a wckey explicitly have it added, but jobs that don't request one don't have it added?
Thanks,
Ben
Hi Ben,

For added background, see our runaway jobs issue in https://bugs.schedmd.com/show_bug.cgi?id=13040

We have a jenkins user that starts many slurm jobs, either periodically as a sort of 'release' user or based on events connected to users (like patches). We use the WCKey to associate groups of related jobs so that administrators, users, and even scripts can use slurm tools to distinguish between jobs related to one jenkins event vs. another.

The problem was that we saw many, many runaway jobs (up to 300 a day). Moreover, we had periods of controller slowness (controller offline/unresponsive), which all coincided with busier parts of the day across the company.

Anyway, can you speak to what happens when TrackWCKey=no is set but a job requests a WCKey anyway? Does the user->wckey relationship still get stored? (I think the answer is yes based on our experiment, as we saw 'insert into xxx_wckey_table' debugging.)

A little bit more detail: our jobs use --wckey with a wckey that has not already been defined. There are no default wckeys and no wckeys defined with sacctmgr. All wckeys come from srun --wckey or sbatch --wckey. We *want* to do this.

As Quentin pointed out, we are facing a problem with runaway jobs. During our investigation we temporarily set "TrackWCKey=no" to see if that helped with the runaway jobs, but were surprised to see that wckeys were still being tracked by slurmdbd. Oddly, no more runaway jobs were observed, but that is not the point of this ticket. Quentin's last question is the point of this ticket.

Ok, I read through the last part of bug 13040 (from where you started talking about wckeys) and I think I understand what you're trying to work through. If you are requesting a unique wckey for jobs all the time, that does explain why you are seeing inserts to the <cluster>_wckey_table so frequently in the logs. When you request a wckey that doesn't exist and the option to track wckeys is enabled, it will add a new entry to that table. Then the job record will have a reference to the newly created wckey record.

I did some testing with TrackWCKey set to no for both slurmctld and slurmdbd, and the behavior sounds like it's closer to what you want. With it off, a new wckey will not be added to the <cluster>_wckey_table in the database. Jobs that request a wckey will have the name of it inserted into the job record, but the record will not have a reference to a corresponding wckey in the <cluster>_wckey_table.

Here's an example of what I see in my testing. I started with TrackWCKey set to 'yes' for both slurmctld and slurmdbd. I submitted a job with a wckey (test) that didn't exist yet.
$ sbatch -n1 --wckey=test --wrap='srun sleep 30'
Submitted batch job 4585
In the database I can see that this WCKey was added and the job references it with the id_wckey field in the job record.
MariaDB [slurm_2108]> select * from knight_wckey_table;
+---------------+------------+---------+--------+----------+------------+------+
| creation_time | mod_time   | deleted | is_def | id_wckey | wckey_name | user |
+---------------+------------+---------+--------+----------+------------+------+
|    1650903471 | 1650903986 |       0 |      0 |        1 |            | ben  |
|    1650903471 | 1650903471 |       0 |      0 |        2 | *          | ben  |
|    1650903958 | 1650913777 |       0 |      1 |        3 | test       | ben  |
+---------------+------------+---------+--------+----------+------------+------+
3 rows in set (0.000 sec)
MariaDB [slurm_2108]> select id_job,id_wckey,wckey from knight_job_table where id_job=4585;
+--------+----------+-------+
| id_job | id_wckey | wckey |
+--------+----------+-------+
|   4585 |        3 | test  |
+--------+----------+-------+
1 row in set (0.000 sec)
Then I turn off the TrackWCKey parameters and submit another test job, this time using 'test1' as the wckey since 'test' already exists.
$ sbatch -n1 --wckey=test1 --wrap='srun sleep 30'
Submitted batch job 4586
This time there isn't a new wckey created.
MariaDB [slurm_2108]> select * from knight_wckey_table;
+---------------+------------+---------+--------+----------+------------+------+
| creation_time | mod_time   | deleted | is_def | id_wckey | wckey_name | user |
+---------------+------------+---------+--------+----------+------------+------+
|    1650903471 | 1650903986 |       0 |      0 |        1 |            | ben  |
|    1650903471 | 1650903471 |       0 |      0 |        2 | *          | ben  |
|    1650903958 | 1650913777 |       0 |      1 |        3 | test       | ben  |
+---------------+------------+---------+--------+----------+------------+------+
3 rows in set (0.000 sec)
The job record still includes the name of the wckey, but it doesn't have a reference to a record in the wckey table because it wasn't created there.
MariaDB [slurm_2108]> select id_job,id_wckey,wckey from knight_job_table where id_job=4586;
+--------+----------+-------+
| id_job | id_wckey | wckey |
+--------+----------+-------+
|   4586 |        0 | test1 |
+--------+----------+-------+
1 row in set (0.000 sec)
I was also watching the slurmdbd logs for these tests and I can see the insert log statement for the first test, where it creates the wckey. There is not a similar insert statement for the second job. The insert statement that is in the logs adds a record to the <cluster>_wckey_table, so it makes sense that it shows up in the first case and not the second. In the second case it shouldn't add any real overhead to the handling of the job, since it's just adding a string to the existing record creation.

In summary, adding a wckey to different jobs without having Slurm configured to track them will allow you to see the different WCKeys without the added overhead of creating a new record for each of these keys in the WCKey table in the database.

You mention that you did see an insert statement in the logs when you had TrackWCKey off. Is that something you can reproduce? I'm not seeing that behavior on my side.

Thanks,
Ben

This is beginning to make more sense to me now.
The "Track" in "TrackWCKey = yes" means to track in the wckey_table. User requested wckeys via --wckey are always stored ("tracked") in the job table.
My usage of wckeys is in phrases such as:
sacct -a -W ${wckey} --format=State -n | sort | uniq -c
scancel --wckey ${wckey}
squeue -u $user -h -o "%w" | sort | uniq -c
If we don't need wckey_table to do that, then all is good. (I realize that the sacct is the only one that hits the DB, but I wanted to list all the usages.)
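As a rough illustration of the first usage above, job states can be tallied per wckey in one pass instead of one sacct call per key. This is a sketch, not a tested production script: `count_by_wckey` is a hypothetical helper name, and it assumes pipe-delimited rows of the form `JobID|State|WCKey`, which is what sacct's parsable output should produce for that field list.

```shell
# Sketch: count jobs per (WCKey, State) from pipe-delimited rows on stdin.
# count_by_wckey is a hypothetical helper, not part of Slurm.
count_by_wckey() {
    # Field 2 is the state, field 3 the wckey; jobs without a wckey are
    # grouped under "(none)". Output is tab-separated: wckey, state, count.
    awk -F'|' '
        NF >= 3 {
            key = ($3 == "" ? "(none)" : $3)
            n[key "\t" $2]++
        }
        END { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' | LC_ALL=C sort
}

# Intended use on a live cluster (not executed here):
#   sacct -a -X -n -P -o jobid,state,wckey | count_by_wckey
```

Tab-separated output is used because a state value may contain spaces, and `LC_ALL=C` keeps the sort order stable across locales.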
Hi Michael,

Those commands should all work with the wckeys being added to the jobs without having TrackWCKey enabled. I did some testing to verify and they worked as expected. The sacct command looks at the job record and the others use the information slurmctld stores about the jobs.

I'm glad to hear that your needs can be met without having TrackWCKey enabled since it was causing other problems with your workflow. Let me know if anything else comes up or if this ticket is ok to close.

Thanks,
Ben

Hi Ben,
> You mention that you did see an insert statement in the logs when
> you had TrackWCKey off. Is that something you can reproduce?
Actually, I was mistaken: those came from the period when we had re-enabled TrackWCKey, so please ignore my statement.
Quentin
Ben,
> Let me know if anything else comes up or if this ticket is ok to close.
I closed as "infogiven", adjust if you think otherwise.
Quentin
When we did the following:

  Set TrackWCKey=no in slurmdbd.conf on our slurm DB machine
  Set TrackWCKey=no in slurm.conf on our slurm controller machine
  Set TrackWCKey=no in slurm.conf on a node NODE=sigma04
  Restart slurmdbd
  Restart slurmctld
  Restart slurmd on node NODE
  Start a job with a WCKEY on node NODE:

srun -w NODE --wckey='test-wckey1' sleep 300 &
sacct -o jobid,state,node,elapsed,wckey -j 9456859
JobID     State    NodeList        WCKey
--------- -------- --------------- -------------
9456859   RUNNING  sigma04         test-wckey1

We expected a job run like this NOT to have a WCKey associated with it. Instead, we saw jobs running with WCKeys as normal.

We tried several things: turning up logging to Debug2 with some DebugFlags, bouncing slurmdbd, slurmctld and slurmd in different orders, and examining what 'scontrol show config' reported for TrackWCKey. No matter what, WCKeys kept being assigned to jobs and returned by queries, according to 'sacct' output. We observed 'insert into "compute1_wckey_table"' messages in slurmdbd.log. We ran some of our normal flows, and all jobs still seemed to be assigned wckey values.

Note this was an attempt to help debug https://bugs.schedmd.com/show_bug.cgi?id=13040 and, interestingly enough, it did mitigate the runaway jobs pretty clearly. Look for an update there soon.