| Summary: | sreport not showing values for certain accounts | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | User Commands | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | oscar.hernandez |
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Michigan | Alineos Sites: | --- |
| Attachments: |
slurm config
slurmdbd log |
||
Please also attach your slurmdbd.log. I may have you change the loglevel, reproduce and send the updated logs depending on if there is enough information in the logs to draw a conclusion.

Created attachment 29201 [details]
slurmdbd log

(In reply to Jason Booth from comment #1)
> Please also attach your slurmdbd.log. I may have you change the loglevel,
> reproduce and send the updated logs depending on if there is enough
> information in the logs to draw a conclusion.

Morning, Jason,

I've rerun the commands and attached the log file.

Thanks!
David

Hi David,
I was taking a look at how sreport handles account filtering. It seems that before actually aggregating the data for the requested accounts (with Accounts=XXX), it checks that the requested accounts have not been deleted from the cluster.
>accounting_storage/as_mysql: _cluster_get_assocs: DB_ASSOC:
>8(as_mysql_assoc.c:2093) query
>select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.`partition`,
>.........
>t1.delta_qos, t1.is_def, t1.deleted from "cls_assoc_table" as t1 where
>t1.deleted=0 && (t1.acct='test1') order by lft;
In my case, asking for account "test1", you can see that the where clause includes: t1.deleted=0 && (t1.acct='test1').
To confirm, I ran some tests. When querying data for an account that has already been deleted in my cluster, I get behavior similar to yours.
# with flatview
$ sreport -vvv -nP job SizesByAccount -M cls Start=2023-02-01T00:00:00 End=2023-03-09T12:00:00 PrintJobCount Flatview
cls|test1|19|0|0|0|0|0.12%
# filtering by the deleted (no longer existing) account test1
$ sreport -vvv -nP job SizesByAccount -M cls Start=2023-02-01T00:00:00 End=2023-03-09T12:00:00 PrintJobCount Account=test1
(blank output)
Is there any chance that the "engin1" account no longer exists in the cluster? What is the output of "sacctmgr show assoc account=engin1"?
Cheers,
Oscar
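The effect of the `deleted=0` condition described above can be reproduced in a minimal sketch. It assumes a heavily simplified stand-in for `cls_assoc_table` held in an in-memory SQLite database; the rows, the user names, and the helper `live_assocs` are hypothetical, and SQLite's `AND` replaces the `&&` used in the MariaDB query. It only illustrates the filter, not sreport's actual code.

```python
import sqlite3

# Simplified, hypothetical stand-in for the cluster association table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cls_assoc_table ("
    "id_assoc INTEGER, lft INTEGER, rgt INTEGER, "
    "user TEXT, acct TEXT, deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO cls_assoc_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, 5, "", "test1", 1),       # account row, marked deleted
        (2, 3, 4, "alice", "test1", 1),  # user row, marked deleted
        (3, 6, 9, "", "test2", 0),       # account row, still live
        (4, 7, 8, "bob", "test2", 0),    # user row, still live
    ],
)


def live_assocs(acct):
    """Return the non-deleted associations of an account, ordered by lft."""
    cur = conn.execute(
        "SELECT id_assoc, lft, rgt, user FROM cls_assoc_table "
        "WHERE deleted = 0 AND acct = ? ORDER BY lft",
        (acct,),
    )
    return cur.fetchall()


print(live_assocs("test1"))  # deleted account: no associations survive the filter
print(live_assocs("test2"))  # live account: account row plus its user
```

With no surviving associations there is nothing to aggregate for the requested account, which matches the blank output seen for the deleted "test1" account.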
(In reply to Oscar Hernández from comment #5)

Hi, Oscar!

> Was taking a look at how sreport does use the account filtering. It seems
> that before actually aggregating the data for the requested accounts -- with
> Accounts=XXX -- it checks whether the requested accounts are not deleted
> in the cluster.
>
> To confirm, I did some testing on that. And when querying data for an
> account that has already been deleted in my cluster, I get a similar
> behavior than yours.
>
> Is there any chance that "engin1" account does no longer exist in the
> cluster? What is the output of "sacctmgr show assoc account=engin1"?

That is fascinating. This particular account hasn't been deleted. It belongs to one of our busiest departments, so I'd be surprised if it had been. I did just run a basic check:

```
[drhey@gl-build hpc_billing]$ date; sacctmgr show assoc account=engin_root withsub -p | head -n3
Thu Mar 9 06:58:16 EST 2023
Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
greatlakes|engin_root|||1|||cpu=500,gres/gpu=5,mem=7000G||||||||||interactive,normal|||
greatlakes|engin1|||1|||cpu=90,gres/gpu=2,mem=450G|||billing=92189716440|||||||interactive,normal|||
```

David

Hi David,

Thanks for confirming that. I thought it was worth checking, as it is the main difference when adding the filter.

While I continue digging into this, could you check a couple more things?
- If you have access to the database, could you run the following query (it should impose minimal load):

select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.`partition`, t1.shares, t1.grp_tres_mins, t1.grp_tres_run_mins, t1.grp_tres, t1.grp_jobs, t1.grp_jobs_accrue, t1.grp_submit_jobs, t1.grp_wall, t1.max_tres_mins_pj, t1.max_tres_run_mins, t1.max_tres_pj, t1.max_tres_pn, t1.max_jobs, t1.max_jobs_accrue, t1.min_prio_thresh, t1.max_submit_jobs, t1.max_wall_pj, t1.parent_acct, t1.priority, t1.def_qos_id, t1.qos, t1.delta_qos, t1.is_def, t1.deleted from cls_assoc_table as t1 where t1.deleted=0 && t1.acct='engin1' order by lft;

# cls_assoc_table should be replaced by $CLUSTER_NAME_assoc_table.

It should return all associations for account engin1. This query is equivalent to the one sreport triggers. For the moment I am assuming you get data here, but I think it is worth checking directly in the database. I'm not interested in the particular data, just in whether it returns users or not.

- You mentioned that several accounts are affected by this. Does it work for some accounts, then? Is there anything in common that comes to mind among the accounts that fail?

I'll continue looking into this and will let you know if I find anything relevant.

Kind regards,
Oscar

(In reply to Oscar Hernández from comment #7)

Hi Oscar!
> - If you have access to the database, could you run (it should imply minimum
> load):
>
> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct,
> t1.`partition`, t1.shares, t1.grp_tres_mins, t1.grp_tres_run_mins,
> t1.grp_tres, t1.grp_jobs, t1.grp_jobs_accrue, t1.grp_submit_jobs,
> t1.grp_wall, t1.max_tres_mins_pj, t1.max_tres_run_mins, t1.max_tres_pj,
> t1.max_tres_pn, t1.max_jobs, t1.max_jobs_accrue, t1.min_prio_thresh,
> t1.max_submit_jobs, t1.max_wall_pj, t1.parent_acct, t1.priority,
> t1.def_qos_id, t1.qos, t1.delta_qos, t1.is_def, t1.deleted from
> cls_assoc_table as t1 where t1.deleted=0 && t1.acct='engin1' order by lft;
>
> # cls_assoc_table should be replaced by $CLUSTER_NAME_assoc_table.
>
> It should return all associations from account engin1. This query is
> equivalent to the one sreport triggers. For the moment I am assuming you get
> data here, but I think it is worth checking it directly in the database. I'm
> not interested in the particular data, just to know if it returns users or
> not.

The query returns 2215 rows, so I think we're good there.

> - You did mention there are different accounts affected by this. Does it
> work for some accounts then? Is there anything common that comes to your
> mind between the accounts that fail here?

Initially I thought there might be some link between the accounts that don't show up. At first it seemed localized to our highly utilized departmental accounts, which contain hundreds of users, have limits set on them regularly, and have a lot of traffic. But then I noticed much smaller, lab-based accounts that don't change much. So, I don't have a good theory yet.

The command (i.e. the one not using FlatView) works with the majority of the accounts passed to it. I can't give you an exact percentage, but anecdotally it's 98-99%. I can say that with relative confidence, as we have this sreport command wrapped in a script where the accounts fed to the command's accounts= parameter come from Slurm itself (e.g. from sacctmgr). We noticed the discrepancy in the resulting output, where things were empty for accounts known to exist.

> I'll continue taking a look into this. Will let you know if I find anything
> relevant.

Thanks!
David

Hi David,

Thanks for the extra details. After reviewing the code a bit more, I suspect this could be related to the association lft and rgt values.

What happens when you combine flatview and the accounts filter?

> sreport -nP job SizesByAccount -M $CLUSTER_NAME Start=2023-02-01T00:00:00 End=2023-03-01T00:00:00 PrintJobCount flatview Accounts=engin1

Do you get the expected output?

If you do get data with the previous command, it would be great if you could run a few more queries on the database. I am limiting each one to the first 3 rows:

> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by lft limit 3;
> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by lft desc limit 3;
> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by rgt limit 3;
> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by rgt desc limit 3;

The idea is to check the rgt and lft limits, so that they can be compared against the code that actually matches a job's association to an account.

Kind regards,
Oscar

(In reply to Oscar Hernández from comment #10)

Hi, Oscar,

> Thanks for the extra details provided. After revising a bit more the code, I
> have the suspicion this could be related to association lft and rgt values.
That makes sense, now that you say it. We had a bug for something else related to users showing up outside of the account hierarchy, and it was determined then that the issue was the rgt,lft values. We had planned to address that at an upcoming maintenance.

> What happens when you combine flatview and the accounts filter?
>
> > sreport -nP job SizesByAccount -M $CLUSTER_NAME Start=2023-02-01T00:00:00 End=2023-03-01T00:00:00 PrintJobCount flatview Accounts=engin1
>
> Do you get the expected output?

Yes, we get the expected output.

> In case you do get data with the previous command, it would be great if you
> could run a bit more queries on the database, I am limiting each one to the
> first 3 lines:

Here's the output:

MariaDB [slurm_acct_db]> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by lft limit 3;
+----------+-------+-------+--------+--------+---------+
| id_assoc | lft   | rgt   | user   | acct   | deleted |
+----------+-------+-------+--------+--------+---------+
|    34998 | 73784 | 73785 | zcfan  | engin1 |       0 |
|     8737 | 73787 | 82312 |        | engin1 |       0 |
|    46928 | 73788 | 73789 | poshao | engin1 |       0 |
+----------+-------+-------+--------+--------+---------+
3 rows in set (0.042 sec)

MariaDB [slurm_acct_db]> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by lft desc limit 3;
+----------+-------+-------+----------+--------+---------+
| id_assoc | lft   | rgt   | user     | acct   | deleted |
+----------+-------+-------+----------+--------+---------+
|     8738 | 82310 | 82311 | aaalkay  | engin1 |       0 |
|     8739 | 82308 | 82309 | aablove  | engin1 |       0 |
|     8740 | 82306 | 82307 | aadityal | engin1 |       1 |
+----------+-------+-------+----------+--------+---------+
3 rows in set (0.006 sec)

MariaDB [slurm_acct_db]> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by rgt limit 3;
+----------+-------+-------+----------+--------+---------+
| id_assoc | lft   | rgt   | user     | acct   | deleted |
+----------+-------+-------+----------+--------+---------+
|    34998 | 73784 | 73785 | zcfan    | engin1 |       0 |
|    46928 | 73788 | 73789 | poshao   | engin1 |       0 |
|    46915 | 73790 | 73791 | junhaoqi | engin1 |       0 |
+----------+-------+-------+----------+--------+---------+
3 rows in set (0.005 sec)

MariaDB [slurm_acct_db]> select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where t1.acct='engin1' order by rgt desc limit 3;
+----------+-------+-------+---------+--------+---------+
| id_assoc | lft   | rgt   | user    | acct   | deleted |
+----------+-------+-------+---------+--------+---------+
|     8737 | 73787 | 82312 |         | engin1 |       0 |
|     8738 | 82310 | 82311 | aaalkay | engin1 |       0 |
|     8739 | 82308 | 82309 | aablove | engin1 |       0 |
+----------+-------+-------+---------+--------+---------+
3 rows in set (0.005 sec)

David

Hi David,

> Here's the output
>
> MariaDB [slurm_acct_db]> select distinct t1.id_assoc, t1.lft, t1.rgt,
> t1.user, t1.acct, t1.deleted from greatlakes_assoc_table as t1 where
> t1.acct='engin1' order by lft limit 3;
> +----------+-------+-------+--------+--------+---------+
> | id_assoc | lft   | rgt   | user   | acct   | deleted |
> +----------+-------+-------+--------+--------+---------+
> |    34998 | 73784 | 73785 | zcfan  | engin1 |       0 |
> |     8737 | 73787 | 82312 |        | engin1 |       0 |
> |    46928 | 73788 | 73789 | poshao | engin1 |       0 |
> +----------+-------+-------+--------+--------+---------+

This one confirms it! The zcfan user is the one creating the conflict.

The zcfan user's lft value (73784) is lower than the engin1 account's lft (73787). This tricks sreport, which relies on lft order to obtain the parent account's lft and rgt values.

So, essentially, the only jobs it counts as valid to report are those submitted by an association with lft/rgt values within (73784-73785). But there are actually no jobs matching this, hence the blank output. The correct range would be (73787-82312).

Fixing the associations should definitely resolve this issue.

>We had a bug for something else related to users showing up outside of the
>account hierarchy and it was determined then that the issue was the rgt,lft
>values. We had planned to address that at an upcoming maintenance.

Just for completeness, could you share the bug in which this was discussed?

Let me know if you have any other questions about this.

Oscar

(In reply to Oscar Hernández from comment #12)

Hi Oscar!

> This one confirms it! zcfan user is the one creating the conflicts.
>
> zcfan user lft(73784) value is lower than the engin1 account lft(73787).
> This tricks sreport, which trusts in lft order to get the account parent lft
> and rgt values.

Thanks for confirming!

> So, essentially, it is only counting as valid jobs to report the ones that
> are submitted by an association with lft/rgt values within (73784-73785).
> But there are actually no jobs matching this, hence the blank output. The
> correct range would be (73787-82312).
>
> Fixing the associations should definitely resolve this issue.
>
> >We had a bug for something else related to users showing up outside of the
> >account hierarchy and it was determined then that the issue was the rgt,lft
> >values. We had planned to address that at an upcoming maintenance.
> Just for completion, could you share the bug in which this was discussed?

Most definitely! It was bug 15767.

> Let me know if you have any other doubt on that.

I think that the evidence is mounting up that we need to fix our rgt,lft things sooner than later :)

David

Hi David,
> I think that the evidence is mounting up that we need to fix our rgt,lft
> things sooner than later :)
Yeah, looks like it is :). Bear in mind that this situation may leave some users out of control, especially if you have limits defined at the account level.
I see that Benny has already given you instructions on how to fix the lft and rgt values, so I am closing this one. Please do not hesitate to reopen it if you have any follow-up questions.
Cheers,
Oscar
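The lft/rgt diagnosis from the earlier comment can be sketched with the non-deleted engin1 rows shown in this thread. This is a simplified model that assumes the behavior as described (the range is taken from the first row in lft order), not a copy of sreport's code; the function names are hypothetical.

```python
# (id_assoc, lft, rgt, user) rows for acct='engin1', copied from the query
# output in this thread. An empty user marks the account's own association.
ROWS = [
    (34998, 73784, 73785, "zcfan"),
    (8737, 73787, 82312, ""),
    (46928, 73788, 73789, "poshao"),
    (46915, 73790, 73791, "junhaoqi"),
    (8739, 82308, 82309, "aablove"),
    (8738, 82310, 82311, "aaalkay"),
]


def first_row_range(rows):
    """Range taken from the first row in lft order (the assumed shortcut)."""
    first = min(rows, key=lambda r: r[1])
    return first[1], first[2]


def account_row_range(rows):
    """Range of the account's own association (the row with no user)."""
    for _, lft, rgt, user in rows:
        if user == "":
            return lft, rgt
    return None


def users_within(rows, bounds):
    """Users whose lft/rgt interval lies inside the given bounds."""
    lo, hi = bounds
    return [u for _, lft, rgt, u in rows if u and lo <= lft and rgt <= hi]


def misplaced_users(rows):
    """Users whose interval is not nested inside the account's interval."""
    bounds = account_row_range(rows)
    if bounds is None:
        return []
    lo, hi = bounds
    return [u for _, lft, rgt, u in rows if u and not (lo < lft and rgt < hi)]


print(first_row_range(ROWS))                        # (73784, 73785)
print(account_row_range(ROWS))                      # (73787, 82312)
print(users_within(ROWS, first_row_range(ROWS)))    # ['zcfan']
print(users_within(ROWS, account_row_range(ROWS)))  # everyone except zcfan
print(misplaced_users(ROWS))                        # ['zcfan']
```

A check along the lines of `misplaced_users`, run per account, is one way to look for other accounts affected by the same inconsistency before the lft/rgt values are repaired.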
(In reply to Oscar Hernández from comment #14)

Oscar

> Yeah, looks like it is :). Bear in mind that this situation may leave some
> users out of control, especially if you have some limits defined at account
> level.

What is meant here by "out of control"? Could users with limits, or users of accounts that have limits on the account, be able to circumvent these limits, for example?

David

Hi David,
> What is meant here by "out of control"? Could users with limits, or users of
> accounts that have limits on the account, be able to circumvent these limits
> for example?
Yes. If an account has a Grp limit defined, any users whose associations fall outside its lft,rgt range would not be affected by that limit. I tested it to confirm.
For example:
Let's say I set GrpJobs=1 on account acct:
sacctmgr modify account acct cluster=cls set GrpJobs=1
Listing the limits now:
Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
cls|acct|||1|20|1|||||||||||high,low,normal|high||
cls|acct|oscar||1|20||||||||||||high,low,normal|high||
When user oscar submits more than 1 job, you get the expected result:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
159 debug wrap oscar PD 0:00 1 (AssocGrpJobsLimit)
158 debug wrap oscar R 0:02 1 c4
However, if I break the database on purpose, setting oscar's lft-rgt values outside acct's range, this limit is not applied and user oscar can have many jobs running at the same time. In account engin1, your user "zcfan" would be equivalent to mine in this example. I also tested this with GrpTRESMins, so I am pretty sure this would happen with any Grp limit.
PS: I am leaving the bug as resolved. If you wish to reopen it, you can change the status to open; that way I will easily notice the update.
Cheers,
Oscar
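Why a misplaced association escapes a Grp limit can be illustrated with a toy model. It assumes, purely for illustration, that an account's GrpJobs limit reaches only the associations nested inside the account's lft/rgt interval; the `Assoc` class, the helper functions, and all lft/rgt values are hypothetical and are not Slurm's enforcement code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Assoc:
    name: str
    lft: int
    rgt: int
    grp_jobs: Optional[int] = None  # None means no GrpJobs limit is set


def enclosing(assoc: Assoc, accounts: List[Assoc]) -> List[Assoc]:
    """Accounts whose lft/rgt interval strictly contains the association."""
    return [a for a in accounts if a.lft < assoc.lft and assoc.rgt < a.rgt]


def may_start(assoc: Assoc, accounts: List[Assoc], running: Dict[str, int]) -> bool:
    """True if no enclosing account has already reached its GrpJobs limit."""
    for acct in enclosing(assoc, accounts):
        if acct.grp_jobs is not None and running.get(acct.name, 0) >= acct.grp_jobs:
            return False
    return True


# Hypothetical values: acct spans (10, 20) and has GrpJobs=1.
acct = Assoc("acct", 10, 20, grp_jobs=1)
oscar_nested = Assoc("oscar", 11, 12)  # correctly nested under acct
oscar_broken = Assoc("oscar", 8, 9)    # lft/rgt moved outside acct's range
running = {"acct": 1}                  # one job already running in acct

print(may_start(oscar_nested, [acct], running))  # False: the limit applies
print(may_start(oscar_broken, [acct], running))  # True: the limit is never consulted
```

In this model the correctly nested association is held back, much like the pending job with reason AssocGrpJobsLimit above, while the association outside the range is never checked against the account's limit.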
|
Created attachment 29195 [details]
slurm config

Hello,

I am noticing something peculiar with sreport and the SizesByAccount report. In particular, certain accounts return nothing if I pass accounts= to the command. But those same accounts will show up if I use FlatView.

An example:

```
[root@glctld ~]# date; sreport -nP job SizesByAccount -M $CLUSTER_NAME Start=2023-02-01T00:00:00 End=2023-03-01T00:00:00 PrintJobCount Accounts=engin1
Tue Mar 7 12:31:51 EST 2023
[root@glctld ~]# echo $?
0
[root@glctld ~]# date; sreport -nP job SizesByAccount -M $CLUSTER_NAME Start=2023-02-01T00:00:00 End=2023-03-01T00:00:00 PrintJobCount FlatView | grep engin1
Tue Mar 7 12:37:50 EST 2023
greatlakes|engin1|1760|0|0|0|0|0.15%
```