| Summary: | sacctmgr output | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | User Commands | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Michigan | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Attachments: | slurm.conf | ||
Hi,

I will be happy to look into this for you. Let me verify your request on my end and I will get back to you as soon as possible.

Thanks

Hi,

Thank you for your patience. If you are familiar with LFT and RGT, which are explained here:

https://slurm.schedmd.com/sacctmgr.html#OPT_RGT

When these left and right association values overlap, discrepancies like the ones in your examples usually appear. What I would suggest is that you do a mysqldump to back up your current database, and then, during your next scheduled maintenance, run `slurmdbd -R[comma separated cluster name list]` as explained here:

https://slurm.schedmd.com/slurmdbd.html#OPT_-R[comma-separated-cluster-name-list]

This resets the lft and rgt values of the associations in the given cluster list.

Thanks

Hi,

Was I able to answer all your questions regarding this issue? If so, can we go ahead and close this ticket? Please let me know if you require anything else concerning this problem.

Thanks

Hi,

Yes, thanks! We have a couple of additional questions on implementing your suggestion. Would we be able to do this work by *only* taking the database down? In this case, the ctld's for the various clusters would continue running while we took the dbd down, did the backup, and implemented the fix. If we are able to do this, would the fix impact associations in such a way that it affects jobs that were running, or even queued to run, once the dbd was brought back online?

David

This is possible, though I would caution you that the slurmctld has a limit to the number of messages it can cache and retain before new ones are discarded. You can view this in the sdiag output as "DBD Agent queue size":

https://slurm.schedmd.com/sdiag.html#OPT_DBD-Agent-queue-size

I would say this is acceptable as long as it does not take place during a time when a large number of jobs are submitted, start, or complete.

Thanks

Hi,

Do you have any further questions on this issue? If not, can we go ahead and close this ticket?

Thanks

Hi, Benny,

We can close it for now, thanks!

David

Thanks, and have a great day.
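For anyone following along, the suggested procedure can be sketched as a small maintenance script. This is a hedged sketch, not the exact commands from the thread: the database name `slurm_acct_db`, service management via `systemctl`, the backup path, and the cluster name `greatlakes` are all assumptions to adapt to your site; only the `mysqldump` backup and `slurmdbd -R` steps come from the advice above.

```shell
#!/bin/sh
# Sketch of the suggested maintenance steps. ASSUMPTIONS: the accounting
# database is named slurm_acct_db, services are managed with systemctl,
# and the cluster is named greatlakes -- adjust all of these for your site.
set -eu

DRY_RUN=${DRY_RUN:-1}   # default: print each step instead of running it

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# 1. Stop slurmdbd; the slurmctld daemons may keep running and will queue
#    their messages (watch "DBD Agent queue size" in sdiag meanwhile).
run systemctl stop slurmdbd

# 2. Back up the accounting database before touching it.
run mysqldump -r /root/slurm_acct_db.backup.sql slurm_acct_db

# 3. Start slurmdbd with -R to rebuild lft/rgt values for the cluster.
run slurmdbd -Rgreatlakes
```

With `DRY_RUN=1` (the default) the script only prints the steps; set `DRY_RUN=0` once the commands have been reviewed for your environment.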
Created attachment 28402 [details] slurm.conf

Hello,

We have a handful of users that appear to exist outside of the hierarchy in output from sacctmgr, and we are wondering if there's any way to fix it - or stop it from happening in the future. For example:

```
$ sacctmgr show assoc account=epid582w23_class withsub tree format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

One can see that the user anuha is a child of the account, yet their entry exists outside of the hierarchical structure. And if we look at the _class_root account, they don't show at all:

```
$ sacctmgr show assoc account=epid582w23_class_root withsub tree format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class_root|root|1||43149||greatlakes|interactive,normal|1,3|2016|2095
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

Any ideas as to what might be happening?

David
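For anyone hitting the same symptom: the LFT/RGT columns in the sacctmgr output are a nested-set encoding of the association tree, so in a consistent tree every child's (LFT, RGT) interval must nest strictly inside its parent's. A quick awk sketch over the `-P` output (the data below is copied from the first example above; the parent account epid582w23_class has lft=2017, rgt=2094) flags the stray user:

```shell
#!/bin/sh
# Check that each user's (LFT, RGT) interval nests inside the parent
# account's interval (lft=2017, rgt=2094, taken from the output above).
# Columns in the -P output: $4 = User, $10 = LFT, $11 = RGT.
bad=$(awk -F'|' -v plft=2017 -v prgt=2094 '
    NR > 1 && $4 != "" {                    # user rows only
        if (!(plft < $10 && $11 < prgt))    # must nest strictly inside
            print $4
    }' <<'EOF'
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
EOF
)
echo "$bad"
```

Only anuha is printed: its (2014, 2015) interval lies entirely outside the parent's (2017, 2094), which matches the broken tree rendering, and `slurmdbd -R` rebuilds exactly these values.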