Ticket 15767

Summary: sacctmgr output
Product: Slurm Reporter: ARC Admins <arc-slurm-admins>
Component: User Commands Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: University of Michigan Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: slurm.conf

Description ARC Admins 2023-01-10 13:36:58 MST
Created attachment 28402 [details]
slurm.conf

Hello,

We have a handful of users who appear to exist outside of the hierarchy in output from sacctmgr, and we are wondering if there's any way to fix it - or stop it from happening in the future.

For example:

```
$ sacctmgr show assoc account=epid582w23_class withsub tree  format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT  -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
 epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
 epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
 epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
 epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

One can see that the user anuha is a child of the account, yet their entry exists outside of the hierarchical structure. And if we look at the _class_root account, anuha doesn't show up at all:

```
$ sacctmgr show assoc account=epid582w23_class_root withsub tree  format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT  -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class_root|root|1||43149||greatlakes|interactive,normal|1,3|2016|2095
 epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
  epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
  epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
  epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
  epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

Any ideas as to what might be happening?
David
Comment 1 Benny Hedayati 2023-01-10 13:50:42 MST
Hi,

I will be happy to look into this for you. Let me verify your request on my end, and I will get back to you as soon as possible.

Thanks
Comment 2 Benny Hedayati 2023-01-11 11:17:58 MST
Hi,

Thank you for your patience. The tree view is driven by the LFT and RGT values of each association, which are explained here:

https://slurm.schedmd.com/sacctmgr.html#OPT_RGT

If these left and right values overlap, you typically see discrepancies like the ones in your examples.
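To make the overlap concrete, here is a minimal sketch of the nested-set containment check (the `is_inside` helper is our own illustration, not a Slurm API), using the LFT/RGT values from the sacctmgr output above:

```python
# Nested-set model: association B nests inside association A
# iff A.lft < B.lft and B.rgt < A.rgt.

def is_inside(parent, child):
    """True if child's (lft, rgt) interval nests inside parent's."""
    return parent[0] < child[0] and child[1] < parent[1]

# (lft, rgt) pairs taken from the sacctmgr -P output above
epid582w23_class = (2017, 2094)  # the account's own association
agarret = (2044, 2045)           # a correctly nested user association
anuha = (2014, 2015)             # the stray user association

print(is_inside(epid582w23_class, agarret))  # True  -> printed inside the tree
print(is_inside(epid582w23_class, anuha))    # False -> printed outside it
```

Because anuha's interval falls entirely outside the account's interval, the tree renderer has no parent range to nest it under, which is why it prints at the wrong level.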

What I would suggest is that you take a mysqldump backup of your current database, and then, during your next scheduled maintenance, run slurmdbd -R with a comma-separated cluster name list, as explained here:

https://slurm.schedmd.com/slurmdbd.html#OPT_-R[comma-separated-cluster-name-list]

This resets the lft and rgt values of the associations in the given cluster list.

Thanks
Comment 3 Benny Hedayati 2023-01-17 08:58:47 MST
Hi,

Was I able to answer all your questions regarding your issue? If so, can we go ahead and close this ticket? Please let me know if you require anything else concerning this problem.

Thanks,
Comment 4 ARC Admins 2023-01-18 09:01:53 MST
Hi,

Yes, thanks! We have a couple of additional questions on implementing your suggestion. Would we be able to do this work by *only* taking the database down? In this case, the ctld's for the various clusters would continue running while we took the dbd down, did the backup, and implemented the fix.

If we are able to do this, would the fix impact associations in such a way that it would affect jobs that were running, or queued to run, once the dbd was brought back online?

David
Comment 6 Benny Hedayati 2023-01-19 09:27:04 MST
This is possible, though I would caution you that the slurmctld has a limit on the number of messages it can cache and retain before new ones are discarded. You can view this in the sdiag output as "DBD Agent queue size".

https://slurm.schedmd.com/sdiag.html#OPT_DBD-Agent-queue-size

I would say this is acceptable as long as you are aware that it should not take place during a time when a large number of jobs are submitted, start, or complete.
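If you want to watch that counter during the maintenance window, here is a hedged sketch of one way to read it programmatically, assuming sdiag emits a line of the form `DBD Agent queue size: N` (the sample text below is fabricated for illustration):

```python
import re

def dbd_agent_queue_size(sdiag_text):
    """Extract the DBD Agent queue size from sdiag's plain-text output.

    Returns None if the line is not present.
    """
    m = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_text)
    return int(m.group(1)) if m else None

# In practice you would feed this from the real command, e.g.:
#   subprocess.run(["sdiag"], capture_output=True, text=True).stdout
sample = "Server thread count: 3\nDBD Agent queue size: 142\n"
print(dbd_agent_queue_size(sample))  # 142
```

Polling this while the dbd is down would tell you how close the ctld's cache is to discarding messages.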

Thanks
Comment 7 Benny Hedayati 2023-01-25 08:37:01 MST
Hi,

Do you have any further questions on this issue? If not, can we go ahead and close this ticket?

Thanks
Comment 8 ARC Admins 2023-01-25 08:42:38 MST
Hi, Benny,

We can close it for now, thanks!

David
Comment 9 Benny Hedayati 2023-01-25 09:13:05 MST
Thanks and have a great day.