Created attachment 28402 [details]
slurm.conf

Hello,

We have a handful of users that appear to exist outside of the hierarchy in the output of sacctmgr, and we are wondering whether there is any way to fix it, or to stop it from happening in the future. For example:

```
$ sacctmgr show assoc account=epid582w23_class withsub tree format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

One can see that the user anuha is a child of the account, yet their entry sits outside the hierarchical structure: their LFT/RGT range of 2014-2015 falls outside the account's range of 2017-2094. And if we look at the _class_root account, they do not show up at all:

```
$ sacctmgr show assoc account=epid582w23_class_root withsub tree format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class_root|root|1||43149||greatlakes|interactive,normal|1,3|2016|2095
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

Any ideas as to what might be happening?

David
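In case it helps, here is a rough sanity check we sketched to spot such rows. It is a hypothetical script, not something from the Slurm tooling: it reads the parsable (`-P`) output above, records each account's LFT/RGT interval by association ID, and flags any user association whose interval does not nest inside its parent's. The here-doc holds sample rows from the first listing so the snippet is self-contained; in practice you would pipe `sacctmgr ... -P` into `check_nesting` instead.

```shell
# Hypothetical nesting check for sacctmgr LFT/RGT values.
# Field positions assume the header shown above:
# Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
check_nesting() {
  awk -F'|' '
    NR == 1 { next }                                   # skip header
    $4 == "" { alft[$5] = $10; argt[$5] = $11; next }  # account rows, keyed by assoc ID
    { user[$5] = $4; par[$5] = $3; lft[$5] = $10; rgt[$5] = $11 }  # user rows
    END {
      for (id in user) {
        p = par[id]
        if (lft[id] < alft[p] || rgt[id] > argt[p])
          printf "user %s (assoc %s): %s-%s outside parent %s-%s\n",
                 user[id], id, lft[id], rgt[id], alft[p], argt[p]
      }
    }'
}

# Sample rows from the listing above; anuha should be flagged, agarret should not.
check_nesting <<'EOF'
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
EOF
```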
Hi,

I will be happy to look into this for you. Let me verify your request on my end, and I will get back to you as soon as possible.

Thanks
Hi,

Thank you for your patience. You may be familiar with LFT and RGT, which are explained here:

https://slurm.schedmd.com/sacctmgr.html#OPT_RGT

The associations form a nested-set hierarchy, so when these left and right values overlap there are usually discrepancies like the ones you are describing. What I would suggest is that you first do a mysqldump to back up your current database, and then, during your next scheduled maintenance, run slurmdbd -R[comma separated cluster name list], as explained here:

https://slurm.schedmd.com/slurmdbd.html#OPT_-R[comma-separated-cluster-name-list]

This resets the lft and rgt values of the associations in the given cluster list.

Thanks
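For reference, the maintenance window might look something like the sketch below. The database name (slurm_acct_db), MySQL user, backup path, and cluster name (greatlakes) are placeholders you would adjust for your site, and the DRY_RUN guard only prints the commands rather than running them, so you can review the sequence first.

```shell
# Sketch of the maintenance steps, assuming a database named slurm_acct_db
# and a cluster named greatlakes (both placeholders).
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Stop the slurmdbd; the slurmctld daemons may keep running and will
#    queue their accounting updates in the meantime.
run systemctl stop slurmdbd

# 2. Back up the accounting database before touching anything.
run mysqldump -u slurm -p slurm_acct_db -r /root/slurm_acct_db.backup.sql

# 3. Run slurmdbd once with -R to rebuild the lft/rgt values for the
#    listed cluster(s), then bring the daemon back up normally.
run slurmdbd -Rgreatlakes
run systemctl start slurmdbd
```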
Hi,

Was I able to answer all your questions regarding this issue? If so, can we go ahead and close this ticket? Please let me know if you require anything else concerning this problem.

Thanks,
Hi,

Yes, thanks! We have a couple of additional questions about implementing your suggestion. Would we be able to do this work by taking *only* the database down? In that case, the slurmctld daemons for the various clusters would continue running while we took the slurmdbd down, did the backup, and applied the fix. If we are able to do this, would the fix change the associations in a way that affects jobs that were running, or even queued to run, once the slurmdbd was brought back online?

David
This is possible, though I would caution you that the slurmctld has a limit on the number of messages it can cache and retain before new ones are discarded. You can view this in the sdiag output as "DBD Agent queue size":

https://slurm.schedmd.com/sdiag.html#OPT_DBD-Agent-queue-size

I would say this is acceptable as long as you are aware that it should not take place during a time when a large number of jobs are being submitted, starting, or completing.

Thanks
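If it is useful, a small filter like the one below could be used to watch that value while the slurmdbd is down. The `printf` sample text stands in for real `sdiag` output to keep the snippet self-contained; in practice you would run `sdiag | dbd_queue_size` in a loop. The exact label wording is taken from the sdiag documentation linked above.

```shell
# Extract the "DBD Agent queue size" line from sdiag output so the
# backlog can be monitored while the slurmdbd is offline.
dbd_queue_size() {
  awk -F': *' '/DBD Agent queue size/ { print $2 }'
}

# In production: sdiag | dbd_queue_size
# Sample text used here so the snippet runs on its own:
printf 'Server thread count: 3\nDBD Agent queue size: 125\n' | dbd_queue_size
```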
Hi,

Do you have any further questions on this issue? If not, can we go ahead and close this ticket?

Thanks
Hi Benny,

We can close it for now, thanks!

David
Thanks and have a great day.