Created attachment 28402 [details]
slurm.conf

Hello,

We have a handful of users that appear to exist outside of the hierarchy in the output of sacctmgr, and we are wondering whether there is any way to fix it, or to stop it from happening in the future. For example:

```
$ sacctmgr show assoc account=epid582w23_class withsub tree format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

One can see that the user anuha is a child of the account, yet their entry sits outside the hierarchical structure: their LFT/RGT range of 2014-2015 falls outside the account's range of 2017-2094. And if we look at the _class_root account, they do not show up at all:

```
$ sacctmgr show assoc account=epid582w23_class_root withsub tree format=cluster,account,parentname,parentid,User,id,Partition,Cluster,QoS,QoSRaw,LFT,RGT -P | head
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class_root|root|1||43149||greatlakes|interactive,normal|1,3|2016|2095
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
epid582w23_class||43150|amandh|43159||greatlakes|class|17|2076|2077
epid582w23_class||43150|apirani|43173||greatlakes|class|17|2048|2049
epid582w23_class||43150|aubahr|43169||greatlakes|class|17|2056|2057
```

Any ideas as to what might be happening?

David
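In case it helps, here is a rough sanity check we sketched to spot such rows. It is a hypothetical script, not something from the Slurm tooling: it reads the parsable (`-P`) output above, records each account's LFT/RGT interval by association ID, and flags any user association whose interval does not nest inside its parent's. The here-doc holds sample rows from the first listing so the snippet is self-contained; in practice you would pipe `sacctmgr ... -P` into `check_nesting` instead.

```shell
# Hypothetical nesting check for sacctmgr LFT/RGT values.
# Field positions assume the header shown above:
# Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
check_nesting() {
  awk -F'|' '
    NR == 1 { next }                                   # skip header
    $4 == "" { alft[$5] = $10; argt[$5] = $11; next }  # account rows, keyed by assoc ID
    { user[$5] = $4; par[$5] = $3; lft[$5] = $10; rgt[$5] = $11 }  # user rows
    END {
      for (id in user) {
        p = par[id]
        if (lft[id] < alft[p] || rgt[id] > argt[p])
          printf "user %s (assoc %s): %s-%s outside parent %s-%s\n",
                 user[id], id, lft[id], rgt[id], alft[p], argt[p]
      }
    }'
}

# Sample rows from the listing above; anuha should be flagged, agarret should not.
check_nesting <<'EOF'
Account|ParentName|ParentID|User|ID|Partition|Cluster|QOS|QOS_RAW|LFT|RGT
epid582w23_class||43150|anuha|44081||greatlakes|class|17|2014|2015
epid582w23_class|epid582w23_class_root|43149||43150||greatlakes|class|17|2017|2094
epid582w23_class||43150|agarret|44103||greatlakes|class|17|2044|2045
EOF
```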
Hi,

I will be happy to look into this for you. Let me verify your request on my end, and I will get back to you as soon as possible.

Thanks
Hi,

Thank you for your patience. You may be familiar with LFT and RGT, which are explained here:

https://slurm.schedmd.com/sacctmgr.html#OPT_RGT

The associations form a nested-set hierarchy, so when these left and right values overlap there are usually discrepancies like the ones you are describing. What I would suggest is that you first do a mysqldump to back up your current database, and then, during your next scheduled maintenance, run slurmdbd -R[comma separated cluster name list], as explained here:

https://slurm.schedmd.com/slurmdbd.html#OPT_-R[comma-separated-cluster-name-list]

This resets the lft and rgt values of the associations in the given cluster list.

Thanks
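For reference, the maintenance window might look something like the sketch below. The database name (slurm_acct_db), MySQL user, backup path, and cluster name (greatlakes) are placeholders you would adjust for your site, and the DRY_RUN guard only prints the commands rather than running them, so you can review the sequence first.

```shell
# Sketch of the maintenance steps, assuming a database named slurm_acct_db
# and a cluster named greatlakes (both placeholders).
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Stop the slurmdbd; the slurmctld daemons may keep running and will
#    queue their accounting updates in the meantime.
run systemctl stop slurmdbd

# 2. Back up the accounting database before touching anything.
run mysqldump -u slurm -p slurm_acct_db -r /root/slurm_acct_db.backup.sql

# 3. Run slurmdbd once with -R to rebuild the lft/rgt values for the
#    listed cluster(s), then bring the daemon back up normally.
run slurmdbd -Rgreatlakes
run systemctl start slurmdbd
```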
Hi,

Was I able to answer all your questions regarding this issue? If so, can we go ahead and close this ticket? Please let me know if you require anything else concerning this problem.

Thanks,
Hi,

Yes, thanks! We have a couple of additional questions about implementing your suggestion. Would we be able to do this work by taking *only* the database down? In that case, the slurmctld daemons for the various clusters would continue running while we took the slurmdbd down, did the backup, and applied the fix. If we are able to do this, would the fix change the associations in a way that affects jobs that were running, or even queued to run, once the slurmdbd was brought back online?

David
This is possible, though I would caution you that the slurmctld has a limit on the number of messages it can cache and retain before new ones are discarded. You can view this in the sdiag output as "DBD Agent queue size":

https://slurm.schedmd.com/sdiag.html#OPT_DBD-Agent-queue-size

I would say this is acceptable as long as you are aware that it should not take place during a time when a large number of jobs are being submitted, starting, or completing.

Thanks
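If it is useful, a small filter like the one below could be used to watch that value while the slurmdbd is down. The `printf` sample text stands in for real `sdiag` output to keep the snippet self-contained; in practice you would run `sdiag | dbd_queue_size` in a loop. The exact label wording is taken from the sdiag documentation linked above.

```shell
# Extract the "DBD Agent queue size" line from sdiag output so the
# backlog can be monitored while the slurmdbd is offline.
dbd_queue_size() {
  awk -F': *' '/DBD Agent queue size/ { print $2 }'
}

# In production: sdiag | dbd_queue_size
# Sample text used here so the snippet runs on its own:
printf 'Server thread count: 3\nDBD Agent queue size: 125\n' | dbd_queue_size
```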
Hi,

Do you have any further questions on this issue? If not, can we go ahead and close this ticket?

Thanks
Hi Benny,

We can close it for now, thanks!

David
Thanks and have a great day.