I'm trying to create a new account:

[root@slurm5 ~]# sacctmgr create account cluster=summit name=clcsci48300417 parent=ucb-projects fairshare=10
 Adding Account(s)
  clcsci48300417
 Settings
  Description = Account Name
  Organization = Parent/Account Name
 Associations
  A = clcsci4830 C = summit
 Settings
  Fairshare = 10
  Parent = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

However, the new account is shown in `sshare` at the root level, i.e., with no parent:

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417                          10    0.500000           0      0.000000
root                                          0.000000 12020693368      1.000000
root                 root                1    0.000093           0      0.000000   0.952484
csu                                   2100    0.195695  1408035747      0.117134
csu-general                             20    0.181818   231969748      0.164747
csu-projects                            80    0.727273           0      0.000000
csu-testing                             10    0.090909  1176065999      0.835253
csu-summit-akr                         100    0.009319           0      0.000000
csu-summit-bio                          40    0.003728           0      0.000000
csu-summit-cfd                          20    0.001864     5868326      0.000488
csu-summit-crw                         240    0.022365   523982562      0.043590
csu-summit-fhw                          40    0.003728           0      0.000000
csu-summit-hal                          50    0.004659           0      0.000000
csu-summit-mat                          40    0.003728           0      0.000000
rmacc                                  910    0.084801           0      0.000000
rmacc-general                           20    0.111111           0      0.000000
rmacc-projects                          80    0.444444           0      0.000000
rmacc-testing                           80    0.444444           0      0.000000
ucb                                   6210    0.578697  9602499989      0.798831
ucb-general                             20    0.198020  5773257189      0.601224
ucb-projects                            80    0.792079      617361      0.000064
tutorial1                               10    0.500000      617361      1.000000
ucb-testing                              1    0.009901  3828625438      0.398711
ucb-summit-eav                         120    0.011183       29741      0.000002
--- snip ---

When I delete the account, it seems like Slurm did think it was a child of ucb-projects:

[root@slurm5 ~]# sacctmgr delete account cluster=summit name=clcsci48300417 parent=ucb-projects
 Deleting account associations...
  C = summit A = clcsci48300417 of ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

It looks like it's making the account a part of the ucb-projects organization, but not a child of the ucb-projects account:

[root@slurm5 ~]# sacctmgr show account clcsci48300417
   Account                Descr                  Org
---------- -------------------- --------------------
clcsci483+       clcsci48300417         ucb-projects

[root@slurm5 ~]# sacctmgr show association Account=clcsci48300417 Format=Account,ParentID,ParentName
   Account Par ID   Par Name
---------- ------ ----------

To compare:

[root@slurm5 ~]# sacctmgr show associations Account=csu-projects Format=Account,ParentID,ParentName,Fairshare
   Account Par ID   Par Name     Share
---------- ------ ---------- ---------
csu-proje+     13        csu        80

Might be related to https://bugs.schedmd.com/show_bug.cgi?id=2537
That's definitely not a good sign; this tends to indicate issues with the association tree structure. The most common cause of that is attempting to modify the database directly through mysql. Bug 2537 was most likely caused by them manually deleting users from the database which corrupted the hierarchy. Can you attach (or send directly to me through email if you'd rather not have it publicly attached to the bug) the output from 'scontrol show assoc' and your current slurm.conf file?
Tim,

(Working alongside Jonathon to add some possible additional info.)

Is the issue here possibly just with the way that sshare is presenting the tree? It seems as though every other way I look at clcsci48300417 makes it seem like it has the correct associations. Here's what I just did as a test:

# sacctmgr create account cluster=summit name=clcsci48300417 parent=ucb-projects
 Associations
  A = clcsci4830 C = summit
 Settings
  Parent = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417                           1    0.090909           0      0.000000
root                                          0.000000 12136407298      1.000000
root                 root                1    0.000093           0      0.000000   0.952586
csu                                   2100    0.195695  1703879163      0.140392
csu-general                             20    0.181818   231824664      0.136058
csu-projects                            80    0.727273           0      0.000000
csu-testing                             10    0.090909  1472054498      0.863942
csu-summit-akr                         100    0.009319           0      0.000000
csu-summit-bio                          40    0.003728           0      0.000000
csu-summit-cfd                          20    0.001864     5649832      0.000466
csu-summit-crw                         240    0.022365   504744609      0.041589
csu-summit-fhw                          40    0.003728           0      0.000000
csu-summit-hal                          50    0.004659           0      0.000000
csu-summit-mat                          40    0.003728           0      0.000000
rmacc                                  910    0.084801           0      0.000000
rmacc-general                           20    0.111111           0      0.000000
rmacc-projects                          80    0.444444           0      0.000000
rmacc-testing                           80    0.444444           0      0.000000
ucb                                   6210    0.578697  9284230410      0.764992
ucb-general                             20    0.198020  5635805992      0.607029
ucb-projects                            80    0.792079      592077      0.000064
tutorial1                               10    0.909091      592077      1.000000
ucb-testing                              1    0.009901  3647832340      0.392907
ucb-summit-eav                         120    0.011183       28329      0.000002
ucb-summit-gfd                          80    0.007455           0      0.000000
-- snip --

# sacctmgr show account clcsci48300417
   Account                Descr                  Org
---------- -------------------- --------------------
clcsci483+       clcsci48300417         ucb-projects

# sacctmgr show association Account=clcsci48300417 Format=Account,ParentID,ParentName -p
Account|Par ID|Par Name|
clcsci48300417|17|ucb-projects|

As a comparison:

# sacctmgr show association Account=tutorial1 Format=Account,ParentID,ParentName -p
Account|Par ID|Par Name|
tutorial1|17|ucb-projects|
-- snip --

Now I give clcsci48300417 the same fairshare as tutorial1:

# sacctmgr modify account where cluster=summit name=clcsci48300417 set fairshare=10

And sshare does now show that it and tutorial1 are sharing equally under ucb-projects:

# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417                          10    0.500000           0      0.000000
root                                          0.000000 12136756948      1.000000
root                 root                1    0.000093           0      0.000000   0.952586
csu                                   2100    0.195695  1703622328      0.140369
csu-general                             20    0.181818   231820816      0.136075
csu-projects                            80    0.727273           0      0.000000
csu-testing                             10    0.090909  1471801511      0.863925
csu-summit-akr                         100    0.009319           0      0.000000
csu-summit-bio                          40    0.003728           0      0.000000
csu-summit-cfd                          20    0.001864     5648861      0.000465
csu-summit-crw                         240    0.022365   504657863      0.041581
csu-summit-fhw                          40    0.003728           0      0.000000
csu-summit-hal                          50    0.004659           0      0.000000
csu-summit-mat                          40    0.003728           0      0.000000
rmacc                                  910    0.084801           0      0.000000
rmacc-general                           20    0.111111           0      0.000000
rmacc-projects                          80    0.444444           0      0.000000
rmacc-testing                           80    0.444444           0      0.000000
ucb                                   6210    0.578697  9283981302      0.764947
ucb-general                             20    0.198020  5636183903      0.607087
ucb-projects                            80    0.792079      591975      0.000064
tutorial1                               10    0.500000      591975      1.000000
ucb-testing                              1    0.009901  3647205422      0.392849
ucb-summit-eav                         120    0.011183       28325      0.000002
ucb-summit-gfd                          80    0.007455           0      0.000000

But its location in the tree output from sshare is confusing.

Regards,
Pete
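The parent check in this comment can be scripted for anyone repeating it. The following is only a minimal sketch, assuming the `-p` output keeps the three-column `Account|Par ID|Par Name|` layout shown in this thread; `parent_from_parsable` is a hypothetical helper name, not something Slurm provides.

```shell
# Hypothetical helper: print the "Par Name" column for one account.
# Reads `sacctmgr show association ... -p` output on stdin and assumes the
# column order Account|Par ID|Par Name| used in this thread.
parent_from_parsable() {
    account="$1"
    # Skip the header line, then match the account name exactly.
    awk -F'|' -v acct="$account" 'NR > 1 && $1 == acct { print $3 }'
}

# Usage against a live cluster (not run here):
#   sacctmgr show association Account=clcsci48300417 \
#       Format=Account,ParentID,ParentName -p |
#       parent_from_parsable clcsci48300417
```

Using `-p` matters here because the default column output truncates long account names (`clcsci483+` above), which would defeat an exact match on the name.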
Now when I add another account (called "justatest"), it shows up at the top of the sshare output, and clcsci48300417 is in the tree under ucb-projects as expected:

# sacctmgr create account cluster=summit name=justatest parent=ucb-projects
 Adding Account(s)
  justatest
 Settings
  Description = Account Name
  Organization = Parent/Account Name
 Associations
  A = justatest C = summit
 Settings
  Parent = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
justatest                                1    0.047619           0      0.000000
root                                          0.000000 12138563424      1.000000
root                 root                1    0.000093           0      0.000000   0.952586
csu                                   2100    0.195695  1702607801      0.140264
csu-general                             20    0.181818   231817806      0.136155
csu-projects                            80    0.727273           0      0.000000
csu-testing                             10    0.090909  1470789994      0.863845
csu-summit-akr                         100    0.009319           0      0.000000
csu-summit-bio                          40    0.003728           0      0.000000
csu-summit-cfd                          20    0.001864     5644979      0.000465
csu-summit-crw                         240    0.022365   504311030      0.041546
csu-summit-fhw                          40    0.003728           0      0.000000
csu-summit-hal                          50    0.004659           0      0.000000
csu-summit-mat                          40    0.003728           0      0.000000
rmacc                                  910    0.084801           0      0.000000
rmacc-general                           20    0.111111           0      0.000000
rmacc-projects                          80    0.444444           0      0.000000
rmacc-testing                           80    0.444444           0      0.000000
ucb                                   6210    0.578697  9283345808      0.764781
ucb-general                             20    0.198020  5638055408      0.607330
ucb-projects                            80    0.792079      591568      0.000064
clcsci48300417                          10    0.476190           0      0.000000
tutorial1                               10    0.476190      591568      1.000000
ucb-testing                              1    0.009901  3644698830      0.392606
ucb-summit-eav                         120    0.011183       28305      0.000002
ucb-summit-gfd                          80    0.007455           0      0.000000
Sorry for not getting back faster.

There definitely appears to be an issue with the internal accounts hierarchy. We'd added a way to rebuild that automatically in such cases - if you're able to manually restart slurmdbd with the -R flag (possibly in combination with -D -vvv to keep it in the foreground and increase the verbosity) that should be able to correct this automatically. (Although I'd recommend having a recent database snapshot at hand in case of issues.)

I'd certainly like to know what caused that, but unfortunately I suspect the cause of that is lost much further back in the logs somewhere.

- Tim
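The restart described above might look roughly like the following. This is a sketch under assumptions not stated in this thread: slurmdbd is managed by a systemd unit named `slurmdbd`, the accounting database is called `slurm_acct_db`, and `mysqldump` can find credentials (for example in `~/.my.cnf`). The `run` and `rebuild_assoc_hierarchy` names and the default snapshot path are hypothetical.

```shell
# Hypothetical helper: echo the command instead of running it when DRY_RUN
# is set, so the sequence can be reviewed before touching the database.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Snapshot the accounting database, then start slurmdbd in the foreground
# with -R (rebuild hierarchy), -D (stay in foreground) and -vvv (verbose).
# Assumed names: systemd unit "slurmdbd", database "slurm_acct_db".
rebuild_assoc_hierarchy() {
    snapshot="${1:-./slurm_acct_db-snapshot.sql}"
    run systemctl stop slurmdbd &&
    run mysqldump --single-transaction --result-file="$snapshot" slurm_acct_db &&
    run slurmdbd -R -D -vvv
}
```

Running `DRY_RUN=1 rebuild_assoc_hierarchy` first prints the three commands without executing them. For the real run, something like `rebuild_assoc_hierarchy 2>&1 | tee slurmdbd-rebuild.log` would keep the verbose output for later inspection.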
(In reply to Tim Wickberg from comment #5)
> Sorry for not getting back faster.
>
> There definitely appears to be an issue with the internal accounts
> hierarchy. We'd added a way to rebuild that automatically in such cases - if
> you're able to manually restart slurmdbd with the -R flag (possibly in
> combination with -D -vvv to keep it in the foreground and increase the
> verbosity) that should be able to correct this automatically. (Although I'd
> recommend having a recent database snapshot at hand in case of issues.)
>
> I'd certainly like to know what caused that, but unfortunately I suspect the
> cause of that is lost much further back in the logs somewhere.
>
> - Tim

Did you have a chance to restart slurmdbd with the -R flag to see if that'd resolve the issue?
We recently restarted slurmdbd with -R, but we neglected to capture verbose log output to see if it actually rebuilt anything. We'll try to replicate the issue and see if it's still a problem.
(In reply to Jonathon Anderson from comment #7)
> We recently restarted slurmdbd with -R, but we neglected to capture verbose
> log output to see if it actually rebuilt anything. We'll try to replicate
> the issue and see if it's still a problem.

I'm guessing that's fixed this thus far? Is it okay to mark this resolved for now?
Sure, you can close this. We haven't been able to reproduce the problem since slurmdbd -R; but we'll let you know if it reoccurs.
(In reply to Jonathon Anderson from comment #9)
> Sure, you can close this. We haven't been able to reproduce the problem
> since slurmdbd -R; but we'll let you know if it reoccurs.

Okay, glad things have been working smoothly since then. Marking resolved/infogiven.