| Summary: | new slurm account doesn't create parent association | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jonathon Anderson <jonathon.anderson> |
| Component: | Database | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | peter.ruprecht |
| Version: | 16.05.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Colorado | Slinky Site: | --- |
That's definitely not a good sign; this tends to indicate issues with the association tree structure. The most common cause of that is attempting to modify the database directly through mysql. Bug 2537 was most likely caused by them manually deleting users from the database which corrupted the hierarchy.

Can you attach (or send directly to me through email if you'd rather not have it publicly attached to the bug) the output from 'scontrol show assoc' and your current slurm.conf file?

Tim,
(Working alongside Jonathon to add some possible additional info.)
Is the issue here possibly just with the way that sshare is presenting the tree? Every other way I look at the clcsci48300417 account makes it seem like it has the correct associations.
Here's what I just did as a test:
# sacctmgr create account cluster=summit name=clcsci48300417 parent=ucb-projects
Associations
A = clcsci4830 C = summit
Settings
Parent = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
# sshare
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417 1 0.090909 0 0.000000
root 0.000000 12136407298 1.000000
root root 1 0.000093 0 0.000000 0.952586
csu 2100 0.195695 1703879163 0.140392
csu-general 20 0.181818 231824664 0.136058
csu-projects 80 0.727273 0 0.000000
csu-testing 10 0.090909 1472054498 0.863942
csu-summit-akr 100 0.009319 0 0.000000
csu-summit-bio 40 0.003728 0 0.000000
csu-summit-cfd 20 0.001864 5649832 0.000466
csu-summit-crw 240 0.022365 504744609 0.041589
csu-summit-fhw 40 0.003728 0 0.000000
csu-summit-hal 50 0.004659 0 0.000000
csu-summit-mat 40 0.003728 0 0.000000
rmacc 910 0.084801 0 0.000000
rmacc-general 20 0.111111 0 0.000000
rmacc-projects 80 0.444444 0 0.000000
rmacc-testing 80 0.444444 0 0.000000
ucb 6210 0.578697 9284230410 0.764992
ucb-general 20 0.198020 5635805992 0.607029
ucb-projects 80 0.792079 592077 0.000064
tutorial1 10 0.909091 592077 1.000000
ucb-testing 1 0.009901 3647832340 0.392907
ucb-summit-eav 120 0.011183 28329 0.000002
ucb-summit-gfd 80 0.007455 0 0.000000
-- snip --
# sacctmgr show account clcsci48300417
Account Descr Org
---------- -------------------- --------------------
clcsci483+ clcsci48300417 ucb-projects
# sacctmgr show association Account=clcsci48300417 Format=Account,ParentID,ParentName -p
Account|Par ID|Par Name|
clcsci48300417|17|ucb-projects|
As a comparison:
# sacctmgr show association Account=tutorial1 Format=Account,ParentID,ParentName -p
Account|Par ID|Par Name|
tutorial1|17|ucb-projects|
-- snip --
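
The comparison above can also be scripted. The following is a minimal sketch, assuming the pipe-delimited output of `sacctmgr ... -p` has already been captured as text; the helper names `parse_parsable` and `same_parent` are hypothetical and not part of Slurm.

```python
def parse_parsable(text):
    """Parse pipe-delimited text (as printed by `sacctmgr ... -p`) into dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    def split(line):
        # Each line in the captured output ends with a trailing '|'.
        if line.endswith("|"):
            line = line[:-1]
        return line.split("|")

    header = split(lines[0])
    return [dict(zip(header, split(ln))) for ln in lines[1:]]


def same_parent(text_a, text_b):
    """True if the first association in each output has the same parent."""
    a = parse_parsable(text_a)[0]
    b = parse_parsable(text_b)[0]
    return (a["Par ID"], a["Par Name"]) == (b["Par ID"], b["Par Name"])


# Sample text copied from the two commands shown above.
new_account = "Account|Par ID|Par Name|\nclcsci48300417|17|ucb-projects|\n"
reference = "Account|Par ID|Par Name|\ntutorial1|17|ucb-projects|\n"

print(same_parent(new_account, reference))
```

With the two outputs above, both accounts report parent ID 17 (ucb-projects), which is why the database view looks correct even though the sshare listing does not.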
Now I give clcsci48300417 equal fairshare as tutorial1:
# sacctmgr modify account where cluster=summit name=clcsci48300417 set fairshare=10
And sshare does now show that it and tutorial1 are sharing equally under ucb-projects:
# sshare
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417 10 0.500000 0 0.000000
root 0.000000 12136756948 1.000000
root root 1 0.000093 0 0.000000 0.952586
csu 2100 0.195695 1703622328 0.140369
csu-general 20 0.181818 231820816 0.136075
csu-projects 80 0.727273 0 0.000000
csu-testing 10 0.090909 1471801511 0.863925
csu-summit-akr 100 0.009319 0 0.000000
csu-summit-bio 40 0.003728 0 0.000000
csu-summit-cfd 20 0.001864 5648861 0.000465
csu-summit-crw 240 0.022365 504657863 0.041581
csu-summit-fhw 40 0.003728 0 0.000000
csu-summit-hal 50 0.004659 0 0.000000
csu-summit-mat 40 0.003728 0 0.000000
rmacc 910 0.084801 0 0.000000
rmacc-general 20 0.111111 0 0.000000
rmacc-projects 80 0.444444 0 0.000000
rmacc-testing 80 0.444444 0 0.000000
ucb 6210 0.578697 9283981302 0.764947
ucb-general 20 0.198020 5636183903 0.607087
ucb-projects 80 0.792079 591975 0.000064
tutorial1 10 0.500000 591975 1.000000
ucb-testing 1 0.009901 3647205422 0.392849
ucb-summit-eav 120 0.011183 28325 0.000002
ucb-summit-gfd 80 0.007455 0 0.000000
But its location in the tree output from sshare is confusing.
Regards,
Pete
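
A quick way to spot this symptom in captured sshare output is to look for account rows printed before the `root` row. This is only a sketch: it assumes that `root` is normally the first row of the listing, and the function name `rows_before_root` is hypothetical.

```python
def rows_before_root(sshare_text):
    """Return account names that appear before the first `root` row.

    Assumption: in a consistent tree, `root` is the first row of the listing.
    If no `root` row is present, every account name is returned.
    """
    names = []
    for line in sshare_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        first = fields[0]
        # Skip the column header and the dashed separator / snip lines.
        if first == "Account" or set(first) == {"-"}:
            continue
        if first == "root":
            break
        names.append(first)
    return names


# First few rows of the sshare output quoted above.
sample = """\
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417 10 0.500000 0 0.000000
root 0.000000 12136756948 1.000000
root root 1 0.000093 0 0.000000 0.952586
csu 2100 0.195695 1703622328 0.140369
"""

print(rows_before_root(sample))
```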
Now when I add another account (called "justatest"), it shows up at the top of the sshare output, and clcsci48300417 is in the tree under ucb-projects as expected:
# sacctmgr create account cluster=summit name=justatest parent=ucb-projects
Adding Account(s)
justatest
Settings
Description = Account Name
Organization = Parent/Account Name
Associations
A = justatest C = summit
Settings
Parent = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
# sshare
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
justatest 1 0.047619 0 0.000000
root 0.000000 12138563424 1.000000
root root 1 0.000093 0 0.000000 0.952586
csu 2100 0.195695 1702607801 0.140264
csu-general 20 0.181818 231817806 0.136155
csu-projects 80 0.727273 0 0.000000
csu-testing 10 0.090909 1470789994 0.863845
csu-summit-akr 100 0.009319 0 0.000000
csu-summit-bio 40 0.003728 0 0.000000
csu-summit-cfd 20 0.001864 5644979 0.000465
csu-summit-crw 240 0.022365 504311030 0.041546
csu-summit-fhw 40 0.003728 0 0.000000
csu-summit-hal 50 0.004659 0 0.000000
csu-summit-mat 40 0.003728 0 0.000000
rmacc 910 0.084801 0 0.000000
rmacc-general 20 0.111111 0 0.000000
rmacc-projects 80 0.444444 0 0.000000
rmacc-testing 80 0.444444 0 0.000000
ucb 6210 0.578697 9283345808 0.764781
ucb-general 20 0.198020 5638055408 0.607330
ucb-projects 80 0.792079 591568 0.000064
clcsci48300417 10 0.476190 0 0.000000
tutorial1 10 0.476190 591568 1.000000
ucb-testing 1 0.009901 3644698830 0.392606
ucb-summit-eav 120 0.011183 28305 0.000002
ucb-summit-gfd 80 0.007455 0 0.000000
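
The NormShares values in these listings are consistent with each account's RawShares divided by the sum of RawShares over the children of ucb-projects, including the account that is displayed at the top. The sketch below only checks that arithmetic against the numbers shown; it is an observation about this output, not a description of how Slurm computes the value internally, and `normalized_shares` is a hypothetical helper.

```python
def normalized_shares(raw_shares):
    """Divide each sibling's raw shares by the siblings' total."""
    total = sum(raw_shares.values())
    return {name: shares / total for name, shares in raw_shares.items()}


# RawShares taken from the sshare listings above.
first_test = normalized_shares({"clcsci48300417": 1, "tutorial1": 10})
after_justatest = normalized_shares(
    {"justatest": 1, "clcsci48300417": 10, "tutorial1": 10}
)

for name, value in after_justatest.items():
    print(f"{name} {value:.6f}")
```

If justatest really had no parent, its share would not be expected to line up with the ucb-projects siblings this way, which supports the idea that the stored association is right and only the displayed position is off.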

Sorry for not getting back faster.

There definitely appears to be an issue with the internal accounts hierarchy. We'd added a way to rebuild that automatically in such cases - if you're able to manually restart slurmdbd with the -R flag (possibly in combination with -D -vvv to keep it in the foreground and increase the verbosity) that should be able to correct this automatically. (Although I'd recommend having a recent database snapshot at hand in case of issues.)

I'd certainly like to know what caused that, but unfortunately I suspect the cause of that is lost much further back in the logs somewhere.

- Tim

(In reply to Tim Wickberg from comment #5)
> Sorry for not getting back faster.
>
> There definitely appears to be an issue with the internal accounts
> hierarchy. We'd added a way to rebuild that automatically in such cases - if
> you're able to manually restart slurmdbd with the -R flag (possibly in
> combination with -D -vvv to keep it in the foreground and increase the
> verbosity) that should be able to correct this automatically. (Although I'd
> recommend having a recent database snapshot at hand in case of issues.)
>
> I'd certainly like to know what caused that, but unfortunately I suspect the
> cause of that is lost much further back in the logs somewhere.
>
> - Tim

Did you have a chance to restart slurmdbd with the -R flag to see if that'd resolve the issue?

We recently restarted slurmdbd with -R, but we neglected to capture verbose log output to see if it actually rebuilt anything. We'll try to replicate the issue and see if it's still a problem.

(In reply to Jonathon Anderson from comment #7)
> We recently restarted slurmdbd with -R, but we neglected to capture verbose
> log output to see if it actually rebuilt anything. We'll try to replicate
> the issue and see if it's still a problem.

I'm guessing that's fixed this thus far? Is it okay to mark this resolved for now?

Sure, you can close this. We haven't been able to reproduce the problem since slurmdbd -R; but we'll let you know if it reoccurs.

(In reply to Jonathon Anderson from comment #9)
> Sure, you can close this. We haven't been able to reproduce the problem
> since slurmdbd -R; but we'll let you know if it reoccurs.

Okay, glad things have been working smoothly since then. Marking resolved/infogiven.

I'm trying to create a new account:

[root@slurm5 ~]# sacctmgr create account cluster=summit name=clcsci48300417 parent=ucb-projects fairshare=10
Adding Account(s)
clcsci48300417
Settings
Description = Account Name
Organization = Parent/Account Name
Associations
A = clcsci4830 C = summit
Settings
Fairshare = 10
Parent = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

However, the new account is shown in `sshare` at the root level, i.e. with no parent:

Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417 10 0.500000 0 0.000000
root 0.000000 12020693368 1.000000
root root 1 0.000093 0 0.000000 0.952484
csu 2100 0.195695 1408035747 0.117134
csu-general 20 0.181818 231969748 0.164747
csu-projects 80 0.727273 0 0.000000
csu-testing 10 0.090909 1176065999 0.835253
csu-summit-akr 100 0.009319 0 0.000000
csu-summit-bio 40 0.003728 0 0.000000
csu-summit-cfd 20 0.001864 5868326 0.000488
csu-summit-crw 240 0.022365 523982562 0.043590
csu-summit-fhw 40 0.003728 0 0.000000
csu-summit-hal 50 0.004659 0 0.000000
csu-summit-mat 40 0.003728 0 0.000000
rmacc 910 0.084801 0 0.000000
rmacc-general 20 0.111111 0 0.000000
rmacc-projects 80 0.444444 0 0.000000
rmacc-testing 80 0.444444 0 0.000000
ucb 6210 0.578697 9602499989 0.798831
ucb-general 20 0.198020 5773257189 0.601224
ucb-projects 80 0.792079 617361 0.000064
tutorial1 10 0.500000 617361 1.000000
ucb-testing 1 0.009901 3828625438 0.398711
ucb-summit-eav 120 0.011183 29741 0.000002
--- snip ---

When I delete the account, it seems like Slurm did think it was a child of ucb-projects:

[root@slurm5 ~]# sacctmgr delete account cluster=summit name=clcsci48300417 parent=ucb-projects
Deleting account associations...
C = summit A = clcsci48300417 of ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

It looks like it's making the account a part of the ucb-projects organization, but not a child of the ucb-projects account:

[root@slurm5 ~]# sacctmgr show account clcsci48300417
Account Descr Org
---------- -------------------- --------------------
clcsci483+ clcsci48300417 ucb-projects
[root@slurm5 ~]# sacctmgr show association Account=clcsci48300417 Format=Account,ParentID,ParentName
Account Par ID Par Name
---------- ------ ----------

To compare:

[root@slurm5 ~]# sacctmgr show associations Account=csu-projects Format=Account,ParentID,ParentName,Fairshare
Account Par ID Par Name Share
---------- ------ ---------- ---------
csu-proje+ 13 csu 80

Might be related to https://bugs.schedmd.com/show_bug.cgi?id=2537