Ticket 3683

Summary: new slurm account doesn't create parent association
Product: Slurm
Reporter: Jonathon Anderson <jonathon.anderson>
Component: Database
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: peter.ruprecht
Version: 16.05.10
Hardware: Linux
OS: Linux
Site: University of Colorado

Description Jonathon Anderson 2017-04-10 13:17:38 MDT
I'm trying to create a new account:

[root@slurm5 ~]# sacctmgr create account cluster=summit name=clcsci48300417 parent=ucb-projects fairshare=10
 Adding Account(s)
  clcsci48300417
 Settings
  Description     = Account Name
  Organization    = Parent/Account Name
 Associations
  A = clcsci4830 C = summit    
 Settings
  Fairshare     = 10
  Parent        = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

However, the new account is shown in `sshare` at the root level, i.e., with no parent:

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
clcsci48300417                          10    0.500000           0      0.000000            
root                                          0.000000 12020693368      1.000000            
 root                      root          1    0.000093           0      0.000000   0.952484 
 csu                                  2100    0.195695  1408035747      0.117134            
  csu-general                           20    0.181818   231969748      0.164747            
  csu-projects                          80    0.727273           0      0.000000            
  csu-testing                           10    0.090909  1176065999      0.835253            
 csu-summit-akr                        100    0.009319           0      0.000000            
 csu-summit-bio                         40    0.003728           0      0.000000            
 csu-summit-cfd                         20    0.001864     5868326      0.000488            
 csu-summit-crw                        240    0.022365   523982562      0.043590            
 csu-summit-fhw                         40    0.003728           0      0.000000            
 csu-summit-hal                         50    0.004659           0      0.000000            
 csu-summit-mat                         40    0.003728           0      0.000000            
 rmacc                                 910    0.084801           0      0.000000            
  rmacc-general                         20    0.111111           0      0.000000            
  rmacc-projects                        80    0.444444           0      0.000000            
  rmacc-testing                         80    0.444444           0      0.000000            
 ucb                                  6210    0.578697  9602499989      0.798831            
  ucb-general                           20    0.198020  5773257189      0.601224            
  ucb-projects                          80    0.792079      617361      0.000064            
   tutorial1                            10    0.500000      617361      1.000000            
  ucb-testing                            1    0.009901  3828625438      0.398711            
 ucb-summit-eav                        120    0.011183       29741      0.000002 
  --- snip ---

When I delete the account, it seems like Slurm did think it was a child of ucb-projects:

[root@slurm5 ~]# sacctmgr delete account cluster=summit name=clcsci48300417 parent=ucb-projects 
 Deleting account associations...
  C = summit     A = clcsci48300417 of ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

It looks like it's making the account part of the ucb-projects organization, but not a child of the ucb-projects account:

[root@slurm5 ~]# sacctmgr show account clcsci48300417 
   Account                Descr                  Org 
---------- -------------------- -------------------- 
clcsci483+       clcsci48300417         ucb-projects 

[root@slurm5 ~]# sacctmgr show association Account=clcsci48300417 Format=Account,ParentID,ParentName
   Account Par ID   Par Name 
---------- ------ ---------- 

To compare:

[root@slurm5 ~]# sacctmgr show associations Account=csu-projects Format=Account,ParentID,ParentName,Fairshare
   Account Par ID   Par Name     Share 
---------- ------ ---------- --------- 
csu-proje+     13        csu        80 

Might be related to https://bugs.schedmd.com/show_bug.cgi?id=2537
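
For repeated checks like the two `sacctmgr show association` queries above, the parsable form of the output (the `-p` flag) is easier to compare than the column layout. The following is a minimal sketch, assuming the pipe-delimited format with a trailing `|` and the header names `Account`, `Par ID`, and `Par Name`; the function names are hypothetical, and the sample row is the csu-projects comparison rewritten by hand into that format. An empty result, as in the clcsci48300417 query, is reported as "missing".

```python
def parse_parsable(text):
    """Parse pipe-delimited `sacctmgr -p` style output into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []

    def split(ln):
        # Assumption: every line ends with one trailing "|"; drop only that one
        # so that genuinely empty trailing fields are preserved.
        if ln.endswith("|"):
            ln = ln[:-1]
        return ln.split("|")

    header = split(lines[0])
    return [dict(zip(header, split(ln))) for ln in lines[1:]]


def check_parent(text, account, expected_parent):
    """Return 'missing', 'ok', or 'wrong-parent' for the given account."""
    rows = [r for r in parse_parsable(text) if r.get("Account") == account]
    if not rows:
        return "missing"  # header only, as in the clcsci48300417 query
    # Other rows for the same account (for example user associations) may
    # carry a blank parent name, so one matching row is treated as enough.
    if any(r.get("Par Name") == expected_parent for r in rows):
        return "ok"
    return "wrong-parent"


# Hand-written sample in the assumed format, not captured output.
SAMPLE = "Account|Par ID|Par Name|\ncsu-projects|13|csu|\n"
HEADER_ONLY = "Account|Par ID|Par Name|\n"

print(check_parent(SAMPLE, "csu-projects", "csu"))
print(check_parent(HEADER_ONLY, "clcsci48300417", "ucb-projects"))
```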
Comment 1 Tim Wickberg 2017-04-10 14:26:47 MDT
That's definitely not a good sign; this tends to indicate issues with the association tree structure.

The most common cause of that is attempting to modify the database directly through mysql. Bug 2537 was most likely caused by them manually deleting users from the database, which corrupted the hierarchy.

Can you attach (or send directly to me through email if you'd rather not have it publicly attached to the bug) the output from 'scontrol show assoc' and your current slurm.conf file?
Comment 2 peter.ruprecht 2017-04-11 12:47:44 MDT
Tim,

(Working alongside Jonathon to add some possible additional info.)

Is the issue here possibly just with the way that sshare is presenting the tree?  Every other way I look at the clcsci48300417 account makes it seem like it has the correct associations.

Here's what I just did as a test:

# sacctmgr create account cluster=summit name=clcsci48300417 parent=ucb-projects
 Associations
  A = clcsci4830 C = summit    
 Settings
  Parent        = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
clcsci48300417                           1    0.090909           0      0.000000            
root                                          0.000000 12136407298      1.000000            
 root                      root          1    0.000093           0      0.000000   0.952586 
 csu                                  2100    0.195695  1703879163      0.140392            
  csu-general                           20    0.181818   231824664      0.136058            
  csu-projects                          80    0.727273           0      0.000000            
  csu-testing                           10    0.090909  1472054498      0.863942            
 csu-summit-akr                        100    0.009319           0      0.000000            
 csu-summit-bio                         40    0.003728           0      0.000000            
 csu-summit-cfd                         20    0.001864     5649832      0.000466            
 csu-summit-crw                        240    0.022365   504744609      0.041589            
 csu-summit-fhw                         40    0.003728           0      0.000000            
 csu-summit-hal                         50    0.004659           0      0.000000            
 csu-summit-mat                         40    0.003728           0      0.000000            
 rmacc                                 910    0.084801           0      0.000000            
  rmacc-general                         20    0.111111           0      0.000000            
  rmacc-projects                        80    0.444444           0      0.000000            
  rmacc-testing                         80    0.444444           0      0.000000            
 ucb                                  6210    0.578697  9284230410      0.764992            
  ucb-general                           20    0.198020  5635805992      0.607029            
  ucb-projects                          80    0.792079      592077      0.000064            
   tutorial1                            10    0.909091      592077      1.000000            
  ucb-testing                            1    0.009901  3647832340      0.392907            
 ucb-summit-eav                        120    0.011183       28329      0.000002            
 ucb-summit-gfd                         80    0.007455           0      0.000000 
   -- snip --

# sacctmgr show account clcsci48300417
   Account                Descr                  Org 
---------- -------------------- -------------------- 
clcsci483+       clcsci48300417         ucb-projects

# sacctmgr show association Account=clcsci48300417 Format=Account,ParentID,ParentName -p
Account|Par ID|Par Name|
clcsci48300417|17|ucb-projects|

As a comparison:

# sacctmgr show association Account=tutorial1 Format=Account,ParentID,ParentName -p
Account|Par ID|Par Name|
tutorial1|17|ucb-projects|
  -- snip --

Now I give clcsci48300417 the same fairshare as tutorial1:

# sacctmgr modify account where cluster=summit name=clcsci48300417 set fairshare=10

And sshare does now show that it and tutorial1 are sharing equally under ucb-projects:

# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
clcsci48300417                          10    0.500000           0      0.000000            
root                                          0.000000 12136756948      1.000000            
 root                      root          1    0.000093           0      0.000000   0.952586 
 csu                                  2100    0.195695  1703622328      0.140369            
  csu-general                           20    0.181818   231820816      0.136075            
  csu-projects                          80    0.727273           0      0.000000            
  csu-testing                           10    0.090909  1471801511      0.863925            
 csu-summit-akr                        100    0.009319           0      0.000000            
 csu-summit-bio                         40    0.003728           0      0.000000            
 csu-summit-cfd                         20    0.001864     5648861      0.000465            
 csu-summit-crw                        240    0.022365   504657863      0.041581            
 csu-summit-fhw                         40    0.003728           0      0.000000            
 csu-summit-hal                         50    0.004659           0      0.000000            
 csu-summit-mat                         40    0.003728           0      0.000000            
 rmacc                                 910    0.084801           0      0.000000            
  rmacc-general                         20    0.111111           0      0.000000            
  rmacc-projects                        80    0.444444           0      0.000000            
  rmacc-testing                         80    0.444444           0      0.000000            
 ucb                                  6210    0.578697  9283981302      0.764947            
  ucb-general                           20    0.198020  5636183903      0.607087            
  ucb-projects                          80    0.792079      591975      0.000064            
   tutorial1                            10    0.500000      591975      1.000000            
  ucb-testing                            1    0.009901  3647205422      0.392849            
 ucb-summit-eav                        120    0.011183       28325      0.000002            
 ucb-summit-gfd                         80    0.007455           0      0.000000     

But its location in the tree output from sshare is confusing.

Regards,
Pete
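
The misplacement described in this comment can be spotted mechanically: in the listings above, nesting is shown by leading spaces in the Account column, so any account other than root that starts in column 0 is displayed outside the tree. Below is a minimal sketch, assuming that indentation convention and a two-line header; the function names are hypothetical and the sample is an excerpt of the listing above with trailing spaces trimmed.

```python
def account_depths(sshare_text):
    """Return (account, depth) pairs, where depth is the count of leading spaces."""
    rows = []
    # Assumption: the first two lines are the column header and the dash rule.
    for line in sshare_text.splitlines()[2:]:
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip(" "))
        rows.append((line.split()[0], depth))
    return rows


def stray_top_level(sshare_text):
    """Accounts shown at depth 0 that are not the root account."""
    return [name for name, depth in account_depths(sshare_text)
            if depth == 0 and name != "root"]


SAMPLE = """\
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
clcsci48300417                           1    0.090909           0      0.000000
root                                          0.000000 12136407298      1.000000
 root                      root          1    0.000093           0      0.000000   0.952586
 ucb                                  6210    0.578697  9284230410      0.764992
  ucb-projects                          80    0.792079      592077      0.000064
   tutorial1                            10    0.909091      592077      1.000000
"""

print(stray_top_level(SAMPLE))
```

A non-empty result only says the display is off; it does not by itself show whether the stored association is wrong.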
Comment 3 peter.ruprecht 2017-04-11 12:50:59 MDT
Now when I add another account (called "justatest"), it shows up at the top of the sshare output, and clcsci48300417 is in the tree under ucb-projects as expected:

# sacctmgr create account cluster=summit name=justatest parent=ucb-projects
 Adding Account(s)
  justatest
 Settings
  Description     = Account Name
  Organization    = Parent/Account Name
 Associations
  A = justatest  C = summit    
 Settings
  Parent        = ucb-projects
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y


# sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
justatest                                1    0.047619           0      0.000000            
root                                          0.000000 12138563424      1.000000            
 root                      root          1    0.000093           0      0.000000   0.952586 
 csu                                  2100    0.195695  1702607801      0.140264            
  csu-general                           20    0.181818   231817806      0.136155            
  csu-projects                          80    0.727273           0      0.000000            
  csu-testing                           10    0.090909  1470789994      0.863845            
 csu-summit-akr                        100    0.009319           0      0.000000            
 csu-summit-bio                         40    0.003728           0      0.000000            
 csu-summit-cfd                         20    0.001864     5644979      0.000465            
 csu-summit-crw                        240    0.022365   504311030      0.041546            
 csu-summit-fhw                         40    0.003728           0      0.000000            
 csu-summit-hal                         50    0.004659           0      0.000000            
 csu-summit-mat                         40    0.003728           0      0.000000            
 rmacc                                 910    0.084801           0      0.000000            
  rmacc-general                         20    0.111111           0      0.000000            
  rmacc-projects                        80    0.444444           0      0.000000            
  rmacc-testing                         80    0.444444           0      0.000000            
 ucb                                  6210    0.578697  9283345808      0.764781            
  ucb-general                           20    0.198020  5638055408      0.607330            
  ucb-projects                          80    0.792079      591568      0.000064            
   clcsci48300417                       10    0.476190           0      0.000000            
   tutorial1                            10    0.476190      591568      1.000000            
  ucb-testing                            1    0.009901  3644698830      0.392606            
 ucb-summit-eav                        120    0.011183       28305      0.000002            
 ucb-summit-gfd                         80    0.007455           0      0.000000
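
The NormShares column in this listing is consistent with each account's RawShares being divided by the total RawShares of the accounts under the same parent. That is an inference from the numbers shown here, not a statement about how Slurm computes the value. Under that assumption, the figures only work out if justatest is counted as a child of ucb-projects alongside clcsci48300417 and tutorial1, even though it is printed at the top. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
def sibling_norm_shares(raw_shares):
    """Divide each account's raw shares by the sum over its siblings."""
    total = sum(raw_shares.values())
    return {name: shares / total for name, shares in raw_shares.items()}


# RawShares as shown in the sshare output above.
children_of_ucb_projects = {
    "clcsci48300417": 10,
    "tutorial1": 10,
    "justatest": 1,
}

for name, norm in sibling_norm_shares(children_of_ucb_projects).items():
    print(f"{name:<16} {norm:.6f}")
```

The same rule reproduces the earlier listings: 1 and 10 shares give 0.090909 and 0.909091, and 10 and 10 give 0.500000 each.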
Comment 5 Tim Wickberg 2017-04-18 20:26:02 MDT
Sorry for not getting back faster.

There definitely appears to be an issue with the internal accounts hierarchy. We'd added a way to rebuild that automatically in such cases - if you're able to manually restart slurmdbd with the -R flag (possibly in combination with -D -vvv to keep it in the foreground and increase the verbosity) that should be able to correct this automatically. (Although I'd recommend having a recent database snapshot at hand in case of issues.)

I'd certainly like to know what caused that, but unfortunately I suspect the cause of that is lost much further back in the logs somewhere.

- Tim
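
As a rough sketch of the restart suggested above, the following builds the slurmdbd command line from the flags named in this comment (-R, -D, -vvv) and keeps a copy of everything the process prints, so the verbose output is still available afterwards. The helper names and the log file name are hypothetical. It assumes that slurmdbd is on the PATH, that the running daemon has already been stopped by whatever service manager the site uses, that a database snapshot has been taken, and that foreground mode writes its log to stdout or stderr. Because -D keeps the daemon in the foreground, `run_and_log` would block until the process exits.

```python
import subprocess
import sys


def rebuild_argv(foreground=True, verbosity=3):
    """Build the slurmdbd command line using only the flags named above."""
    argv = ["slurmdbd", "-R"]
    if foreground:
        argv.append("-D")
    if verbosity > 0:
        argv.append("-" + "v" * verbosity)
    return argv


def run_and_log(argv, log_path):
    """Run argv, echoing combined stdout/stderr and saving a copy to log_path."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()


# Show the command that would be run; the call itself is left commented out.
print(" ".join(rebuild_argv()))
# run_and_log(rebuild_argv(), "slurmdbd-rebuild.log")
```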
Comment 6 Tim Wickberg 2017-05-03 10:56:19 MDT
(In reply to Tim Wickberg from comment #5)
Did you have a chance to restart slurmdbd with the -R flag to see if that'd resolve the issue?
Comment 7 Jonathon Anderson 2017-05-05 09:20:50 MDT
We recently restarted slurmdbd with -R, but we neglected to capture verbose log output to see if it actually rebuilt anything. We'll try to replicate the issue and see if it's still a problem.
Comment 8 Tim Wickberg 2017-05-17 10:40:10 MDT
(In reply to Jonathon Anderson from comment #7)
I'm guessing that's fixed this thus far? Is it okay to mark this resolved for now?
Comment 9 Jonathon Anderson 2017-05-17 11:44:11 MDT
Sure, you can close this. We haven't been able to reproduce the problem since running slurmdbd -R, but we'll let you know if it recurs.
Comment 10 Tim Wickberg 2017-05-17 11:47:20 MDT
(In reply to Jonathon Anderson from comment #9)
Okay, glad things have been working smoothly since then. Marking resolved/infogiven.