Ticket 6813

Summary: Problems with sacctmgr show account
Product: Slurm
Reporter: Mark Schmitz <mschmit>
Component: User Commands
Assignee: Albert Gil <albert.gil>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: albert.gil, cvalvin, felip.moll, jbooth, mschmit, tim
Version: 18.08.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6792
Site: Sandia National Laboratories

Description Mark Schmitz 2019-04-08 17:03:56 MDT
This is similar to bugs 6792 and 5420.

On a cluster with Slurm 17.11.10:

[root@cluster1 ~]# sacctmgr show account where user=user1 withassoc format=user,account,qos,description%40
      User    Account                  QOS                                    Descr
---------- ---------- -------------------- ----------------------------------------
     user1   acct0010          long,normal                     acct0010 description
     user1   acct0020          long,normal                     acct0020 description
     user1       none                 long       default account, no job privileges
   
On a cluster with Slurm 18.08.6-2:

[root@cluster2 ~]# sacctmgr show account user=user1 withassoc format=user,account,qos,description%40
      User    Account                  QOS                                    Descr
---------- ---------- -------------------- ----------------------------------------
                  tl1                                        top level 1 allocation
                  tl2                                        top level 2 allocation
                  tl3                                        top level 3 allocation
             acct0001                                          acct0001 description
               ...
     user1   acct0010          long,normal                     acct0010 description
               ...
     user1   acct0020          long,normal                     acct0002 description
               ...
            308 more accounts
               ...
             acct0328                                          acct0328 description
                  tl4                                        top level 4 allocation
                  tl5                                        top level 5 allocation
     user1       none                 long       default account, no job privileges
                  tl6                                        top level 6 allocation
                other                                              other allocation
                 root                                          default root account

and we get similar results when:

trying to get the top level accounts:
sacctmgr -s show account where parent=root format=account

trying to get sub-accounts of tl3:
sacctmgr -s show account where parent=tl3 format=account

under Slurm 17.11.10 we get just the list of accounts with parent of root and tl3 and
under Slurm 18.08.6-2 we get the full list of accounts with no filtering based on the criteria we gave.

I can find no mention of this change in the release notes, so I must assume either this is a new undocumented feature, or a bug.

Please let me know.

Thanks,
Mark
Comment 1 Albert Gil 2019-04-09 03:23:54 MDT
Hi Mark,

> This is similar to bug 6792,

Yes, I would say that this bug is a duplicate of bug 6792.

> and we get similar results when:
> 
> trying to get the top level accounts:
> sacctmgr -s show account where parent=root format=account
> 
> trying to get sub-accounts of tl3:
> sacctmgr -s show account where parent=tl3 format=account

Yes, it looks like a more generic problem.

> under Slurm 17.11.10 we get just the list of accounts with parent of root
> and tl3 and
> under Slurm 18.08.6-2 we get the full list of accounts with no filtering
> based on the criteria we gave.

Thanks for that bisect.

> I can find no mention of this change in the release notes, so I must assume
> either this is a new undocumented feature, or a bug.

Yes, it looks like a regression.
I'm working on it.

We'll keep you updated.
Albert
Comment 2 Albert Gil 2019-04-09 04:02:38 MDT
Hi Mark,

To try to reduce the severity of the bug, could you please check whether these alternatives work for you?

> [root@cluster1 ~]# sacctmgr show account where user=user1 withassoc format=user,account,qos,description%40

[root@cluster1 ~]# sacctmgr show assoc where user=user1 format=user,account,qos,description%40

> trying to get the top level accounts:
> sacctmgr -s show account where parent=root format=account

sacctmgr show assoc where parent=root format=account

> trying to get sub-accounts of tl3:
> sacctmgr -s show account where parent=tl3 format=account

sacctmgr show assoc where parent=tl3 format=account


In my tests these alternatives provide the expected output, as in bug 6792.
Note that I'm basically replacing "show account withassoc" (or "-s") with "show assoc".

I'm investigating why your original commands are not working as expected (and as before), but I hope these alternatives can help you in the meantime.
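
For a script that consumes this output, the "show assoc" form can be sketched as follows. This is a minimal Python sketch and not part of the original exchange: the helper names (build_assoc_query, parse_parsable2, query_user_assocs) are hypothetical, it assumes sacctmgr's -n (no header) and -P (parsable2, pipe-delimited) options, and SAMPLE is illustrative data modeled on the 17.11.10 output above rather than a captured log.

```python
import subprocess

# Fields requested from sacctmgr; the same tuple drives query and parsing.
FIELDS = ("user", "account", "qos")


def build_assoc_query(user, fields=FIELDS):
    """Return the argv for the 'show assoc' form of the per-user query."""
    return ["sacctmgr", "-n", "-P", "show", "assoc", "where",
            f"user={user}", "format=" + ",".join(fields)]


def parse_parsable2(text, fields=FIELDS):
    """Turn pipe-delimited lines into one dict per association."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(fields, line.split("|"))))
    return rows


def query_user_assocs(user):
    """Run the query; needs a working sacctmgr, so it is not called here."""
    result = subprocess.run(build_assoc_query(user), check=True,
                            capture_output=True, text=True)
    return parse_parsable2(result.stdout)


# Illustrative input only (assumed shape of -n -P output).
SAMPLE = "user1|acct0010|long,normal\nuser1|acct0020|long,normal\nuser1|none|long\n"

for row in parse_parsable2(SAMPLE):
    print(row["user"], row["account"], row["qos"])
```

Keeping the command construction separate from the parsing means a change like this one touches only a single function.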


Albert
Comment 3 Jason Booth 2019-04-09 09:25:25 MDT
Hi Mark - I wanted to remind you that our severity levels are tied to the impact on the production system and, as such, are also tied to SLAs.


Severity 2 — High Impact

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end-user interaction with the system.

https://www.schedmd.com/support.php

I am lowering the severity of this to 3 since it does not seem to affect user job submissions or the ability to start daemons.

Best regards,
Jason
Comment 4 Mark Schmitz 2019-04-09 10:05:26 MDT
Well, I put this on a production system and it broke our production update of users and accounts and the display of user accounts. I then had to remove it and go back to 17.11.12 and a backup of the database, so I would call that a pretty high impact. We had decided to go with 18.08 on the advice of our partners, and because of problems with missing records in the 17.11 database that are causing a lot of runaway jobs and database issues, particularly with regard to usage metrics; we were hoping this would fix that.

So the fact that we cannot update users and accounts seems to me to be a pretty high impact, since we can't upgrade to 18.08 and have to wait for a fix. Changing our user and account update programs is not a trivial amount of work. I had assumed that, since it had been more than 6 months since the release of this version, the major issues would have been worked out, and apparently they haven't.

So if you want to call this a medium-to-low impact problem that includes partial non-critical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds, so be it, but by my definition it is high impact, since I cannot use this version on our production systems.
Comment 11 Carol Alvin 2019-04-10 10:45:17 MDT
Just to add my $.02 and hopefully some clarification.

We have an automated process that runs every 15 minutes on the clusters to update accounts and users.  It uses the broken "sacctmgr ... where" syntax to compare any new users and accounts to what is in the slurm database.  It then issues sacctmgr commands to make any necessary updates. We've been doing slurm accounts this way for 7 years, and it works perfectly.

We also have a scripted command for users which uses the broken syntax to show them what accounts they have available on a cluster.

We had planned to upgrade to 18.08 on all our clusters this month because we have so many problems with our reporting and metrics and runaway jobs. Another lab who upgraded to 18.08 told us that their database problems have been largely resolved.

So now we are stuck - we either have to live with our ongoing database problems, which is an issue for us, or update to 18.08 and lose the automatic account and user updates, and lose the users' ability to see their accounts, both of which would cause a possible revolt with our users.  ;-)

I will reinforce Mark's point that from Sandia's POV, this is a high priority issue.  A patch for 18.08 that would allow us to move forward would be much appreciated.
Comment 12 Albert Gil 2019-04-10 13:24:45 MDT
Hi Mark, Carol,

Please see bug 6792 comment 7 for an explanation of why the command behaves like this.
Although it's true that the output of the same command changed from 17.11 to 18.08, the latter has always been the expected one. The previous output that you were relying on was flawed.
So it's not actually a bug but expected behavior.
And the workaround provided in comment 2 is actually the real "solution", the right way to obtain the output that you want.

We are sorry if this has caused trouble for your setup, and we do realize that the documentation of sacctmgr needs to be improved (I'm working on it), but please note that we cannot guarantee that external tools that rely on the output of our commands will keep working after a major upgrade. Although we highlight any major changes in RELEASE_NOTES and some minor corrections in the NEWS, we also encourage you to always test major versions before putting them into production, especially if you have such external tools relying on command output. If something fails in your testing, we can provide even better support than once it's in production.

> We have an automated process that runs every 15 minutes on the clusters to
> update accounts and users.  It uses the broken "sacctmgr ... where" syntax
> to compare any new users and accounts to what is in the slurm database.
> It then issues sacctmgr commands to make any necessary updates. We've been
> doing slurm accounts this way for 7 years, and it works perfectly.

Thanks for the info!
Unfortunately I'm still not certain of the goal of the process, you query db updates from slurm to update.. the slurm db? Between different clusters I guess...?

> We had planned to upgrade to 18.08 on all our clusters this month because we
> have so many problems with our reporting and metrics and runaway jobs.
> Another lab who upgraded to 18.08 told us that their database problems have
> been largely resolved.

Although we still support 17.11, yes, we strongly recommend upgrading to 18.08.
Please note that we will release 19.05 soon.

> So now we are stuck - we either have to live with our ongoing database
> problems, which is an issue for us, or update to 18.08 and lose the
> automatic account and user updates, and lose the users' ability to see their
> accounts, both of which would cause a possible revolt with our users.  ;-)
> I will reinforce Mark's point that from Sandia's POV, this is a high
> priority issue.  A patch for 18.08 that would allow us to move forward would
> be much appreciated.

I hope that changing your update and user scripts to use "show assoc" instead of "show -s account" won't be too complicated. What do you think?
We strongly recommend that you upgrade to 18.08.
We won't provide any patch to restore the old, flawed behavior.

The other (bad) alternative that you have, but which WE DON'T RECOMMEND NOR SUPPORT (and I shouldn't even mention here), is that you are always free to patch the code yourself by reverting the commit that changed/fixed the old behavior (da49b8d0d14f1e2def06f2c22a14acb22a733153). It's really small...
Again, WE DON'T RECOMMEND NOR SUPPORT reverting that commit, and I shouldn't even mention it. Maybe it breaks other stuff; it's not supported, not recommended, and may not work. But you can try it if you are really stuck and it works for you.
But please, even if it works (I don't know!), update your scripts asap and restore the supported code.


If it is OK with you, I will close this bug as a duplicate of 6792, but I will keep that one open to improve (a lot) the documentation of sacctmgr and avoid future confusion on this topic.

Regards,
Albert
Comment 13 Carol Alvin 2019-04-10 14:15:43 MDT
Albert,

We will give the show assoc version a try.  

Documentation is a problem, and not just with sacctmgr.  Information that is given to users in this forum doesn't seem to make its way back to the documentation. Digging through bugs to try to figure out what is going on is not a very effective use of time.

Given that this is a change in behavior of sacctmgr between 17.11 and 18.08, some information about it should have been in the release notes.  When we first saw this behavior, the release notes were the first place we looked.  

> Thanks for the info!
> Unfortunately I'm still not certain of the goal of the process, you query db updates
> from slurm to update.. the slurm db? Between different clusters I guess...?

To answer this, we have a script that queries the current account and association status in the slurmdb (using sacctmgr show ... commands), and compares that to a file that has the desired current account/user relationships.  When it finds differences, it changes the slurm accounts via sacctmgr commands.  We know we could use a different mechanism, but this has worked very well for us for a long time.
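
The compare-and-update step described here can be sketched roughly as below. This is a hedged Python illustration only: the actual script is not shown in this ticket, the function name plan_changes and the account acct0030 are hypothetical, and the generated commands assume the standard "sacctmgr -i add user name=... account=..." and "sacctmgr -i delete user name=... account=..." forms.

```python
def plan_changes(desired, current):
    """Return the sacctmgr commands that would make `current` match `desired`.

    Both arguments are iterables of (user, account) pairs. Nothing is
    executed; the commands are only returned as strings for review.
    """
    desired, current = set(desired), set(current)
    to_add = sorted(desired - current)      # wanted but missing in slurmdb
    to_remove = sorted(current - desired)   # present in slurmdb but unwanted
    commands = [f"sacctmgr -i add user name={user} account={account}"
                for user, account in to_add]
    commands += [f"sacctmgr -i delete user name={user} account={account}"
                 for user, account in to_remove]
    return commands


# Hypothetical data: the desired state comes from the site's own file,
# the current state from a "show assoc" query.
desired = [("user1", "acct0010"), ("user1", "acct0030")]
current = [("user1", "acct0010"), ("user1", "acct0020")]

for command in plan_changes(desired, current):
    print(command)
```

Returning the commands instead of running them makes it easy to dry-run the plan after an upgrade, before anything in slurmdb is changed.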

Thanks,  Carol
Comment 14 Albert Gil 2019-04-11 03:20:21 MDT
Hi Carol,

> We will give the show assoc version a try.

I hope the refactoring goes smoothly.
Please let us know if you run into any other issues.

> Documentation is a problem, and not just with sacctmgr.  Information that is
> given to users in this forum doesn't seem to make its way back to the
> documentation.

Sure.
We always try to improve the documentation, and we are happy to receive any suggestions and contributions to make it better.
If you run "git log -- doc" you will see that we are constantly improving it.
But yes, we can certainly do better!
As mentioned, I'm now working on the sacctmgr manpage; let's see how it turns out in bug 6792.

> Digging through bugs to try to figure out what is going on is
> not a very effective use of time.

Sure, and that's why we try to have good manpages and websites documenting as well as possible the most important topics and details of Slurm.
This support channel is meant to complement all of that and save you time and trouble.

> Given that this is a change in behavior of sacctmgr between 17.11 and 18.08,
> some information about it should have been in the release notes.  When we
> first saw this behavior, the release notes were the first place we looked.

Although you have a point here because the behavior changed, it is also true that we did not intend *to change it* but *to fix it*.
In a way any fix is changing some behavior, right?
We are sorry that you were relying on that flawed behavior (which was there for so long).
It probably doesn't qualify for the RELEASE_NOTES, but you are right that it probably qualifies for a better entry in the NEWS. And the docs, yes.
I think that the real problem is that the right behavior is a little bit strange and the documentation is not good enough to explain it, and that led you to use the wrong queries in the first place.
  
> To answer this, we have a script that queries the current account and
> association status in the slurmdb (using sacctmgr show ... commands), and
> compares that to a file that has the desired current account/user
> relationships.  When it finds differences, it changes the slurm accounts via
> sacctmgr commands.  We know we could use a different mechanism, but this has
> worked very well for us for a long time.

Now I get it!
So you have some kind of file in a DSL of your own or similar; you change only that file, and your scripts update the Slurm database by querying slurmdb, comparing it with your file, and updating slurmdb accordingly.
Cool.

I'm closing the bug as a duplicate of bug 6792.
Thanks,
Albert

*** This ticket has been marked as a duplicate of ticket 6792 ***
Comment 16 Mark Schmitz 2019-04-11 08:56:32 MDT
While, as you might expect, I'm not really happy about the outcome, I do appreciate your prompt attention to this issue and thorough explanation. Thank you, Albert.

I hope in the future those responsible for documenting releases will be more sensitive and forthcoming with changes like this. Long-standing behavior, whether viewed as correct or not, over time becomes a de facto standard and implicit requirement, especially when not well documented, as in this case. While I understand your desire to correct and "fix" behavior, and that you take no responsibility for third-party software, I would hope that, in the interest of community relations, you will be more sensitive to the fact that changes like this cause problems for the user community. At the very least these changes need to be documented and advertised ahead of release.

Thank you,
Mark
Comment 17 Albert Gil 2019-04-16 04:25:58 MDT
Hi Mark, Carol,

Let me recommend that you upgrade from 18.08.6 to 18.08.7, which we released last week.
The main reason is a regression that we introduced while working on bug 5717 and fixed in bug 6697.

I'm pretty sure that you, or your external tools, could be impacted by it.

Hope that helps,
Albert