This is similar to bug 6792, and 5420.

On a cluster with Slurm 17.11.10:

[root@cluster1 ~]# sacctmgr show account where user=user1 withassoc format=user,account,qos,description%40
User       Account    QOS                  Descr
---------- ---------- -------------------- ----------------------------------------
user1      acct0010   long,normal          acct0010 description
user1      acct0020   long,normal          acct0020 description
user1      none       long                 default account, no job privileges

On a cluster with Slurm 18.08.6-2:

[root@cluster2 ~]# sacctmgr show account user=user1 withassoc format=user,account,qos,description%40
User       Account    QOS                  Descr
---------- ---------- -------------------- ----------------------------------------
           tl1                             top level 1 allocation
           tl2                             top level 2 allocation
           tl3                             top level 3 allocation
           acct0001                        acct0001 description
...
user1      acct0010   long,normal          acct0010 description
...
user1      acct0020   long,normal          acct0002 description
... 308 more accounts ...
           acct0328                        acct0328 description
           tl4                             top level 4 allocation
           tl5                             top level 5 allocation
user1      none       long                 default account, no job privileges
           tl6                             top level 6 allocation
           other                           other allocation
           root                            default root account

and we get similar results when:

trying to get the top level accounts:
sacctmgr -s show account where parent=root format=account

trying to get sub-accounts of tl3:
sacctmgr -s show account where parent=tl3 format=account

under Slurm 17.11.10 we get just the list of accounts with parent of root and tl3 and
under Slurm 18.08.6-2 we get the full list of accounts with no filtering based on the criteria we gave.

I can find no mention of this change in the release notes, so I must assume either this is a new undocumented feature, or a bug. Please let me know.

Thanks,
Mark
Hi Mark,

> This is similar to bug 6792,

Yes, I would say that this bug is a duplicate of bug 6792.

> and we get similar results when:
>
> trying to get the top level accounts:
> sacctmgr -s show account where parent=root format=account
>
> trying to get sub-accounts of tl3:
> sacctmgr -s show account where parent=tl3 format=account

Yes, it looks like a more generic problem.

> under Slurm 17.11.10 we get just the list of accounts with parent of root
> and tl3 and
> under Slurm 18.08.6-2 we get the full list of accounts with no filtering
> based on the criteria we gave.

Thanks for that bisect.

> I can find no mention of this change in the release notes, so I must assume
> either this is a new undocumented feature, or a bug.

Yes, it looks like a regression.
I'm working on it. We'll keep you updated.

Albert
Hi Mark,

To reduce the impact of this bug in the meantime, could you check whether these alternatives work for you?

> [root@cluster1 ~]# sacctmgr show account where user=user1 withassoc format=user,account,qos,description%40

[root@cluster1 ~]# sacctmgr show assoc where user=user1 format=user,account,qos,description%40

> trying to get the top level accounts:
> sacctmgr -s show account where parent=root format=account

sacctmgr show assoc where parent=root format=account

> trying to get sub-accounts of tl3:
> sacctmgr -s show account where parent=tl3 format=account

sacctmgr show assoc where parent=tl3 format=account

In my tests these alternatives produce the expected output, as in bug 6792.
Note that I am basically replacing "show account withassoc" (or "-s") with "show assoc".

I'm investigating why your original commands do not work as expected (and as they did before), but I hope these alternatives can help you in the meantime.

Albert
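For anyone scripting the workaround above, here is a minimal sketch, assuming Python 3.7+ and that sacctmgr is on the PATH, of how the "show assoc" query could be built and its output parsed. The function names (assoc_query, parse_parsable2, show_assoc) and the default field list are hypothetical and not taken from this ticket; --parsable2 and --noheader are standard sacctmgr options that give pipe-delimited output without the header row.

```python
import subprocess


def assoc_query(filters, fields=("user", "account", "qos")):
    """Build the argv for 'sacctmgr show assoc [where <filters>] format=<fields>'."""
    argv = ["sacctmgr", "--parsable2", "--noheader", "show", "assoc"]
    if filters:
        # Each filter becomes a key=value term after 'where', e.g. user=user1.
        argv.append("where")
        argv += [f"{key}={value}" for key, value in filters.items()]
    argv.append("format=" + ",".join(fields))
    return argv


def parse_parsable2(text, fields=("user", "account", "qos")):
    """Turn pipe-delimited sacctmgr output into a list of dicts keyed by field name."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(fields, line.split("|"))))
    return rows


def show_assoc(filters, fields=("user", "account", "qos")):
    """Run the query. Needs a working sacctmgr and slurmdbd, so it is not exercised here."""
    result = subprocess.run(assoc_query(filters, fields),
                            capture_output=True, text=True, check=True)
    return parse_parsable2(result.stdout, fields)
```

Keeping the argv builder and the parser separate from the subprocess call means both can be checked against canned output on a machine without Slurm installed.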
Hi Mark - I wanted to remind you that our severity levels are tied to the impact on the production system and as such are also tied to SLAs.

Severity 2 - High Impact
A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end-user interaction with the system.

https://www.schedmd.com/support.php

I am lowering the severity of this down to a three since this does not seem to be affecting user job submissions or being able to start daemons.

Best regards,
Jason
Well, I put this on a production system and it broke our production update of users and accounts and the display of user accounts. I then had to remove it and regress back to 17.11.12 and a backup of the database, so I would call that a pretty high impact.

We had decided to go with 18.08 on the advice of our partners, and because of problems with missing records in the 17.11 database that are causing a lot of runaway jobs and database issues, particularly with regard to usage metrics. We were hoping this would fix that.

So the fact that we cannot update users and accounts seems to me to be a pretty high impact, since we can't update to 18.08 and have to wait for a fix. Changing our user and account update programs is not a trivial amount of work. I had assumed that, since it had been more than 6 months since the release of this version, the major issues would have been worked out, and apparently they haven't.

So if you want to call this a medium-to-low impact problem that includes partial non-critical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds, so be it, but by my definition it is high impact, since I cannot use this version on our production systems.
Just to add my $.02 and hopefully some clarification.

We have an automated process that runs every 15 minutes on the clusters to update accounts and users. It uses the broken "sacctmgr ... where" syntax to compare any new users and accounts to what is in the slurm database. It then issues sacctmgr commands to make any necessary updates. We've been doing slurm accounts this way for 7 years, and it works perfectly. We also have a scripted command for users which uses the broken syntax to show them what accounts they have available on a cluster.

We had planned to upgrade to 18.08 on all our clusters this month because we have so many problems with our reporting and metrics and runaway jobs. Another lab that upgraded to 18.08 told us that their database problems have been largely resolved.

So now we are stuck - we either have to live with our ongoing database problems, which is an issue for us, or update to 18.08 and lose the automatic account and user updates, and lose the users' ability to see their accounts, both of which would cause a possible revolt with our users. ;-)

I will reinforce Mark's point that from Sandia's POV, this is a high priority issue. A patch for 18.08 that would allow us to move forward would be much appreciated.
Hi Mark, Carol,

Please see bug 6792 comment 7 for an explanation of why the command behaves like this.

Although it's true that the output of the same command changed from 17.11 to 18.08, the 18.08 output has actually always been the expected one. The previous output that you were relying on was flawed. So it's not actually a bug but expected behavior, and the workaround provided in comment 2 is actually the real "solution", the right way to obtain the output that you want.

We are sorry if this has caused trouble for your setup, and we do realize that the sacctmgr documentation needs to be improved (I'm working on it), but please note that we cannot guarantee that external tools that rely on the output of our commands keep working after a major upgrade. Although we highlight any major changes in RELEASE_NOTES and some minor corrections in NEWS, we also encourage you to always test major versions before putting them into production, especially if you have such external tools relying on command output. If something fails in your testing, we can provide even better support than once it is in production.

> We have an automated process that runs every 15 minutes on the clusters to
> update accounts and users. It uses the broken "sacctmgr ... where" syntax
> to compare any new users and accounts to what is in the slurm database.
> It then issues sacctmgr commands to make any necessary updates. We've been
> doing slurm accounts this way for 7 years, it works perfectly.

Thanks for the info!
Unfortunately I'm still not certain of the goal of the process: you query db updates from slurm to update.. the slurm db? Between different clusters I guess...?

> We had planned to upgrade to 18.08 on all our clusters this month because we
> have so many problems with our reporting and metrics and runaway jobs.
> Another lab who upgraded to 18.08 told us that their database problems have
> been largely resolved.
Although we still support 17.11, yes, we totally recommend upgrading to 18.08. Please note that we will release 19.05 soon.

> So now we are stuck - we either have to live with our ongoing database
> problems, which is an issue for us, or update to 18.08 and lose the
> automatic account and user updates, and lose the users' ability to see their
> accounts, both of which would cause a possible revolt with our users. ;-)

> I will reinforce Mark's point that from Sandia's POV, this is a high
> priority issue. A patch for 18.08 that would allow us to move forward would
> be much appreciated.

I hope that changing your update and user scripts to use "show assoc" instead of "show -s account" wouldn't be too complicated. Do you think so?

We totally recommend that you upgrade to 18.08. We won't provide any patch to restore the old, flawed behavior.

The other (bad) alternative that you have, but that WE DON'T RECOMMEND NOR SUPPORT (and I shouldn't even mention here), is that you are always free to patch the code yourself, reverting the commit that changed / fixed the old behavior (da49b8d0d14f1e2def06f2c22a14acb22a733153). It's really small... Again, WE DON'T RECOMMEND NOR SUPPORT reverting that commit, and I shouldn't even mention it. Maybe it breaks other stuff; it's not supported, not recommended, and may not work, but you can still try it if you are really stuck and it works for you. But please, even if it works (I don't know!), update your scripts asap and restore the supported code.

If that is OK with you, I will close this bug as a duplicate of 6792, but I will keep that one open to improve (a lot) the documentation of sacctmgr to avoid future confusion on this topic.

Regards,
Albert
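The advice above about testing external tools before a major upgrade can be made concrete for filtered queries. A minimal sketch, assuming the tool has already parsed sacctmgr output into one dict per row (the key names and the function name rows_ignoring_filter are hypothetical): it returns every row that does not match the filter that was requested, so a listing like the unfiltered 18.08 output reported in this ticket would be flagged.

```python
def rows_ignoring_filter(rows, field, expected):
    """Return the rows whose `field` differs from the value that was filtered on."""
    return [row for row in rows if row.get(field) != expected]


# Sample rows modeled on the output reported in this ticket for user=user1.
filtered_output = [
    {"user": "user1", "account": "acct0010"},
    {"user": "user1", "account": "none"},
]
unfiltered_output = [
    {"user": "", "account": "tl1"},
    {"user": "user1", "account": "acct0010"},
    {"user": "", "account": "root"},
]

# An empty result means the query honored the filter.
bad_before = rows_ignoring_filter(filtered_output, "user", "user1")
bad_after = rows_ignoring_filter(unfiltered_output, "user", "user1")
```

Run against a test cluster on the new version, a non-empty result is a cheap signal that a script depends on behavior that has changed.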
Albert,

We will give the show assoc version a try.

Documentation is a problem, and not just with sacctmgr. Information that is given to users in this forum doesn't seem to make its way back to the documentation. Digging through bugs to try to figure out what is going on is not a very effective use of time.

Given that this is a change in behavior of sacctmgr between 17.11 and 18.08, some information about it should have been in the release notes. When we first saw this behavior, the release notes were the first place we looked.

> Thanks for the info!
> Unfortunately I'm still not certain of the goal of the process, you query db updates
> from slurm to update.. the slurm db? Between different clusters I guess...?

To answer this, we have a script that queries the current account and association status in the slurmdb (using sacctmgr show ... commands), and compares that to a file that has the desired current account/user relationships. When it finds differences, it changes the slurm accounts via sacctmgr commands. We know we could use a different mechanism, but this has worked very well for us for a long time.

Thanks,
Carol
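The compare-and-update process described above can be sketched as a set difference between desired and current (user, account) pairs. This is a minimal illustration, not the site's actual script, which is not shown in this ticket: the function name plan_updates is hypothetical, and the add/delete invocations are an assumption about how such a tool might call sacctmgr (-i is sacctmgr's --immediate option, which skips the confirmation prompt).

```python
def plan_updates(desired, current):
    """Return sacctmgr argv lists that would bring `current` in line with `desired`.

    Both arguments are sets of (user, account) tuples.
    """
    commands = []
    # Pairs wanted but not yet in the database: add the user to the account.
    for user, account in sorted(desired - current):
        commands.append(["sacctmgr", "-i", "add", "user",
                         f"name={user}", f"account={account}"])
    # Pairs in the database but no longer wanted: remove that association.
    for user, account in sorted(current - desired):
        commands.append(["sacctmgr", "-i", "delete", "user",
                         f"name={user}", f"account={account}"])
    return commands


desired = {("user1", "acct0010"), ("user1", "acct0020")}
current = {("user1", "acct0010"), ("user1", "none")}
plan = plan_updates(desired, current)
```

Returning the commands instead of executing them lets the plan be logged or reviewed first. The sketch is only as reliable as the current set, which is why a query that silently ignores its filter is a serious problem for this kind of tool.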
Hi Carol,

> We will give the show assoc version a try.

I hope that refactor goes smoothly. Please let us know if you face any other issue.

> Documentation is a problem, and not just with sacctmgr. Information that is
> given to users in this forum doesn't seem to make its way back to the
> documentation.

Sure. We always try to improve the documentation, and we are happy to receive any suggestion and contribution to make it better. If you run "git log -- doc" you will see that we are constantly improving it. But yes, we can certainly do better! As mentioned, I'm now working on the sacctmgr manpage; let's see how it ends up on bug 6792.

> Digging through bugs to try to figure out what is going on is
> not a very effective use of time.

Sure, and that's why we try to have good manpages and websites documenting the most important topics and details of Slurm as well as possible. This support channel is there to complement all of that, and to save you time and trouble.

> Given that this is a change in behavior of sacctmgr between 17.11 and 18.08,
> some information about it should have been in the release notes. When we
> first saw this behavior, the release notes were the first place we looked.

Although you have a point here because the behavior changed, it is also true that we did not intend *to change it* but *to fix it*. In a way any fix changes some behavior, right? We are sorry that you were relying on that flawed behavior (which was there for so long). It probably doesn't qualify for the RELEASE_NOTES, but you are right that it probably qualifies for a better entry in the NEWS. And in the docs, yes.

I think the real problem is that the right behavior is a little bit strange and the documentation is not good enough to explain it, and that led you to use the wrong queries in the first place.

> To answer this, we have a script that queries the current account and
> association status in the slurmdb (using sacctmgr show ... commands), and
> compares that to a file that has the desired current account/user
> relationships. When it finds differences, it changes the slurm accounts via
> sacctmgr commands. We know we could use a different mechanism, but this has
> worked very well for us for a long time.

Now I get it! So you have some kind of file in a DSL of your own or similar; you change only that file, and your scripts query the slurmdb, compare it with your file, and update the slurmdb accordingly. Cool.

I'm closing the bug as a duplicate of bug 6792.

Thanks,
Albert

*** This ticket has been marked as a duplicate of ticket 6792 ***
While, as you might expect, I'm not real happy about the outcome, I do appreciate your prompt attention to this issue and thorough explanation. Thank you, Albert.

I hope in the future those responsible for documenting releases will be more sensitive and forthcoming with changes like this. Long-standing behavior, whether viewed as correct or not, over time becomes a de facto standard and implicit requirement, especially when not well documented, as in this case.

While I understand your desire to correct and "fix" behavior, and that you take no responsibility for third-party software, I would hope that in the interest of community relations you will be more sensitive to the fact that changes like this cause problems for the user community. At the very least these changes need to be documented and advertised ahead of release.

Thank you,
Mark
Hi Mark, Carol,

Let me recommend that you upgrade from 18.08.6 to 18.08.7, which we released last week.
The main reason is the regression that we introduced while working on bug 5717 and fixed in bug 6697. I'm pretty sure that you, or your external tools, could be impacted by it.

Hope that helps,
Albert