Ticket 18150 - default accounts problem after upgrade from 21.08.8-2 to 22.05.10
Summary: default accounts problem after upgrade from 21.08.8-2 to 22.05.10
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 22.05.10
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2023-11-08 08:02 MST by MarkD
Modified: 2023-11-08 08:06 MST
Site: -Other-
Linux Distro: Rocky Linux


Description MarkD 2023-11-08 08:02:15 MST
This is a duplicate of issues #17270 and #17980, but I'm filing it as a separate ticket because we don't have a support contract.

Since upgrading from 21.08.8-2 to 22.05.10, we've been seeing a problem where, sometimes, changing a user's account and defaultaccount causes that user's subsequent sbatch commands to be rejected with the message:

  sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

The associated errors in slurmctld.log are:

  error: User foo(1212) doesn't have a default account
  _job_create: invalid account or partition for user 1212, account '(null)', and partition 'shared'
  _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
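
To see which users have hit this, one could scan the controller log for the first of those error lines. The following is a minimal sketch: the function name `affected_users` is hypothetical, the log path is passed in rather than assumed, and the pattern is built only from the message text quoted above (any timestamp prefix on the line is ignored).

```shell
# Print the unique user names that appear in
# "User NAME(UID) doesn't have a default account" errors.
# $1: path to the slurmctld log file (site-specific, supplied by the caller)
affected_users() {
    # Capture NAME from "error: User NAME(UID) doesn't have a default account"
    sed -n "s/.*error: User \([^(]*\)([0-9]*) doesn't have a default account.*/\1/p" "$1" | sort -u
}
```

For example, `affected_users /path/to/slurmctld.log` would print one affected user name per line, with the path replaced by wherever the site writes its slurmctld log.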

In our environment, we regularly migrate users between accounts called "active" and "inactive", and we have seen quite a few instances of this, starting a day or two after the upgrade.

I managed to reproduce it for a user with the commands (replicating our migration script):

  sacctmgr remove user foo
  sacctmgr -i add user foo account=active,inactive fairshare=1000 defaultaccount=active
  sacctmgr -i modify user foo set defaultaccount=inactive
  sacctmgr -i remove user foo account=active
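
After a sequence like this, it can help to check which default account sacctmgr reports for the user. A minimal sketch, assuming sacctmgr's `-n` (no header) and `-P` (pipe-delimited) output options; the function name `db_default_account` and the `SACCTMGR` override variable are hypothetical and exist only to make the helper easy to stub:

```shell
# Command to run; overridable for testing (hypothetical variable name).
SACCTMGR="${SACCTMGR:-sacctmgr}"

# Print the default account that sacctmgr reports for user $1.
db_default_account() {
    # Output has the form "user|defaultaccount"; keep the second field.
    "$SACCTMGR" -n -P show user "$1" format=User,DefaultAccount | cut -d'|' -f2
}
```

Running `db_default_account foo` after the commands above would be expected to print `inactive` if the modification was recorded. This only shows what sacctmgr reports; it does not establish what slurmctld holds at the time of the failed submission.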

The problem went away when our migration script executed a "sacctmgr add user" (on a different user) some time later, as in the ARCHER2 case #17980. Repeating the above remove/add/modify/remove commands for user foo didn't reproduce the problem again.

One might consider it fixed, except that we've seen this happen on 15 different occasions since the upgrade.

We can probably mitigate this by issuing a `sacctmgr add user` for a fake user after our migration script runs, but I thought you'd appreciate the report, as this seems to affect more than just our site.
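
A rough sketch of that mitigation, to be called at the end of the migration script. The function name, the `SACCTMGR` override variable, and the placeholder user name are all hypothetical. Removing the placeholder afterwards is an added assumption, made so that each run performs a fresh add; whether the workaround is reliable has not been verified.

```shell
# Command to run; overridable for testing (hypothetical variable name).
SACCTMGR="${SACCTMGR:-sacctmgr}"

# Add, then remove, a placeholder user so that a "sacctmgr add user"
# follows every migration run.
# $1: placeholder user name (hypothetical default: slurm_placeholder)
# $2: existing account to attach the placeholder to (default: inactive)
add_placeholder_user() {
    placeholder_user="${1:-slurm_placeholder}"
    placeholder_account="${2:-inactive}"
    # The add is the step that appeared to clear the problem in our case.
    "$SACCTMGR" -i add user "$placeholder_user" account="$placeholder_account" || return 1
    # Assumption: remove the placeholder so the next run's add is a new user again.
    "$SACCTMGR" -i remove user "$placeholder_user"
}
```

Calling `add_placeholder_user` with no arguments uses the defaults shown; a site would pick a name that cannot collide with a real user.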