Ticket 17270

Summary: upgrade from 21.08.8 to 23.02.3 resulted in issue with default accounts
Product: Slurm Reporter: Jake Rundall <rundall>
Component: Accounting    Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bglick, david.gloe, mark.c.dixon
Version: 23.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=18296
Site: NCSA Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Slurm DB schema (no data)
slurm_acct_db.schema.2023.09.27.sql — before resolution
slurm.conf.mf.2023.10.31
scontrol.show.assoc.before
scontrol.show.assoc.after
logs-2023.11.09

Description Jake Rundall 2023-07-24 12:28:52 MDT
We upgraded a cluster from 21.08.8 to 23.02.3 last week and encountered an issue afterward: users' default accounts still appeared to be set (according to 'sacctmgr show user'), but Slurm acted as if they were not when users submitted (as they'd always done in the past) w/o specifying an account:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Does SchedMD have any thoughts on this? We'll have a number of upgrades to 23.02 in the coming months, although I think they'll all be from Slurm 22.05. Perhaps the issue we encountered was specific to the jump from 21.08 to 23.02, skipping over 22.05, so maybe we won't run into it again.

I'll note that our resolution was resetting/reapplying the default account for each user, except for one test user, using a command like this:
sacctmgr modify user where user="username" set defaultaccount=default

The situation ended up being resolved for the test user w/o any direct intervention that we can think of. So perhaps the change to the other users fixed some general DB table issue?
Comment 1 Jason Booth 2023-07-24 14:21:14 MDT
Do you have the older schema still available to you, and would it be possible to share that with us?
Comment 2 Jake Rundall 2023-07-24 14:34:00 MDT
We do have a backup. Am I correct that you're just asking for the pre-upgrade schema w/o any data?
Comment 3 Jason Booth 2023-07-24 15:00:45 MDT
> We do have a backup. Am I correct that you're just asking for the pre-upgrade schema w/o any data?

Correct
Comment 4 Tim McMullan 2023-07-26 10:37:10 MDT
Hi Jake,

Can you provide the number of users, accounts, and associations in your system?

Thanks!
--Tim
Comment 5 Jake Rundall 2023-07-26 11:17:24 MDT
users: 868
accounts: 2
associations: 5976

I'm working to get the pre-upgrade schema for you. It's in a pretty restricted environment so there are some hoops to jump through.
Comment 6 Tim McMullan 2023-07-26 11:19:46 MDT
Hey Jake,

If it's that difficult, I'll take your word for it for now that the schema was not modified.  If you were running any patches on top of 21.08, that would be good to know though!

Thanks!
--Tim
Comment 7 Jake Rundall 2023-07-26 11:24:56 MDT
Nope, no patches. Would have been 21.08.8-2.
Comment 8 Jake Rundall 2023-07-26 13:19:49 MDT
Created attachment 31462 [details]
Slurm DB schema (no data)

I'm attaching the schema (the only thing I've changed after exporting is the cluster name).
Comment 9 Tim McMullan 2023-08-25 07:18:33 MDT
Hey Jake,

I've been looking at this and having a hard time reproducing it.  One thing I didn't ask before is what version of MySQL you were running at the time of the upgrade; would you mind letting me know?  Maybe there is a bad interaction during the upgrade that I need to look at.

Thanks!
-Tim
Comment 10 Jake Rundall 2023-08-25 09:18:44 MDT
Thanks! We were running MariaDB 10.6.14 at the time of the Slurm upgrade.

We've got at least a couple of clusters running Slurm 23.02 that didn't run into the issue, although I believe they both went from 22.05 rather than jumping from 21.08.
Comment 11 Tim McMullan 2023-08-29 05:44:58 MDT
(In reply to Jake Rundall from comment #10)
> Thanks! We were running MariaDB 10.6.14 at the time of the Slurm upgrade.
> 
> We've got at least a couple of clusters running Slurm 23.02 that didn't run
> into the issue, although I believe they both went from 22.05 rather than
> jumping from 21.08.

Ok, interesting, thank you!  I'll match that MariaDB version just in case and see if I can track it down!
Comment 12 Jake Rundall 2023-09-27 16:11:53 MDT
We encountered this issue again today on the same cluster. It's a bit unclear what the trigger was but I'll lay out the scenario.

Note that we updated from 23.02.3 to 23.02.4 a few weeks ago.

As of last night things were working normally — users could submit without specifying any account because they are all associated with an account named 'default' which is set as their default account.

We had shared filesystem (GPFS) problems overnight. This was related to a storage-side IB fabric issue; the scheduler/login nodes/compute nodes do not connect to that fabric, it's simply to connect IO nodes and storage controllers to each other; the scheduler's DB and saved state are on local disk and the scheduler doesn't even mount GPFS.

This morning after the GPFS backend issue was resolved we rebooted the rest of the cluster. The scheduler probably didn't need a reboot since it doesn't even mount GPFS, but it was rebooted anyway. So there was a restart of slurmctld, slurmdbd, and mariadb. And reboots of most or all compute nodes and login nodes.

After the scheduler rebooted, our custom Slurm DB user mgmt script ran and removed several (44) users from the DB, using commands like this (these were just routine changes due to users being removed from LDAP groups):
sacctmgr -i delete user name=USERNAME

We put the cluster back in service (marked nodes back online, etc.).

Around or shortly after this point, users noted that they couldn't submit jobs and we verified this to be the case for us admins as well, e.g.:
[m189274@hn4 ~]$ srun -p cpu-short -t 0:01:00 hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

This type of failure was accompanied by slurmctld logs like the following:
2023-09-27T12:21:26.694750-05:00 mfsched8 slurmctld[55357]: slurmctld: error: User m189274(10107) doesn't have a default account
2023-09-27T12:21:26.695139-05:00 mfsched8 slurmctld[55357]: slurmctld: _job_create: invalid account or partition for user 10107, account '(null)', and partition 'cpu-short'
2023-09-27T12:21:26.695139-05:00 mfsched8 slurmctld[55357]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
2023-09-27T12:21:26.695182-05:00 mfsched8 slurmctld[55357]: error: User m189274(10107) doesn't have a default account
2023-09-27T12:21:26.695221-05:00 mfsched8 slurmctld[55357]: _job_create: invalid account or partition for user 10107, account '(null)', and partition 'cpu-short'
2023-09-27T12:21:26.695249-05:00 mfsched8 slurmctld[55357]: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

Specifying an account allowed us to submit jobs:
[m189274@hn4 ~]$ srun -p cpu-short -A default -t 0:01:00 hostname
mf143.local

But again, we have default accounts set, as illustrated by commands I ran while in the failure mode:
[m189274@hn4 ~]$ sacctmgr show assoc user=m189274
   Cluster    Account       User  Partition     Share   Priority GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin 
---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- ------------- 
    mf    default    m189274    ngs-ext         1                                                                                                                                                            ngs-qos   ngs-qos               
    mf    default    m189274    ngs-sec         1                                                                                                                                                            ngs-qos   ngs-qos               
    mf    default    m189274    ngs-pri         1                                                                                                                                                            ngs-qos   ngs-qos               
    mf    default    m189274   cpu-long         1                                                                                                                                                        general-qos general-+               
    mf    default    m189274    cpu-med         1                                                                                                                                                        general-qos general-+               
    mf    default    m189274  cpu-short         1                                                                                                                                                        general-qos general-+               
    mf    default    m189274                    1                                                                                                                                                             normal                         

and:
[m189274@hn4 ~]$ sacctmgr show user m189274
      User   Def Acct     Admin 
---------- ---------- --------- 
   m189274    default      None 

We used sacctmgr to re-specify the default account for one of us admins, with a command like this:
sacctmgr modify user where user=USERNAME set defaultaccount=default

That user was then able to submit jobs without specifying an account. However, running the command to fix the one user didn't fix things for the other admin users who tested after that, including myself (m189274).

We then decided to dump the schema again, which I've attached.

And then increased debug logging:
'/etc/slurm/slurmdbd.conf: DebugLevelSyslog=debug5' and restart slurmdbd
'/etc/slurm/slurm.conf: SlurmctldSyslogDebug=debug5' and restart slurmctld

And then I went to test again...but the problem had gone away — I was able to submit without specifying an account. Another admin user confirmed, and we haven't heard further complaints from anyone.

So...it's not clear what the trigger was for failure or resolution.
- Failure could have been triggered by the reboot of the scheduler in general, restart of its various services, restart of slurmd, or deleting those several users out of the DB. Although none of those things in isolation seems to have caused problems otherwise (e.g., we restart slurmctld when adding nodes to Slurm, our user mgmt script runs via cron and routinely adds and removes users, and I suspect that the scheduler has been rebooted at least once since opening this ticket and prior to today). Or possibly something else.
- Resolution could have been triggered by fixing the one admin's account but with a delay (?), by upping debugging, or by restarts of services, or...?

Note: We also re-dumped the schema after the problem went away, and (as expected, I think) there were no differences except for AUTO_INCREMENT values in a few tables. So I'm not uploading that post-resolution dump.

If this happens in the future I think we will probably approach this with more atomic changes with testing in between, e.g., simply restart slurmctld and then re-test, then adjust debug logging and restart it again and re-test, then restart slurmdbd and re-test, then adjust its debug logging and re-test, etc.

If you have specific suggestions on what you'd like us to try please let us know.

And if there is another way for us to set debug logging that doesn't require a restart, that might be better. We do have Slurm daemons configured to log via syslog.
Comment 13 Jake Rundall 2023-09-27 16:16:37 MDT
Created attachment 32459 [details]
slurm_acct_db.schema.2023.09.27.sql — before resolution
Comment 14 Tim McMullan 2023-10-05 06:53:22 MDT
Hi Jake, 

Thank you for the update, and sorry for the long reply delay on this new occurrence; I was out of the office.

I think, though, that this new occurrence says we've been barking up the wrong tree: the upgrade probably didn't actually cause it. More likely, something changed between 21.08 and 23.02 that allows this situation to happen, or it is a side effect of a different issue.  Does your removal script remove the users individually, or group them up and delete them all at once?

Thanks,
--Tim
Comment 15 Jake Rundall 2023-10-05 07:00:06 MDT
Agreed, and no problem!

The user mgmt script processes account and user adds and removals one user or account at a time, in this order:
- remove users, one at a time
- remove accounts, one at a time
- create accounts, one at a time
- create users, one at a time

On this particular cluster, users should only ever be associated with a single account, which is their default.
Comment 16 Jake Rundall 2023-10-30 13:39:21 MDT
This happened again this past Saturday after rebooting the cluster (and updating from 23.02.4 to 23.02.6).

We took a DB backup before and after resolution. It sort of seems like running the user mgmt script may have knocked things loose (it processed 7 pending deletions), although perhaps that's just a coincidence.

I can definitely confirm that the entries for my user's associations in the <CLUSTERNAME>_assoc_table are identical before and after, so it really doesn't seem like a matter of something wrong in the DB:
(1663814221,1689861582,0,NULL,1,4,'m189274','default','','',13751,13752,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,NULL,'','')
(1669673497,1689861582,0,NULL,1,27,'m189274','default','cpu-short','',13713,13714,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,67,',67,','')
(1669673498,1689861582,0,NULL,1,28,'m189274','default','cpu-med','',13711,13712,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,67,',67,','')
(1669673499,1689861582,0,NULL,1,29,'m189274','default','cpu-long','',13709,13710,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,67,',67,','')
(1669673499,1689861582,0,NULL,1,30,'m189274','default','ngs-pri','',13707,13708,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,68,',68,','')
(1669673500,1689861582,0,NULL,1,31,'m189274','default','ngs-sec','',13705,13706,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,68,',68,','')
(1669673501,1689861582,0,NULL,1,32,'m189274','default','ngs-ext','',13703,13704,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,68,',68,','')

I'm associated with the default account in all of these rows (some involve specific partitions), and is_def is set to 1 for all of them.
Comment 17 Tim McMullan 2023-10-30 14:04:55 MDT
If you see this happen again, I'd be interested in seeing the output of "scontrol show assoc" from when it's broken and after fixing it (which seems to just require updating an account again?).  The slurmctld maintains a copy of the associations, which the slurmdbd pushes updates to when something changes.  I think it is likely that this copy is getting messed up somewhere, apparently by dropping default accounts on the floor.

Do you know the order of operations?  Was it restart/upgrade, broken, sacctmgr runs, still broken?  Or restart/upgrade, sacctmgr runs, broken?  I totally understand if you aren't sure, it just might help narrow it down!

Thanks!
--Tim
Comment 18 Jake Rundall 2023-10-30 14:11:50 MDT
Thanks, we can definitely pull the associations from slurmctld with scontrol then next time this happens.

The order in this past instance (and in general in the past) was:
1. update Slurm and restart scheduler (the order of those might vary)
2. we find it is broken
3. something seems to fix it (in most cases it seems like making a change to 1+ users/associations using sacctmgr is perhaps the thing that resolves things)
Comment 19 Tim McMullan 2023-10-31 09:01:54 MDT
Thank you!

Would you mind also attaching your slurm.conf?
Comment 20 Jake Rundall 2023-10-31 09:13:01 MDT
Created attachment 33044 [details]
slurm.conf.mf.2023.10.31
Comment 21 MarkD 2023-11-08 08:04:27 MST
We've seen this as well - logged similar report for our site, bug #18150
Comment 22 Tim McMullan 2023-11-10 12:31:05 MST
I've been continuing to try to reproduce this and am having no luck, nor am I seeing an obvious fault that would cause it right now.

Mark, thank you for the report.  I'm taking the details you provided in 18150 under advisement as well. I've been mostly doing "add account, add user, remove user, remove account" over and over again, trying to get the slurmctld confused.

It sounds like this is quite infrequent (relatively speaking) but with enough repetition we should be able to catch it in a bad state.
Comment 23 Jake Rundall 2023-11-27 11:02:35 MST
It seems we had this issue on a different cluster on Thu, Nov 9. (Sorry for the delayed report.) Very similar circumstances, but no update to Slurm at all, just a reboot of the scheduler (and everything else).

One notable difference compared to the cluster where we'd been having (or noticing) this originally is that this second cluster's job_submit.lua filter sets the account on all job submissions; users would generally submit without specifying an account and the job_submit.lua script changes the account from their default to something else.

Things unfolded something like this:

1:46pm - scheduler OS registers as "up"; Puppet then runs and reconfigures things on the node, which is stateless (other than the local storage where the DB and Slurm saved state live)

just before 2:11pm - I (rundall) ran a test job and got the usual error: "srun: error: Unable to allocate resources: Invalid account or account/partition combination specified"

2:11pm - I saved output of 'scontrol show assoc' as scontrol.show.assoc.before

2:13pm - I ran our account management script (different than on the other cluster but similar in principle)

2:15pm - a colleague (rbrunner) successfully submitted jobs w/o specifying account

2:24pm — I tested again and was able to run w/o specifying an account (should have tested earlier)

2:30pm — saved output of 'scontrol show assoc' again as scontrol.show.assoc.after (again, wish I'd thought to run this sooner)

You'll see that DefAccount=(null) for rundall in both cases, but perhaps it doesn't matter since our job_submit.lua script sets the account anyway. What is interesting to me is that references to rundall with a long-ish integer I don't recognize — rundall(4294967294) — are replaced by references to rundall with my LDAP user ID — UserName=rundall(54135). So I wonder if this is playing a role somehow. Maybe the issue is that the sssd cache is emptied during the reboot and it then takes a while for this info to repopulate and percolate into slurmctld?
Comment 24 Jake Rundall 2023-11-27 11:03:20 MST
Created attachment 33479 [details]
scontrol.show.assoc.before
Comment 25 Jake Rundall 2023-11-27 11:03:38 MST
Created attachment 33480 [details]
scontrol.show.assoc.after
Comment 26 Tim McMullan 2023-11-27 12:18:52 MST
Thanks for the update!  This is very interesting information.  Your UID being filled in as 4294967294 is basically the max possible UID, which I think suggests that something went wrong looking up your user.  I'll have to dig into where that particular number would get set.
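The failed-lookup behavior described above can be sketched in Python: `pwd.getpwnam` resolves users through NSS (and thus sssd), so an LDAP-only user is simply not found while sssd is down or its cache is cold. Treating 4294967294 (which is (uint32)-2) as a failed-lookup sentinel is an assumption here, chosen only because it matches the value seen in the scontrol output:

```python
import pwd

# 4294967294 == (uint32)-2, the placeholder seen in 'scontrol show assoc';
# using it as a failed-lookup sentinel is an assumption for illustration.
NO_UID = 4294967294

def lookup_uid(username):
    """Resolve a username to a UID via NSS; return the sentinel when the
    lookup fails (e.g. an LDAP-only user while sssd is unavailable)."""
    try:
        return pwd.getpwnam(username).pw_uid
    except KeyError:
        return NO_UID

print(lookup_uid("root"))     # 0: resolvable from local /etc/passwd
print(lookup_uid("rundall"))  # 4294967294 here: no such local user
```

Once sssd comes up and the lookup succeeds, the real UID (54135) would be returned instead, matching the before/after difference in the attachments.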

Would you attach both your slurmctld and slurmdbd logs for that period of time when this happened?  I'm taking a look at the assoc information now, but maybe there will be an error in the logs related to the uid lookup in the logs too.

Thanks!
--Tim
Comment 27 Jake Rundall 2023-11-27 13:01:32 MST
Created attachment 33483 [details]
logs-2023.11.09
Comment 28 Jake Rundall 2023-11-27 13:10:30 MST
I've uploaded the requested logs, from 2-3:59pm that day, because I'm not sure how long it took to finally populate DefAccount in the associations (although at this point it seems like the issue may be with user-account associations breaking due to failed UID lookups).

I didn't include anything but slurmctld and slurmdbd logs in what I attached, but casting the net wider on my end I can definitely see that slurmctld starts prior to sssd being healthy:

e.g., from syslog:
[root@mgsched1 log]# egrep -i "slurmctld.*started on cluster magnus|sssd.*Starting up" messages-20231112 
Nov  9 14:00:32 mgsched1 sssd_kcm[77389]: Starting up
Nov  9 14:00:36 mgsched1 slurmctld[77581]: slurmctld: slurmctld version 23.02.6 started on cluster magnus
Nov  9 14:00:36 mgsched1 slurmctld[77581]: slurmctld version 23.02.6 started on cluster magnus
2023-11-09T14:01:47.010128-06:00 mgsched1.internal.ncsa.edu sssd[88097]: Starting up
2023-11-09T14:01:47.022342-06:00 mgsched1.internal.ncsa.edu sssd_be[88098]: Starting up
2023-11-09T14:01:47.045057-06:00 mgsched1.internal.ncsa.edu sssd_nss[88099]: Starting up
2023-11-09T14:01:47.045748-06:00 mgsched1.internal.ncsa.edu sssd_pam[88100]: Starting up
2023-11-09T14:05:28.679812-06:00 mgsched1.internal.ncsa.edu sssd[122147]: Starting up
2023-11-09T14:05:28.689621-06:00 mgsched1.internal.ncsa.edu sssd_be[122148]: Starting up
2023-11-09T14:05:28.716406-06:00 mgsched1.internal.ncsa.edu sssd_pam[122150]: Starting up
2023-11-09T14:05:28.716541-06:00 mgsched1.internal.ncsa.edu sssd_nss[122149]: Starting up

So I think something we should try is creating more dependencies in Puppet to not try to start Slurm daemons unless sssd is fully configured and running. But please let me know if you have other thoughts. Thanks!
Comment 29 Tim McMullan 2023-11-27 13:26:59 MST
I noticed that you opened another ticket about reservations getting purged. I think the slurmctld logs are showing the reason for that, and it's that the slurmctld is up before sssd.

A sample:

> Nov  9 14:00:37 mgsched1 slurmctld[77581]: error: _get_group_members: Could not find configured group mg_admin
> Nov  9 14:00:37 mgsched1 slurmctld[77581]: error: Reservation maint_20231109 has invalid groups (mg_admin)
> Nov  9 14:00:37 mgsched1 slurmctld[77581]: error: Purging invalid reservation record maint_20231109

The failure to find the group caused us to think the reservation was invalid, so it got purged on restart.  The slurmdbd logs have similar issues where it looks like groups are missing and there are lookup failures.
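The group-lookup failure is the same class of problem as the UID lookup: `grp.getgrnam` also resolves through NSS, so a group that only exists in the site's directory service is unresolvable while sssd is down. A minimal sketch (mg_admin is the group name from the logs; whether it resolves depends on the directory service being reachable):

```python
import grp

def group_members(name):
    """Return the member list for a group, or None when NSS cannot resolve
    it (e.g. an LDAP-only group while sssd is down)."""
    try:
        return grp.getgrnam(name).gr_mem
    except KeyError:
        return None

print(group_members("root") is not None)  # True: 'root' is in local /etc/group
print(group_members("mg_admin"))          # None here: group only in the site's LDAP
```

With sssd down, slurmctld would see None for mg_admin and conclude the reservation's group list is invalid, matching the purge in the logs.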

What I would suggest for now is creating a drop-in systemd unit file for the slurmdbd/slurmctld/slurmd to start it after sssd (After=sssd.service).

Let me know if you aren't familiar with drop-ins or want to do it some other way.
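A minimal drop-in along these lines might look like the following (the After= directive is the suggestion above; the file name and the extra Wants= line are assumptions):

```ini
# /etc/systemd/system/slurmctld.service.d/after-sssd.conf
# Delay slurmctld startup until sssd has started, so UID/GID
# lookups via NSS work when state is restored.
[Unit]
After=sssd.service
Wants=sssd.service
```

After creating the file, `systemctl daemon-reload` makes systemd pick up the drop-in; the same fragment would apply to slurmdbd.service and slurmd.service.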

I think this will go a long way toward fixing this issue and bug18269.

Out of curiosity, what does your current unit file look like for the slurmctld?

Thanks!
--Tim
Comment 30 Jake Rundall 2023-11-27 13:33:47 MST
Thanks! Yes, I think you're right.


The unit file looks like this:
[root@mgsched1 ~]# cat /usr/lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity

# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null

[Install]
WantedBy=multi-user.target


With a (Puppet-managed) drop-in like this:
[root@mgsched1 ~]# cat /etc/systemd/system/slurmctld.service.d/slurmctld-restart.conf 
# File managed by Puppet
[Service]
Restart=on-failure


We can definitely either update the drop-in, or add another, or otherwise add this kind of dependency in between Puppet "resources".


It may be a while before we can test this potential fix in production, but at this point I'm more hopeful I can reproduce it on a test system and help validate the fix.
Comment 31 Tim McMullan 2023-11-27 13:45:19 MST
Well, I got a little dyslexic on the bug number (bug18296) but yes, I think we are on the right track here!

The Slurm daemons often assume that uid/gid resolution is working and consistent, so at a glance I think adding the After=sssd.service dependency upstream is reasonable too.  I'll do some further looking on that front; if you are able to confirm that this fixes the issues for you, though, that would be great!

(In reply to Jake Rundall from comment #30)
> We can definitely either update the drop-in, or add another, or otherwise
> add this kind of dependency in between Puppet "resources".

Sounds good, whatever is best for you! You're clearly familiar with drop-ins; some people aren't, so I like to bring them up, as I strongly prefer them in most situations like this.

Thanks for both your patience and help on this, let me know how things go!
Comment 33 Jake Rundall 2023-12-04 07:35:54 MST
I meant to update this one last Friday as well. I can confirm that clearing the sssd cache, stopping sssd, and then restarting slurmctld reproduces this issue. Our approach to avoiding this at our site will be to use Puppet dependencies to ensure that sssd is fully configured and running prior to the start of slurmctld.
Comment 34 Tim McMullan 2023-12-05 09:28:10 MST
*** Ticket 17980 has been marked as a duplicate of this ticket. ***
Comment 35 Tim McMullan 2024-01-04 08:12:25 MST
Hey Jake, 

Has this come up again since making the updates we discussed here?

Thanks!
--Tim
Comment 36 Jake Rundall 2024-01-04 10:09:24 MST
Nope, we should be OK. As long as sssd is fully configured and running prior to the start of slurmctld things work fine. I think we can close this one.
Comment 37 Tim McMullan 2024-01-04 10:10:53 MST
Ok! Sounds good.  Let us know if you need any other assistance!

Thanks!
--Tim