| Summary: | upgrade from 21.08.8 to 23.02.3 resulted in issue with default accounts | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jake Rundall <rundall> |
| Component: | Accounting | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bglick, david.gloe, mark.c.dixon |
| Version: | 23.02.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18296 | ||
| Site: | NCSA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
Slurm DB schema (no data)
slurm_acct_db.schema.2023.09.27.sql — before resolution
slurm.conf.mf.2023.10.31
scontrol.show.assoc.before
scontrol.show.assoc.after
logs-2023.11.09 |
||
|
Description
Jake Rundall
2023-07-24 12:28:52 MDT
Do you have the older schema still available to you, and would it be possible to share that with us?

We do have a backup. Am I correct that you're just asking for the pre-upgrade schema w/o any data?

> We do have a backup. Am I correct that you're just asking for the pre-upgrade schema w/o any data?
Correct
Hi Jake, Can you provide the number of users, accounts, and associations in your system? Thanks! --Tim

users: 868
accounts: 2
associations: 5976

I'm working to get the pre-update schema for you. It's in a pretty restricted environment so there are some hoops to jump through.

Hey Jake, If it's that difficult, I'll take your word for it that the schema was not modified for now. If you were running any patches on top of 21.08 that would be good to know though! Thanks! --Tim

Nope, no patches. Would have been 21.08.8-2.

Created attachment 31462 [details]
Slurm DB schema (no data)
I'm attaching the schema (the only thing I've changed after exporting is the cluster name).
Hey Jake, I've been looking at this and having a hard time reproducing. One thing I didn't ask before was what version of mysql you were running. Maybe there is a bad interaction on the upgrade that I need to look at, so would you mind letting me know what version you were running for the upgrade? Thanks! -Tim

Thanks! We were running MariaDB 10.6.14 at the time of the Slurm upgrade. We've got at least a couple of clusters running Slurm 23.02 that didn't run into the issue, although I believe they both went from 22.05 rather than jumping from 21.08.

(In reply to Jake Rundall from comment #10)
> Thanks! We were running MariaDB 10.6.14 at the time of the Slurm upgrade.
>
> We've got at least a couple of clusters running Slurm 23.02 that didn't run
> into the issue, although I believe they both went from 22.05 rather than
> jumping from 21.08.

Ok, interesting, thank you! I'll match my MariaDB version just in case and see if I can track it down!

We encountered this issue again today on the same cluster. It's a bit unclear what the trigger was, but I'll lay out the scenario.
Note that we updated from 23.02.3 to 23.02.4 a few weeks ago.
As of last night things were working normally — users could submit without specifying any account because they are all associated with an account named 'default' which is set as their default account.
We had shared filesystem (GPFS) problems overnight. This was related to a storage-side IB fabric issue; the scheduler/login nodes/compute nodes do not connect to that fabric, which simply connects IO nodes and storage controllers to each other. The scheduler's DB and saved state are on local disk, and the scheduler doesn't even mount GPFS.
This morning after the GPFS backend issue was resolved we rebooted the rest of the cluster. The scheduler probably didn't need a reboot since it doesn't even mount GPFS, but it was rebooted anyway. So there was a restart of slurmctld, slurmdbd, and mariadb. And reboots of most or all compute nodes and login nodes.
After the scheduler rebooted, our custom Slurm DB user mgmt script ran and removed several (44) users from the DB, using commands like this (these were just routine changes due to users being removed from LDAP groups):
sacctmgr -i delete user name=USERNAME
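For illustration, the removal pass of that script can be sketched as a dry-run loop. This is a sketch only: the usernames and the `run` wrapper are placeholders (the real script derives the list from LDAP group changes); only the `sacctmgr -i delete user` invocation mirrors the command above.

```shell
#!/bin/sh
# Dry-run sketch of the routine user-removal pass. DRYRUN=1 prints the
# sacctmgr commands instead of executing them against the live slurmdbd.
DRYRUN=1
run() { if [ -n "$DRYRUN" ]; then echo "$@"; else "$@"; fi; }

for u in user1 user2 user3; do              # placeholder usernames
    run sacctmgr -i delete user name="$u"   # -i: immediate, no interactive prompt
done
```

With DRYRUN set this just prints one `sacctmgr -i delete user name=...` line per user; unsetting DRYRUN on a real cluster would execute them one at a time, matching the behavior described above.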
We put the cluster back in service (marked nodes back online, etc.).
Around or shortly after this point, users noted that they couldn't submit jobs and we verified this to be the case for us admins as well, e.g.:
[m189274@hn4 ~]$ srun -p cpu-short -t 0:01:00 hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
This type of failure was accompanied by slurmctld logs like the following:
2023-09-27T12:21:26.694750-05:00 mfsched8 slurmctld[55357]: slurmctld: error: User m189274(10107) doesn't have a default account
2023-09-27T12:21:26.695139-05:00 mfsched8 slurmctld[55357]: slurmctld: _job_create: invalid account or partition for user 10107, account '(null)', and partition 'cpu-short'
2023-09-27T12:21:26.695139-05:00 mfsched8 slurmctld[55357]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
2023-09-27T12:21:26.695182-05:00 mfsched8 slurmctld[55357]: error: User m189274(10107) doesn't have a default account
2023-09-27T12:21:26.695221-05:00 mfsched8 slurmctld[55357]: _job_create: invalid account or partition for user 10107, account '(null)', and partition 'cpu-short'
2023-09-27T12:21:26.695249-05:00 mfsched8 slurmctld[55357]: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
Specifying an account allowed us to submit jobs:
[m189274@hn4 ~]$ srun -p cpu-short -A default -t 0:01:00 hostname
mf143.local
But again, we have default accounts set, as illustrated by commands I ran while in the failure mode:
[m189274@hn4 ~]$ sacctmgr show assoc user=m189274
Cluster Account User Partition Share Priority GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins QOS Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
mf default m189274 ngs-ext 1 ngs-qos ngs-qos
mf default m189274 ngs-sec 1 ngs-qos ngs-qos
mf default m189274 ngs-pri 1 ngs-qos ngs-qos
mf default m189274 cpu-long 1 general-qos general-+
mf default m189274 cpu-med 1 general-qos general-+
mf default m189274 cpu-short 1 general-qos general-+
mf default m189274 1 normal
and:
[m189274@hn4 ~]$ sacctmgr show user m189274
User Def Acct Admin
---------- ---------- ---------
m189274 default None
We used sacctmgr to re-specify the default account for one of us admins, with a command like this:
sacctmgr modify user where user=USERNAME set defaultaccount=default
That user was then able to submit jobs without specifying an account. However, running the command to fix the one user didn't fix things for the other admin users who tested after that, including myself (m189274).
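If it recurs, instead of fixing admins one at a time, the same fix could be applied to every user in one pass. A dry-run sketch, under two assumptions worth verifying: the shared account really is named `default` on the affected cluster, and `sacctmgr -nP show user format=user` prints one username per line on this Slurm version.

```shell
#!/bin/sh
# Dry-run sketch: re-assert defaultaccount=default for every user in the DB.
# DRYRUN=1 prints commands and uses placeholder users instead of querying
# the live slurmdbd.
DRYRUN=1
apply() { if [ -n "$DRYRUN" ]; then echo "$@"; else "$@"; fi; }

list_users() {
    if [ -n "$DRYRUN" ]; then
        printf '%s\n' m189274 rbrunner     # placeholder users for the sketch
    else
        sacctmgr -nP show user format=user # -n: no header, -P: parsable output
    fi
}

list_users | while read -r u; do
    apply sacctmgr -i modify user where user="$u" set defaultaccount=default
done
```

This is deliberately blunt; it would also touch users that are fine, but modifying an association seems to be exactly what nudges the slurmctld back into a good state.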
We then decided to dump the schema again, which I've attached.
And then increased debug logging:
'/etc/slurm/slurmdbd.conf: DebugLevelSyslog=debug5' and restart slurmdbd
'/etc/slurm/slurm.conf: SlurmctldSyslogDebug=debug5' and restart slurmctld
And then I went to test again...but the problem had gone away — I was able to submit without specifying an account. Another admin user confirmed, and we haven't heard further complaints from anyone.
So...it's not clear what the trigger was for failure or resolution.
- Failure could have been triggered by the reboot of the scheduler in general, restart of its various services, restart of slurmd, or deleting those several users out of the DB. Although none of those things in isolation seems to have caused problems otherwise (e.g., we restart slurmctld when adding nodes to Slurm, our user mgmt script runs via cron and routinely adds and removes users, and I suspect that the scheduler has been rebooted at least once since opening this ticket and prior to today). Or possibly something else.
- Resolution could have been triggered by fixing the one admin's account but with a delay (?), by upping debugging, or by restarts of services, or...?
Note: We also re-dumped the schema after the problem went away, and (as expected, I think) there were no differences except for AUTO_INCREMENT values in a few tables. So I'm not uploading that post-resolution dump.
If this happens in the future I think we will probably approach this with more atomic changes with testing in between, e.g., simply restart slurmctld and then re-test, then adjust debug logging and restart it again and re-test, then restart slurmdbd and re-test, then adjust its debug logging and re-test, etc.
If you have specific suggestions on what you'd like us to try please let us know.
And if there is another way for us to set debug logging that might not require a restart that might be better. We do have Slurm daemons configured to log via syslog.
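On the runtime-logging question: for slurmctld, `scontrol setdebug` changes the log level on the fly with no restart (documented scontrol behavior); for slurmdbd I'm not aware of an equivalent runtime knob, so assume it still needs the conf edit plus restart. A dry-run sketch:

```shell
#!/bin/sh
# Dry-run sketch: bump slurmctld verbosity while reproducing, then restore.
# No slurmctld restart is needed for either step.
DRYRUN=1
run() { if [ -n "$DRYRUN" ]; then echo "$@"; else "$@"; fi; }

run scontrol setdebug debug5   # raise verbosity while gathering evidence
run scontrol setdebug info     # restore the usual level afterwards
```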
Created attachment 32459 [details]
slurm_acct_db.schema.2023.09.27.sql — before resolution
Hi Jake, Thank you for the update and sorry for the long reply delay on the new occurrence of this, I was out of the office. I think this new occurrence says we've been barking up the wrong tree: the upgrade probably didn't actually cause it. More likely, something changed between 21.08 and 23.02 that allows this situation to happen, or it is a side-effect of a different issue. Does your removal script remove the users individually or group them up and delete them all at once? Thanks, --Tim

Agreed, and no problem! The user mgmt script processes account and user adds and removals one user or account at a time, in this order:
- remove users, one at a time
- remove accounts, one at a time
- create accounts, one at a time
- create users, one at a time

On this particular cluster, users should only ever be associated with a single account, which is their default.

This happened again this past Saturday after rebooting the cluster (and updating from 23.02.4 to 23.02.6). We took a DB backup before and after resolution. It sort of seems like running the user mgmt script may have knocked things loose (it processed 7 pending deletions), although perhaps that's just a coincidence.
I can definitely confirm that the entries for my user's associations in the <CLUSTERNAME>_assoc_table are identical before and after, so it really doesn't seem like a matter of something wrong in the DB:

(1663814221,1689861582,0,NULL,1,4,'m189274','default','','',13751,13752,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,NULL,'','')
(1669673497,1689861582,0,NULL,1,27,'m189274','default','cpu-short','',13713,13714,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,67,',67,','')
(1669673498,1689861582,0,NULL,1,28,'m189274','default','cpu-med','',13711,13712,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,67,',67,','')
(1669673499,1689861582,0,NULL,1,29,'m189274','default','cpu-long','',13709,13710,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,67,',67,','')
(1669673499,1689861582,0,NULL,1,30,'m189274','default','ngs-pri','',13707,13708,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,68,',68,','')
(1669673500,1689861582,0,NULL,1,31,'m189274','default','ngs-sec','',13705,13706,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,68,',68,','')
(1669673501,1689861582,0,NULL,1,32,'m189274','default','ngs-ext','',13703,13704,1,NULL,NULL,NULL,NULL,'','','','',NULL,NULL,NULL,NULL,'','','',NULL,NULL,68,',68,','')

I'm associated with the default account in all of these rows (some involve specific partitions), and is_def is set to 1 for all of them.

If you see this happen again, I'd be interested in seeing the output of "scontrol show assoc" from when it's broken and after fixing it (which seems to just require updating an account again?). The slurmctld maintains a copy of the associations that the slurmdbd pushes updates to when it changes. I think it is likely that this is getting messed up somewhere, and apparently by dropping default accounts on the floor. Do you know the order of operations?
Is it restart/upgrade, broken, sacctmgr runs, still broken? Or restart/upgrade, sacctmgr runs, broken? I totally understand if you aren't sure; it just might help narrow it down! Thanks! --Tim

Thanks, we can definitely pull the associations from slurmctld with scontrol the next time this happens. The order in this past instance (and in general in the past) was:
1. update Slurm and restart scheduler (the order of those might vary)
2. we find it is broken
3. something seems to fix it (in most cases it seems like making a change to one or more users/associations using sacctmgr is perhaps the thing that resolves it)

Thank you! Would you mind also attaching your slurm.conf?

Created attachment 33044 [details]
slurm.conf.mf.2023.10.31
We've seen this as well; we logged a similar report for our site, bug #18150.

I've been continuing to try to reproduce this and am having no luck, nor am I seeing an obvious fault that would cause it right now. Mark, thank you for the report. I'm taking the details you provided in 18150 under advisement as well; I've been mostly doing "add account, add user, remove user, remove account" over and over again trying to get the slurmctld to get confused. It sounds like this is quite infrequent (relatively speaking), but with enough repetition we should be able to catch it in a bad state.

It seems we had this issue on a different cluster on Thu, Nov 9. (Sorry for the delayed report.) Very similar circumstances, but no update at all to Slurm, just a reboot of the scheduler (and everything else). One notable difference compared to the cluster where we'd been having (or noticing) this originally is that this second cluster's job_submit.lua filter sets the account on all job submissions; users would generally submit without specifying an account, and the job_submit.lua script changes the account from their default to something else.
Things unfolded something like this:
- 1:46pm: scheduler OS registers as "up"; Puppet then runs and reconfigures things on the node, which is stateless (other than the local storage where the DB and Slurm saved state live)
- just before 2:11pm: I (rundall) ran a test job and got the usual error: "srun: error: Unable to allocate resources: Invalid account or account/partition combination specified"
- 2:11pm: I saved the output of 'scontrol show assoc' as scontrol.show.assoc.before
- 2:13pm: I ran our account management script (different than on the other cluster but similar in principle)
- 2:15pm: a colleague (rbrunner) successfully submitted jobs w/o specifying an account
- 2:24pm: I tested again and was able to run w/o specifying an account (should have tested earlier)
- 2:30pm: I saved the output of 'scontrol show assoc' again as scontrol.show.assoc.after (again, wish I'd thought to run this sooner)

You'll see that DefAccount=(null) for rundall in both cases, but perhaps it doesn't matter since our job_submit.lua script sets the account anyway. What is interesting to me is that references to rundall with some long-ish integer that I don't recognize, rundall(4294967294), are replaced by references to rundall with my LDAP user ID, UserName=rundall(54135). So I wonder if this isn't playing a role somehow. Maybe the issue is that the sssd cache is emptied during the reboot and it then takes a while for this info to repopulate and percolate into slurmctld?

Created attachment 33479 [details]
scontrol.show.assoc.before
Created attachment 33480 [details]
scontrol.show.assoc.after
Thanks for the update! This is very interesting information. Your UID being filled in as 4294967294 is basically the max possible UID, which I think suggests that something went wrong looking up your user. I'll have to dig into where that particular number would get set. Would you attach both your slurmctld and slurmdbd logs for that period of time when this happened? I'm taking a look at the assoc information now, but maybe there will be an error related to the uid lookup in the logs too. Thanks! --Tim

Created attachment 33483 [details]
logs-2023.11.09
I've uploaded the requested logs, from 2-3:59pm that day, because I'm not sure how long it took to finally populate DefAccount in the associations (although at this point it seems like the issue may be with user-account associations having issues due to UID lookups failing).

I didn't include anything but slurmctld and slurmdbd logs in what I attached, but casting the net wider on my end I can definitely see that slurmctld starts prior to sssd being healthy. E.g., from syslog:

[root@mgsched1 log]# egrep -i "slurmctld.*started on cluster magnus|sssd.*Starting up" messages-20231112
Nov 9 14:00:32 mgsched1 sssd_kcm[77389]: Starting up
Nov 9 14:00:36 mgsched1 slurmctld[77581]: slurmctld: slurmctld version 23.02.6 started on cluster magnus
Nov 9 14:00:36 mgsched1 slurmctld[77581]: slurmctld version 23.02.6 started on cluster magnus
2023-11-09T14:01:47.010128-06:00 mgsched1.internal.ncsa.edu sssd[88097]: Starting up
2023-11-09T14:01:47.022342-06:00 mgsched1.internal.ncsa.edu sssd_be[88098]: Starting up
2023-11-09T14:01:47.045057-06:00 mgsched1.internal.ncsa.edu sssd_nss[88099]: Starting up
2023-11-09T14:01:47.045748-06:00 mgsched1.internal.ncsa.edu sssd_pam[88100]: Starting up
2023-11-09T14:05:28.679812-06:00 mgsched1.internal.ncsa.edu sssd[122147]: Starting up
2023-11-09T14:05:28.689621-06:00 mgsched1.internal.ncsa.edu sssd_be[122148]: Starting up
2023-11-09T14:05:28.716406-06:00 mgsched1.internal.ncsa.edu sssd_pam[122150]: Starting up
2023-11-09T14:05:28.716541-06:00 mgsched1.internal.ncsa.edu sssd_nss[122149]: Starting up

So I think something we should try is creating more dependencies in Puppet so that Slurm daemons aren't started unless sssd is fully configured and running. But please let me know if you have other thoughts. Thanks!

I noticed that you opened another ticket about reservations getting purged. I think the slurmctld logs are showing the reason for that, and I think it's that the slurmctld is up before sssd.
A sample:

> Nov 9 14:00:37 mgsched1 slurmctld[77581]: error: _get_group_members: Could not find configured group mg_admin
> Nov 9 14:00:37 mgsched1 slurmctld[77581]: error: Reservation maint_20231109 has invalid groups (mg_admin)
> Nov 9 14:00:37 mgsched1 slurmctld[77581]: error: Purging invalid reservation record maint_20231109

The failure to find the group caused us to think the reservation was invalid, so it got purged on restart. The slurmdbd logs have similar issues where it looks like groups are missing and there are lookup failures.

What I would suggest for now is creating a drop-in systemd unit file for the slurmdbd/slurmctld/slurmd to start them after sssd (After=sssd.service). Let me know if you aren't familiar with drop-ins or want to do it some other way. I think this will go a long way toward fixing this issue and bug18269. Out of curiosity, what does your current unit file look like for the slurmctld? Thanks! --Tim

Thanks! Yes, I think you're right. The unit file looks like this:

[root@mgsched1 ~]# cat /usr/lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity

# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null

[Install]
WantedBy=multi-user.target

With a (Puppet-managed) drop-in like this:

[root@mgsched1 ~]# cat /etc/systemd/system/slurmctld.service.d/slurmctld-restart.conf
# File managed by Puppet
[Service]
Restart=on-failure

We can definitely either update the drop-in, or add another, or otherwise add this kind of dependency in between Puppet "resources".
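For concreteness, a minimal sketch of the ordering drop-in Tim suggests. The drop-in filename is made up, and the sketch writes under /tmp so it is safe to run anywhere; in production the directory would be /etc/systemd/system/slurmctld.service.d/ followed by `systemctl daemon-reload`, and the same idea applies to slurmdbd and slurmd.

```shell
#!/bin/sh
# Sketch: ordering drop-in for slurmctld. /tmp stands in for
# /etc/systemd/system here so the snippet can run without root.
dropin_dir=/tmp/slurmctld.service.d
mkdir -p "$dropin_dir"
cat > "$dropin_dir/order-after-sssd.conf" <<'EOF'
[Unit]
# Wait for sssd so uid/gid lookups succeed when associations and
# reservation groups are loaded at startup.
After=sssd.service
Wants=sssd.service
EOF
```

Wants= (rather than Requires=) is the softer choice here: slurmctld still starts if sssd is absent, but is ordered after it when both are enabled.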
It may be a while before we can test this potential fix in production, but at this point I'm more hopeful I can reproduce it on a test system and help validate the fix.

Well, I got a little dyslexic on the bug number (bug18296), but yes, I think we are on the right track here! The slurm daemons often assume that uid/gid resolution is working and consistent, so at a glance I think adding the After=sssd.service dependency upstream is reasonable too. I'll do some further looking on that front; if you are able to confirm that this fixes the issues for you, that would be great!

(In reply to Jake Rundall from comment #30)
> We can definitely either update the drop-in, or add another, or otherwise
> add this kind of dependency in between Puppet "resources".

Sounds good, whatever is best for you! You're clearly familiar with drop-ins; some people aren't, so I like to bring them up, as I strongly prefer them in most situations like this. Thanks for both your patience and help on this, let me know how things go!

I meant to update this one last Friday as well. I do confirm that having the sssd cache cleared, stopping sssd, and then restarting slurmctld reproduces this issue. Our approach to avoid this at our site will be to use Puppet dependencies, to ensure that sssd is fully configured and running prior to the start of slurmctld.

*** Ticket 17980 has been marked as a duplicate of this ticket. ***

Hey Jake, Has this come up again since making the updates we discussed here? Thanks! --Tim

Nope, we should be OK. As long as sssd is fully configured and running prior to the start of slurmctld things work fine. I think we can close this one.

Ok! Sounds good. Let us know if you need any other assistance! Thanks! --Tim |
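Given the confirmed reproducer (sssd cache cleared, sssd stopped, slurmctld restarted), a belt-and-braces sketch beyond unit ordering is to gate the daemon start on uid resolution actually working, e.g. from an ExecStartPre= hook. The script and probe user are hypothetical; in practice you would probe an LDAP-backed account rather than root.

```shell
#!/bin/sh
# Hypothetical ExecStartPre= probe: wait until a known account resolves
# before letting slurmctld start, so associations don't load with the
# bogus uid 4294967294 seen above.
probe_user=root            # placeholder; probe an LDAP-backed user in practice
tries=5
status=FAILED
while [ "$tries" -gt 0 ]; do
    if getent passwd "$probe_user" >/dev/null 2>&1; then
        status=OK
        break
    fi
    tries=$((tries - 1))
    sleep 1
done
echo "uid resolution $status for $probe_user"
[ "$status" = OK ]         # exit code tells systemd whether to proceed
```

If the probe fails, systemd would hold slurmctld back rather than start it against a broken name service; the retry count and delay are placeholders to tune.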