We've had this occur 3+ times over the past couple of years. The circumstances are always a reboot of the Slurm scheduler node (which also hosts the DB). There was no update of Slurm (although we had rebuilt our RPMs using a newer version of PMIx). The scheduler and the other nodes in the cluster are generally stateless, although the scheduler does have local disk for slurmctld saved state and for the MariaDB data backing the Slurm accounting DB.

The reservations would normally look like this:

$ scontrol show res
...
ReservationName=maint_20230208 StartTime=2024-02-08T07:00:00 EndTime=2025-02-07T07:00:00 Duration=365-00:00:00
   Nodes=mg[001-094,101-132] NodeCnt=126 CoreCnt=5048 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES,ALL_NODES,MAGNETIC
   TRES=cpu=5048
   Users=(null) Groups=mg_admin Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

Reservations are present prior to shutdown, and the scheduler processes are shut down cleanly before the node is rebooted. The Slurm daemons start only after local storage is available, and they come up after the proximate maintenance reservation would have started. But this reservation, and all future reservations, are gone:

$ scontrol show res
No reservations in the system

This hasn't happened every time we have maintenance, but I haven't been able to pin down what is different about the circumstances when it does happen. I do wonder if it is somehow related to the problem we've had where default accounts are temporarily ignored/missing (that's at least how we've interpreted that other problem; see 17270) -- except the reservations here are not temporarily missing but permanently missing. It does not happen with a simple restart of slurmctld, which we do fairly routinely (e.g., when adding or removing nodes).
Ok, looking at logs I gathered for the other ticket, I think the underlying issue is self-inflicted:

...
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grace
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_lowprio
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grc_be
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_abc_gpu
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_abc_gpu
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grc
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_illumina_data
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_admin
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_admin
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: Reservation maint_20231109 has invalid groups (mg_admin)
...

I suspect that slurmctld is starting before sssd is fully functional. I would guess we need to add some more dependencies to our configuration management. I think we can probably close this ticket.
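For context, the kind of guard I have in mind on our side is roughly this: before starting slurmctld, wait until the groups it needs actually resolve through NSS (i.e., until sssd is answering). This is just a sketch -- the `wait_for_group` helper and the timeout values are hypothetical, not anything Slurm or sssd ships:

```shell
# Hypothetical pre-start guard (not part of Slurm or sssd): block until a
# group resolves via NSS, so slurmctld is only started once sssd is answering.
wait_for_group() {
  group="$1"
  tries=0
  until getent group "$group" >/dev/null 2>&1; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && return 1   # give up after ~60s
    sleep 2
  done
}

# In our setup the group to check would be e.g. mg_admin; "root" is used
# here only so the sketch runs anywhere.
wait_for_group root && echo "group resolvable"
```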
Hey Jake, I'm gonna grab this ticket and keep it open for the moment. I'm pretty sure the dependency change will solve this issue and will (likely) play a role in 17270. If we choose to add After=sssd.service, it may be more appropriate to do so in association with this bug. I don't think it will take too long for us to collectively decide on the right path here. Thanks!
On a test scheduler I verified very directly that if I clear the sssd cache, stop sssd, and then restart slurmctld, it will purge reservations that involve users and groups that would be known via sssd lookups. As expected.

After some thinking and testing, it seems like modifying the slurmctld.service definition won't hurt, but it also won't fix things in our environment. In an environment where services were pre-configured and enabled to start at boot, configuring slurmctld.service with After=sssd.service would probably do the trick, as you suggest. (Or in any other environment where sssd and slurmctld might be started with the same ~command.) But in our environment, with stateless nodes and with Puppet installing, configuring, and starting sssd and slurmctld individually rather than having systemd start them automatically at boot, After= is not going to fix things (the key reason being that Puppet starts them individually). Requires= might do the trick (it might also require that sssd be fully configured before its first start).

But looking at the general case, adding Requires=sssd.service to slurmctld.service might not make sense. I.e., you might not want to add this to the unit file that gets built into the slurm-slurmctld RPM. Adding After=sssd.service shouldn't hurt, IMO, but you'll have to decide whether it's worthwhile. On our end we'll need to add some more resource dependencies in Puppet, which is no problem. Hope that makes sense.
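For anyone else hitting this ticket: a minimal sketch of the drop-in approach discussed above, assuming standard systemd drop-in semantics. The file name and the Wants= line are my own choices, not something Slurm ships; the sketch writes to a temp directory so it is self-contained, whereas a real deployment would target /etc/systemd/system/slurmctld.service.d/ and run `systemctl daemon-reload`:

```shell
# Sketch of a systemd drop-in ordering slurmctld after sssd.
# Written to a temp dir here for illustration only; a real deployment would
# write /etc/systemd/system/slurmctld.service.d/sssd-dep.conf instead.
dir=$(mktemp -d)
cat > "$dir/sssd-dep.conf" <<'EOF'
[Unit]
# Start slurmctld only after sssd is up so group lookups succeed.
# Wants= rather than Requires= keeps slurmctld startable on hosts
# that do not run sssd at all.
After=sssd.service
Wants=sssd.service
EOF
grep -c 'sssd.service' "$dir/sssd-dep.conf"
```

Note this only helps where systemd itself starts both services at boot; as discussed above, it does nothing when configuration management starts the daemons individually.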
Thanks for the update Jake! I understand that you'll need to tweak the process/unit file a bit more for your case, but I do think it is worth making a quick change to the shipped unit files, whether that's adding the service to After= or adding a comment/documentation note to help others avoid this class of problem in the future. I'll let you know what we end up doing here! Thanks again!
Hi Jake, We've decided that adding After=sssd.service to the unit files is about the best we could do to try to avoid this for someone else in the future. The update is in https://github.com/SchedMD/slurm/commit/eb11ddfc4b and will be included in 23.11.1+. I know for your site this should be handled now in a more robust way, but I wanted to let you know! Let me know if you have any other questions on this, and if not I'll close this ticket as resolved. Thanks! --Tim
Makes sense, go ahead and close this. Thanks!
Thanks Jake!