We've had this occur 3+ times over the past couple of years. The circumstances are always a reboot of the Slurm scheduler node (which also hosts the DB). There was no update of Slurm (although we had rebuilt our RPMs using a newer version of PMIx). The scheduler and the other nodes in the cluster are generally stateless, although the scheduler does have local disk for slurmctld saved state and for the MariaDB data backing the Slurm accounting DB.

The reservations would normally look like this:

$ scontrol show res
...
ReservationName=maint_20230208 StartTime=2024-02-08T07:00:00 EndTime=2025-02-07T07:00:00 Duration=365-00:00:00
   Nodes=mg[001-094,101-132] NodeCnt=126 CoreCnt=5048 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES,ALL_NODES,MAGNETIC
   TRES=cpu=5048
   Users=(null) Groups=mg_admin Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

Reservations are present prior to shutdown, and the scheduler processes are shut down cleanly before the node is rebooted. The Slurm daemons start only after local storage is available, and they come up after the proximate maintenance reservation would have started. But this reservation, and all future reservations, are gone:

$ scontrol show res
No reservations in the system

This hasn't happened every time we have maintenance, but I haven't been able to pin down what is different about the circumstances when it does happen. I do wonder if it is somehow related to the problem we've had where default accounts are temporarily ignored/missing (that's at least how we've interpreted that other problem; see 17270) -- except the reservations here are not temporarily missing but permanently missing. It does not happen with a simple restart of slurmctld, which we do fairly routinely (e.g., when adding or removing nodes).
Ok, looking at logs I gathered for the other ticket, I think the underlying issue is self-inflicted:

...
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grace
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_lowprio
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grc_be
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_abc_gpu
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_abc_gpu
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grc
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_illumina_data
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_admin
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_admin
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: Reservation maint_20231109 has invalid groups (mg_admin)
...

I suspect that slurmctld is starting before sssd is fully functional. I would guess we need to add some more dependencies to our configuration management. I think we can probably close this ticket.
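For context, the kind of guard I have in mind on our side is roughly this: before starting slurmctld, wait until the groups it needs actually resolve through NSS (i.e., until sssd is answering). This is just a sketch -- the `wait_for_group` helper and the timeout values are hypothetical, not anything Slurm or sssd ships:

```shell
# Hypothetical pre-start guard (not part of Slurm or sssd): block until a
# group resolves via NSS, so slurmctld is only started once sssd is answering.
wait_for_group() {
  group="$1"
  tries=0
  until getent group "$group" >/dev/null 2>&1; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && return 1   # give up after ~60s
    sleep 2
  done
}

# In our setup the group to check would be e.g. mg_admin; "root" is used
# here only so the sketch runs anywhere.
wait_for_group root && echo "group resolvable"
```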
Hey Jake, I'm gonna grab this ticket and keep it open for the moment. I'm pretty sure the dependency change will solve this issue and will (likely) play a role in 17270. If we choose to add After=sssd.service, it may be more appropriate to do so in association with this bug. I don't think it will take too long for us to collectively decide on the right path here. Thanks!
On a test scheduler I verified very directly that if I clear the sssd cache, stop sssd, and then restart slurmctld, it will purge reservations that involve users and groups that would be known via sssd lookups. As expected.

After some thinking and testing, it seems like modifying the slurmctld.service definition won't hurt, but it also won't fix things in our environment. In an environment where services were pre-configured and enabled to start at boot, configuring slurmctld.service with After=sssd.service would probably do the trick, as you suggest. (Or in any other environment where sssd and slurmctld might be started with the same ~command.) But in our environment, with stateless nodes and with Puppet installing, configuring, and starting sssd and slurmctld individually rather than having systemd start them automatically at boot, After= is not going to fix things (the key reason being that Puppet starts them individually). Requires= might do the trick (it might also require that sssd be fully configured before its first start).

But looking at the general case, adding Requires=sssd.service to slurmctld.service might not make sense. I.e., you might not want to add this to the unit file that gets built into the slurm-slurmctld RPM. Adding After=sssd.service shouldn't hurt, IMO, but you'll have to decide whether it's worthwhile. On our end we'll need to add some more resource dependencies in Puppet, which is no problem. Hope that makes sense.
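For anyone else hitting this ticket: a minimal sketch of the drop-in approach discussed above, assuming standard systemd drop-in semantics. The file name and the Wants= line are my own choices, not something Slurm ships; the sketch writes to a temp directory so it is self-contained, whereas a real deployment would target /etc/systemd/system/slurmctld.service.d/ and run `systemctl daemon-reload`:

```shell
# Sketch of a systemd drop-in ordering slurmctld after sssd.
# Written to a temp dir here for illustration only; a real deployment would
# write /etc/systemd/system/slurmctld.service.d/sssd-dep.conf instead.
dir=$(mktemp -d)
cat > "$dir/sssd-dep.conf" <<'EOF'
[Unit]
# Start slurmctld only after sssd is up so group lookups succeed.
# Wants= rather than Requires= keeps slurmctld startable on hosts
# that do not run sssd at all.
After=sssd.service
Wants=sssd.service
EOF
grep -c 'sssd.service' "$dir/sssd-dep.conf"
```

Note this only helps where systemd itself starts both services at boot; as discussed above, it does nothing when configuration management starts the daemons individually.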
Thanks for the update Jake! I understand that you'll need to tweak the process/unit file a bit more for your case, but I do think it is worth making a quick change to the shipped unit files, whether that's adding the service to After= or adding a comment/documentation note to help others avoid this class of problem in the future. I'll let you know what we end up doing here! Thanks again!
Hi Jake, We've decided that adding After=sssd.service to the unit files is about the best we could do to try to avoid this for someone else in the future. The update is in https://github.com/SchedMD/slurm/commit/eb11ddfc4b and will be included in 23.11.1+. I know for your site this should be handled now in a more robust way, but I wanted to let you know! Let me know if you have any other questions on this, and if not I'll close this ticket as resolved. Thanks! --Tim
Makes sense, go ahead and close this. Thanks!
Thanks Jake!