| Summary: | reservations disappear after scheduler is rebooted | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jake Rundall <rundall> |
| Component: | reservations | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | mcmullan |
| Version: | 23.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=17270 | ||
| Site: | NCSA | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 23.11.1, 24.08.0rc1 |
| Target Release: | --- | DevPrio: | --- |
|
Description

**Jake Rundall**, 2023-11-27 12:31:22 MST
Ok, looking at the logs I gathered for the other ticket, I think the underlying issue is self-inflicted:

```
...
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grace
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_lowprio
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grc_be
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_abc_gpu
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_abc_gpu
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_grc
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_abv_illumina_data
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_ncsa_user
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_admin
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: _get_group_members: Could not find configured group mg_admin
Nov 9 14:00:37 mgsched1 slurmctld[77581]: slurmctld: error: Reservation maint_20231109 has invalid groups (mg_admin)
...
```

I suspect that slurmctld is starting before sssd is fully functional. I would guess we need to add some more dependencies into our configuration management. I think we can probably close this ticket.

**Tim McMullan:**

Hey Jake, I'm gonna grab this ticket and keep it open for the moment. I'm pretty sure the dependency change will solve this issue and will (likely) play a role in 17270. If we choose to add After=sssd.service, it may be more appropriate to do so in association with this bug. I don't think it will take too long for us to collectively decide on the right path here. Thanks!

**Jake Rundall:**

On a test scheduler I verified very directly that if I clear the sssd cache, stop sssd, and then restart slurmctld, it will purge reservations that involve users and groups that would be known via sssd lookups. As expected.

After some thinking and testing, it seems that modifying the slurmctld.service definition won't hurt, but it also won't fix things in our environment. In an environment where services were pre-configured and enabled to start at boot, configuring slurmctld.service with After=sssd.service would probably do the trick, as you suggest (or in any other environment where sssd and slurmctld are started by roughly the same mechanism). But in our environment, with stateless nodes, where Puppet installs and configures sssd and slurmctld and starts them individually rather than having systemd start them automatically at boot, After= is not going to fix things (the key reason being that Puppet starts them individually). Requires= might do the trick (though it might also require that sssd is fully configured before its first start).
But looking at the general case, adding Requires=sssd.service to slurmctld.service might not make sense; i.e., you might not want to add it to the unit file that gets built into the slurm-slurmctld RPM. Adding After=sssd.service shouldn't hurt, IMO, but you'll have to decide whether it's worthwhile. On our end we'll need to add some more resource dependencies in Puppet, which is no problem. Hope that makes sense.

**Tim McMullan:**

Thanks for the update Jake! I understand that you'll need to tweak the process/unit file a bit more for your case, but I do think it is worth looking at making a quick change to the unit files, be that adding the service to After= or adding a comment/documentation note on avoiding this class of problem in the future. I'll let you know what we end up doing here! Thanks again!

**Tim McMullan:**

Hi Jake, we've decided that adding After=sssd.service to the unit files is about the best we can do to help someone else avoid this in the future. The update is in https://github.com/SchedMD/slurm/commit/eb11ddfc4b and will be included in 23.11.1+. I know that for your site this should now be handled in a more robust way, but I wanted to let you know! Let me know if you have any other questions on this; if not, I'll close this ticket as resolved. Thanks!

--Tim

**Jake Rundall:**

Makes sense, go ahead and close this. Thanks!

**Tim McMullan:**

Thanks Jake!
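For admins who hit the same symptom before upgrading to a fixed release, the ordering discussed above can be added locally with a systemd drop-in rather than editing the packaged unit file. This is a sketch of that approach, not the exact change SchedMD merged; the drop-in filename and the addition of Wants= are assumptions to adapt to your environment:

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmctld.service.d/sssd-ordering.conf
[Unit]
# Start slurmctld after sssd, so group lookups (e.g. for reservations
# restricted to groups) can resolve when the controller loads state.
After=sssd.service
# Optionally pull sssd in at boot as well. As noted in the discussion,
# a hard Requires= is likely too strong for the packaged unit file.
Wants=sssd.service
```

After creating the drop-in, run `systemctl daemon-reload` for it to take effect. Note that, as Jake points out, this only helps when systemd itself starts both services (e.g. enabled at boot); if configuration management starts the services individually, the ordering has to be expressed there instead.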