Summary: | Issue with "User's group not permitted" Error Post-SSSD/GPFS Upgrades | ||
---|---|---|---|
Product: | Slurm | Reporter: | Barry Chiu <barryc> |
Component: | Scheduling | Assignee: | Connor <connor> |
Status: | OPEN --- | QA Contact: | |
Severity: | 2 - High Impact | ||
Priority: | --- | CC: | cinek, oscar.hernandez |
Version: | 24.05.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Northwestern | Alineos Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- |
Description
Barry Chiu
2025-03-28 07:31:32 MDT
The failures are very likely related to the issues from sssd you've mentioned. The function responsible for user access verification is validate_group[1]. Internally it keeps a negative cache of 5s to prevent frequent rechecks of the same user/partition combination if a previous attempt failed. When the check succeeds, the positive entry is cached too. Group membership is verified on each job submission attempt, but the cached lists are also updated periodically; this behavior is governed by:

> GroupUpdateForce
> If set to a non-zero value, then information about which users are members of groups allowed to use a partition will be updated periodically, even when there have been no changes to the /etc/group file. If set to zero, group member information will be updated only after the /etc/group file is updated. The default value is 1. Also see the GroupUpdateTime parameter.
>
> GroupUpdateTime
> Controls how frequently information about which users are members of groups allowed to use a partition will be updated, and how long user group membership lists will be cached. The time interval is given in seconds with a default value of 600 seconds. A value of zero will prevent periodic updating of group membership information. Also see the GroupUpdateForce parameter.

If you want to get additional insights into those failures, you may want to increase SlurmctldDebug to debug2 level. Focus on messages coming from the logging calls below:

>debug2("%s: uid %u not in group permitted to use this partition (%s). groups allowed: %s",
>error("%s: Could not find group with gid %u",
>error("%s: Could not find passwd entry for uid %u",
>debug("UID %u added to AllowGroup %s of partition %s",

I hope that helps.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/1757f910502d73edb44f69677b64fde438ad3e2a/src/slurmctld/partition_mgr.c#L1857-L1953

Thanks Marcin! I'd shut down slurmctld and slurmdbd when my coworker and I swapped in/out SSSD and the /etc/group file. Does that still make a difference? Thank you!

That is, if the servers are down... is there anything to cache? Sorry if I wasn't clearer earlier.

The internal slurmctld cache isn't preserved over a restart, so if the backend is stable and answering calls like `getent passwd ...` and `getent group ...`, slurmctld shouldn't have any issues checking partition access.

cheers,
Marcin
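
For anyone wanting to confirm that the name service backend answers consistently on the controller host, the lookups behind `getent passwd ...` and `getent group ...` can be approximated with a short Python sketch. This is not Slurm's validate_group implementation, only an illustration of the same kind of check. The function name `check_group_access` and the injectable lookup parameters are hypothetical, added here so the logic can be exercised without a live SSSD backend.

```python
import grp
import os
import pwd
from typing import Callable, Iterable, List, Tuple


def check_group_access(
    user: str,
    allowed_groups: Iterable[str],
    getpwnam: Callable = pwd.getpwnam,
    getgrouplist: Callable = os.getgrouplist,
    getgrgid: Callable = grp.getgrgid,
) -> Tuple[bool, List[str]]:
    """Return (allowed, problems) for `user` against `allowed_groups`.

    Hypothetical helper: it resolves the user and the user's groups through
    the system name service (files, SSSD, ...) and reports failed lookups.
    """
    try:
        pw = getpwnam(user)  # roughly `getent passwd <user>`
    except KeyError:
        return False, [f"no passwd entry for user {user!r}"]

    problems: List[str] = []
    names = set()
    # Primary plus supplementary gids, as the name service reports them.
    for gid in getgrouplist(user, pw.pw_gid):
        try:
            names.add(getgrgid(gid).gr_name)  # roughly `getent group <gid>`
        except KeyError:
            problems.append(f"no group entry for gid {gid}")

    # Access is granted if any resolved group name is in the allowed list.
    allowed = bool(names & set(allowed_groups))
    return allowed, problems
```

Calling `check_group_access("barryc", ["somegroup"])` (the group name is a placeholder for whatever the partition's AllowGroups lists) several times over a few minutes on the slurmctld host would show whether the answer flips between runs. A flip would point at the backend rather than at the slurmctld cache.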