Ticket 22454 - Issue with "User's group not permitted" Error Post-SSSD/GPFS Upgrades
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 24.05.4
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Connor
Reported: 2025-03-28 07:31 MDT by Barry Chiu
Modified: 2025-03-28 09:53 MDT

See Also:
Site: Northwestern

Description Barry Chiu 2025-03-28 07:31:32 MDT
Hi SchedMD,

We are experiencing a transient issue in production. After recent GPFS and SSSD updates on our Slurm servers, job submissions with sbatch sometimes return the error "User's group not permitted to use this partition."

The changes made during this maintenance period were the GPFS and SSSD updates on our Slurm servers. Since then, some sbatch submissions return this error.

Preliminary investigations indicate that user and group resolution may be temporarily failing due to NSS/SSSD caching. No database modification or allocation upgrade scripts have been run.

I am gathering additional details and logs, and will update this ticket soon. Please let me know if you require any immediate information.

Thank you!
Barry
Comment 2 Marcin Stolarek 2025-03-28 08:07:51 MDT
The failures are very likely related to the SSSD issues you've mentioned. The function responsible for verifying user access is validate_group[1]. Internally it keeps a negative cache with a 5 s lifetime, so that the same user/partition combination is not rechecked too frequently after a failed attempt.
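To illustrate what a short-lived negative cache does, here is a minimal sketch. It is not the validate_group implementation: the class name, the `lookup` callback, and the injectable clock are all hypothetical and exist only for this example. Only the 5 s lifetime comes from the description above.

```python
import time

NEGATIVE_TTL_SECONDS = 5.0  # the 5 s lifetime mentioned above


class NegativeCache:
    """Remember recent failed (uid, partition) checks for a short time.

    Illustrative only; this is not how slurmctld stores its cache.
    """

    def __init__(self, ttl=NEGATIVE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._failed_at = {}  # (uid, partition) -> time of the last failure

    def recently_failed(self, uid, partition):
        failed_at = self._failed_at.get((uid, partition))
        return failed_at is not None and self._clock() - failed_at < self._ttl

    def check(self, uid, partition, lookup):
        """Return True if allowed; skip the lookup while a failure is cached."""
        if self.recently_failed(uid, partition):
            return False
        allowed = lookup(uid, partition)
        if not allowed:
            self._failed_at[(uid, partition)] = self._clock()
        return allowed
```

Under this model, a single failed lookup would make retries for the same user and partition fail for up to 5 seconds, even if the directory backend has already recovered.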

When the check succeeds, the positive entry is cached too. Group membership is verified on each job submission attempt, but the cached lists are also updated periodically. This behavior is governed by:
>       GroupUpdateForce
>              If set to a non-zero value, then information about which users are members of groups allowed to use a partition will be updated periodically, even when there have been no changes to the /etc/group file. If set to zero, group member information will be updated only after the /etc/group file is updated. The default value is 1. Also see the GroupUpdateTime parameter.
>
>       GroupUpdateTime
>              Controls how frequently information about which users are members of groups allowed to use a partition will be updated, and how long user group membership lists will be cached. The time interval is given in seconds with a default value of 600 seconds. A value of zero will prevent periodic updating of group membership information. Also see the GroupUpdateForce parameter.
>
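One way to read how the two parameters quoted above interact is the small decision function below. It is a sketch of the man-page text only, not slurmctld code; the function name and its arguments are hypothetical, and the defaults mirror the documented values (600 seconds and 1).

```python
def should_refresh_groups(seconds_since_update, group_update_time=600,
                          group_update_force=1, etc_group_changed=False):
    """Decide whether cached group membership would be refreshed now.

    Follows the quoted documentation; the real scheduler logic may differ.
    """
    if group_update_time == 0:
        # A value of zero prevents periodic updating altogether.
        return False
    if seconds_since_update < group_update_time:
        # The update interval has not elapsed yet.
        return False
    # Forced updates ignore /etc/group; otherwise the file must have changed.
    return group_update_force != 0 or etc_group_changed
```

With the defaults, membership changes that arrive through SSSD rather than /etc/group would be picked up within about ten minutes. With GroupUpdateForce=0 they would not trigger a refresh at all under this reading.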

If you want additional insight into those failures, you may want to increase SlurmctldDebug to the debug2 level. Focus on messages coming from the logging calls below:

>debug2("%s: uid %u not in group permitted to use this partition (%s). groups allowed: %s",
>error("%s: Could not find group with gid %u",                    
>error("%s: Could not find passwd entry for uid %u",              
>debug("UID %u added to AllowGroup %s of partition %s",           
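A small filter along these lines could pull the relevant messages out of the slurmctld log. It is only a sketch: the function names are hypothetical, the substrings are copied from the format strings quoted above, and the exact prefix of each log line (timestamp, level) is not assumed.

```python
import re

# Fixed substrings taken from the logging calls quoted above.
PATTERNS = [
    "not in group permitted to use this partition",
    "Could not find group with gid",
    "Could not find passwd entry for uid",
    "added to AllowGroup",
]
_MATCHER = re.compile("|".join(re.escape(p) for p in PATTERNS))


def group_related_lines(lines):
    """Return the log lines that contain any of the messages of interest."""
    return [line.rstrip("\n") for line in lines if _MATCHER.search(line)]


def scan_log(path):
    """Read a log file and return its group-related lines."""
    with open(path, errors="replace") as log:
        return group_related_lines(log)
```

Calling `scan_log()` with the path configured as SlurmctldLogFile (the location is site-specific) and comparing the timestamps of the matches with the failed sbatch attempts should show whether the rejections line up with failed passwd or group lookups.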

I hope that helps.

cheers,
Marcin
[1] https://github.com/SchedMD/slurm/blob/1757f910502d73edb44f69677b64fde438ad3e2a/src/slurmctld/partition_mgr.c#L1857-L1953
Comment 4 Barry Chiu 2025-03-28 09:45:56 MDT
Thanks Marcin!

I had shut down slurmctld and slurmdbd while my coworker and I swapped SSSD and the /etc/group file in and out.

Does that still make a difference?

Thank you!
Comment 5 Barry Chiu 2025-03-28 09:49:22 MDT
That is, if the servers are down, is there anything to cache? Sorry if I wasn't clear earlier.
Comment 6 Marcin Stolarek 2025-03-28 09:51:43 MDT
The internal slurmctld cache isn't preserved across restarts, so as long as the backend reliably answers calls like `getent passwd ...` and `getent group ...`, slurmctld shouldn't have any issues checking partition access.
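To check whether the backend is in fact stable, a repeated lookup probe can help. The sketch below uses Python's standard `pwd` and `grp` modules, which resolve through the same NSS configuration as `getent`; the function name and its parameters are hypothetical, and it should be run on the slurmctld host with a user and group that are expected to exist.

```python
import grp
import pwd
import time


def probe_nss(username, groupname, attempts=10, delay=1.0, sleep=time.sleep):
    """Resolve a user and a group repeatedly and record every failed lookup.

    Returns a list of (attempt, database, name) tuples, empty if all succeed.
    """
    failures = []
    for attempt in range(attempts):
        try:
            pwd.getpwnam(username)  # raises KeyError when no entry is found
        except KeyError:
            failures.append((attempt, "passwd", username))
        try:
            grp.getgrnam(groupname)  # raises KeyError when no entry is found
        except KeyError:
            failures.append((attempt, "group", groupname))
        if attempt + 1 < attempts:
            sleep(delay)
    return failures
```

An empty result over a long enough window suggests the lookups are stable. Any recorded failures for names that do exist would point at the SSSD/NSS layer rather than at Slurm.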

cheers,
Marcin