Hi SchedMD, We are experiencing a transient issue in production. After recent GPFS and SSSD updates on our Slurm servers, some job submissions with sbatch return the error "User's group not permitted to use this partition." Those GPFS and SSSD updates were the only changes made during this maintenance window. Preliminary investigation indicates that user and group resolution may be failing temporarily due to NSS/SSSD caching. No database modification or allocation upgrade scripts have been run. I am gathering additional details and logs and will update this ticket soon. Please let me know if you need any information in the meantime. Thank you!, Barry
The failures are very likely related to the SSSD issues you've mentioned. The function responsible for user access verification is validate_group[1]; internally it keeps a negative cache with a 5 s lifetime to prevent frequent rechecks of the same user/partition combination after a failed attempt. When a check succeeds, the positive entry is cached too. Group membership is verified on each job submission attempt, but the cached lists are also updated periodically - this behavior is governed by:

> GroupUpdateForce
> If set to a non-zero value, then information about which users are members of groups allowed to use a partition will be updated periodically, even when there have been no changes to the /etc/group file. If set to zero, group member information will be updated only after the /etc/group file is updated. The default value is 1. Also see the GroupUpdateTime parameter.
>
> GroupUpdateTime
> Controls how frequently information about which users are members of groups allowed to use a partition will be updated, and how long user group membership lists will be cached. The time interval is given in seconds with a default value of 600 seconds. A value of zero will prevent periodic updating of group membership information. Also see the GroupUpdateForce parameter.

If you want additional insight into those failures, you may want to increase SlurmctldDebug to the debug2 level. Focus on messages coming from the logging functions below:

> debug2("%s: uid %u not in group permitted to use this partition (%s). groups allowed: %s",
> error("%s: Could not find group with gid %u",
> error("%s: Could not find passwd entry for uid %u",
> debug("UID %u added to AllowGroup %s of partition %s",

I hope that helps.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/1757f910502d73edb44f69677b64fde438ad3e2a/src/slurmctld/partition_mgr.c#L1857-L1953
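For reference, the two parameters quoted above live in slurm.conf; a minimal sketch of how they might be combined while debugging this (the specific values here are illustrative choices, not a recommendation):

```
# slurm.conf fragment (illustrative values)
GroupUpdateForce=1      # refresh group membership even if /etc/group is unchanged
GroupUpdateTime=120     # re-resolve AllowGroups membership every 120 s instead of the 600 s default
SlurmctldDebug=debug2   # surface the partition-access log messages quoted above
```

A shorter GroupUpdateTime shrinks the window during which a stale membership list can reject a legitimate user, at the cost of more frequent NSS lookups; `scontrol reconfigure` applies the change without a full restart.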
Thanks Marcin! I had shut down slurmctld and slurmdbd while my coworker and I swapped the SSSD configuration and the /etc/group file in and out. Does the caching still make a difference in that case? Thank you!
That is, if the servers are down, is there anything left to cache? Sorry if I wasn't clearer earlier.
The internal slurmctld cache isn't preserved across restarts, so as long as the backend reliably answers calls like `getent passwd ...` and `getent group ...`, slurmctld shouldn't have any issues checking partition access. cheers, Marcin
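A quick way to sanity-check that resolution is stable after the restart, assuming a standard Linux NSS setup (substitute the affected user and group; "root" is used below only as a placeholder that resolves on any Linux system):

```shell
# Verify the NSS/SSSD backend answers the lookups slurmctld performs.
getent passwd root   # passwd entry as slurmctld would see it
getent group root    # group entry, including the member list
id -Gn root          # supplementary groups resolved for the user
```

Running these a few times against the affected user while watching the slurmctld debug2 messages should show whether the backend itself is intermittently failing.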