Hi, after enabling the nss_slurm plugin we are experiencing the following behavior:

* `srun whoami` returns the correct username.
* `sbatch --wrap="whoami"` returns `nobody`.
* `sbatch --wrap="srun whoami"` returns the correct username.

Our nsswitch.conf looks like:

```
...
passwd: slurm files sss
group:  slurm files sss
shadow: files sss
...
```

Additional details and logs:

Submitting a job with sbatch from a local user (ec2-user) looks good:

```shell
[ec2-user@ip-192-168-39-241 ~]$ sbatch --wrap="id"
Submitted batch job 1
[ec2-user@ip-192-168-39-241 ~]$ cat slurm-1.out
uid=1000(ec2-user) gid=1000(ec2-user) groups=1000(ec2-user),4(adm),10(wheel),190(systemd-journal)
```

Submitting a job with sbatch from a domain user (PclusterUser1) returns `nobody`:

```shell
[PclusterUser1@ip-192-168-39-241 ~]$ sbatch --wrap="id"
Submitted batch job 2
[PclusterUser1@ip-192-168-39-241 ~]$ cat slurm-2.out
uid=1896801142(nobody) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

Submitting a job with sbatch+srun from a domain user (PclusterUser1) looks good:

```shell
[PclusterUser1@ip-192-168-39-241 ~]$ sbatch --wrap="srun id"
Submitted batch job 3
[PclusterUser1@ip-192-168-39-241 ~]$ cat slurm-3.out
uid=1896801142(PclusterUser1) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

Submitting a job with srun from a domain user (PclusterUser1) looks good:

```shell
[PclusterUser1@ip-192-168-39-241 ~]$ srun id
uid=1896801142(PclusterUser1) gid=1896800513(Domain Users) groups=1896800513(Domain Users)
```

slurmd log on the compute node:

```
[2022-02-02T14:12:07.154] error: Node configuration differs from hardware: CPUs=4:4(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:2(hw)
[2022-02-02T14:12:07.160] CPU frequency setting not configured for this node
[2022-02-02T14:12:07.165] slurmd version 21.08.5 started
[2022-02-02T14:12:07.170] slurmd started on Wed, 02 Feb 2022 14:12:07 +0000
[2022-02-02T14:12:07.170] CPUs=4 Boards=1 Sockets=4 Cores=1 Threads=1 Memory=7623 TmpDisk=35827 Uptime=70 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-02-02T14:40:34.888] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1
[2022-02-02T14:40:34.888] task/affinity: batch_bind: job 1 CPU input mask for node: 0x1
[2022-02-02T14:40:34.888] task/affinity: batch_bind: job 1 CPU final HW mask for node: 0x1
[2022-02-02T14:40:34.888] Launching batch job 1 for UID 1000
[2022-02-02T14:40:34.979] [1.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:40:34.981] [1.batch] done with job
[2022-02-02T14:42:20.026] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 2
[2022-02-02T14:42:20.026] task/affinity: batch_bind: job 2 CPU input mask for node: 0x1
[2022-02-02T14:42:20.026] task/affinity: batch_bind: job 2 CPU final HW mask for node: 0x1
[2022-02-02T14:42:20.027] Launching batch job 2 for UID 1896801142
[2022-02-02T14:42:20.065] [2.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:42:20.067] [2.batch] done with job
[2022-02-02T14:43:06.094] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 3
[2022-02-02T14:43:06.094] task/affinity: batch_bind: job 3 CPU input mask for node: 0x1
[2022-02-02T14:43:06.094] task/affinity: batch_bind: job 3 CPU final HW mask for node: 0x1
[2022-02-02T14:43:06.094] Launching batch job 3 for UID 1896801142
[2022-02-02T14:43:06.735] launch task StepId=3.0 request from UID:1896801142 GID:1896800513 HOST:192.168.102.117 PORT:45550
[2022-02-02T14:43:06.736] task/affinity: lllp_distribution: JobId=3 implicit auto binding: sockets,one_thread, dist 1
[2022-02-02T14:43:06.736] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2022-02-02T14:43:06.736] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [3]: mask_cpu,one_thread, 0x1
[2022-02-02T14:43:06.753] [3.0] done with job
[2022-02-02T14:43:06.760] [3.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-02-02T14:43:06.762] [3.batch] done with job
[2022-02-02T14:55:00.947] launch task StepId=4.0 request from UID:1896801142 GID:1896800513 HOST:192.168.39.241 PORT:45796
[2022-02-02T14:55:00.947] task/affinity: lllp_distribution: JobId=4 implicit auto binding: sockets,one_thread, dist 8192
[2022-02-02T14:55:00.947] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2022-02-02T14:55:00.947] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [4]: mask_cpu,one_thread, 0x1
[2022-02-02T14:55:00.987] [4.0] done with job
```
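The pattern above is consistent with the `slurm` NSS source only answering lookups inside job steps launched via srun, while the batch script itself falls through its configured sources unresolved. A rough illustration of that source-ordered lookup (plain Python sketch; the tables and the `nobody` fallback are hypothetical stand-ins, not Slurm's actual code):

```python
# Sketch of an NSS-style lookup walking sources in order,
# like the "passwd: slurm files sss" line in nsswitch.conf.
def lookup(uid, sources):
    """Return the first username any configured source knows for uid."""
    for name, table in sources:
        if uid in table:
            return table[uid]
    return "nobody"  # an unresolved uid is displayed as "nobody"

# Inside an srun-launched step, nss_slurm knows the user from the launch request:
step_sources = [("slurm", {1896801142: "PclusterUser1"}),
                ("files", {1000: "ec2-user"})]

# Inside the batch script (the buggy case), the slurm source answers nothing
# and the domain user is in neither local files nor the responding sources:
batch_sources = [("slurm", {}),
                 ("files", {1000: "ec2-user"})]

print(lookup(1896801142, step_sources))   # PclusterUser1
print(lookup(1896801142, batch_sources))  # nobody
```

This mirrors the observed outputs: `srun id` resolves the domain user, while `sbatch --wrap="id"` prints `nobody` for the same uid.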
Hi Francesco,

We've landed a patch, available starting in 21.08.6, that fixes this issue: https://github.com/SchedMD/slurm/commit/d567b0c

I'll resolve this ticket for now, but if you find that the problem persists, please let us know!

Thanks!
--Tim