Created attachment 25496 [details]
/etc/pam.d/password-auth

Hi,

We are having an issue where a user can SSH into a node with an active job and access all GPUs on that node, i.e. 'nvidia-smi' lists every GPU. CPU fencing works correctly: in the SSH session, 'nproc' reports the allocated core count and 'numactl --show' returns the correct 'physcpubind'. I looked at bug 6411 but couldn't figure it out, so I am reaching out.

Step 1: Start an interactive session requesting 1 GPU

salloc -J interact -N 1-1 -n 4 --time=30:00 --gres=gpu:1 --mem=20g -p gpu -C ampere srun --pty bash

Step 2: Inside the allocation, only the requested GPU is visible:

[ccvdemo@gpu2108 ~]$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-582f326c-9927-e90a-7e87-e33dcbec2fc9)

If I SSH into the node instead, all eight GPUs are visible:

[ccvdemo@login006 ~]$ ssh gpu2108
[ccvdemo@gpu2108 ~]$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-6da44192-c454-700f-279f-2b1a7a94f302)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-a38a23eb-5d64-1960-771b-6a07d2b0e706)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-f39da23f-48a4-c9bf-31b0-d101a7f45adb)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-8cfe039b-9b9d-8304-1d9a-3e15f3545f8c)
GPU 4: NVIDIA GeForce RTX 3090 (UUID: GPU-582f326c-9927-e90a-7e87-e33dcbec2fc9)
GPU 5: NVIDIA GeForce RTX 3090 (UUID: GPU-f2c42d9c-9773-c022-582d-94765eaebcf7)
GPU 6: NVIDIA GeForce RTX 3090 (UUID: GPU-914d1d97-3563-4af0-d234-508e8c584781)
GPU 7: NVIDIA GeForce RTX 3090 (UUID: GPU-0f190ee8-0829-9871-5568-26c494fdc378)

Attachments:
/etc/pam.d/sshd
/etc/pam.d/password-auth

Please let me know if you need any details. Thank you very much!
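One quick sanity check relevant to the report above: Slurm typically exports CUDA_VISIBLE_DEVICES inside the job step, while a plain SSH login shell will not have it set, so any GPU restriction in the SSH session must come from the cgroup devices controller rather than the environment. A minimal sketch (the '<unset>' placeholder is just for illustration):

```shell
# Print CUDA_VISIBLE_DEVICES, or a placeholder when it is not set.
# In the srun step this is usually "0" for a single-GPU allocation;
# in the SSH session it will typically be unset.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
```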
Hi! I think you forgot to attach the sshd file. I am also going to ask for slurm.conf, cgroup.conf, and the result of:

cat /proc/self/cgroup

after you have logged in with SSH into the node with a job.
Created attachment 25508 [details] cgroup.conf
Created attachment 25509 [details] /etc/pam.d/sshd
Created attachment 25510 [details] slurm.conf
Hi,

Here is the output of /proc/self/cgroup.

From the interactive allocation:

[ccvdemo@gpu717 ~]$ cat /proc/self/cgroup
11:cpuset:/slurm/uid_140447539/job_5367608/step_0
10:perf_event:/
9:memory:/slurm/uid_140447539/job_5367608/step_0
8:blkio:/system.slice/slurmd.service
7:net_prio,net_cls:/
6:pids:/system.slice/slurmd.service
5:hugetlb:/
4:cpuacct,cpu:/system.slice/slurmd.service
3:devices:/slurm/uid_140447539/job_5367608/step_0
2:freezer:/slurm/uid_140447539/job_5367608/step_0
1:name=systemd:/system.slice/slurmd.service

From the SSH session:

cat /proc/self/cgroup
11:cpuset:/slurm/uid_140447539/job_5367608/step_extern
10:perf_event:/
9:memory:/user.slice
8:blkio:/user.slice
7:net_prio,net_cls:/
6:pids:/user.slice
5:hugetlb:/
4:cpuacct,cpu:/user.slice
3:devices:/user.slice
2:freezer:/slurm/uid_140447539/job_5367608/step_extern
1:name=systemd:/user.slice/user-140447539.slice/session-15153.scope
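The telling difference in the two listings is the devices controller: in the allocation it sits under the Slurm hierarchy, but in the SSH session it has moved to /user.slice. A small filter over a snapshot of the SSH output (paths copied from the report) makes the escaped controllers easy to spot:

```shell
# Three representative lines from the SSH session above; filtering out
# entries still under /slurm/ leaves the controllers that systemd moved
# into user.slice. The devices controller is the one that matters for
# GPU fencing.
snapshot='11:cpuset:/slurm/uid_140447539/job_5367608/step_extern
3:devices:/user.slice
2:freezer:/slurm/uid_140447539/job_5367608/step_extern'
printf '%s\n' "$snapshot" | grep -v '/slurm/'
# -> 3:devices:/user.slice
```

On a live node the same check is simply: grep -v '/slurm/' /proc/self/cgroup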
Hello Prabhjyot,

I see that in your password-auth file you have the following line:

-session optional pam_systemd.so

This loads pam_systemd.so, which will "steal" processes from the Slurm cgroup into systemd-managed ones. Those cgroups have no devices limitation and thus will not enforce the GPU restriction. What is needed is to completely remove (or comment out) all the pam_systemd lines in the PAM files. I guess this was your intention when including the '-' character in front of the line, but all that does is tell PAM not to fail if the module is not found; when the module is present it is still loaded.

Greetings.
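A minimal sketch of the fix, demonstrated on a scratch copy rather than the real /etc/pam.d files (the sample lines and the /tmp path are illustrative; the sed pattern matches the pam_systemd session line with or without the leading '-'):

```shell
# Build a two-line sample resembling the reported password-auth file.
printf '%s\n' \
  'session     optional      pam_keyinit.so revoke' \
  '-session    optional      pam_systemd.so' > /tmp/password-auth.sample

# Prefix any pam_systemd session line with '#' to comment it out;
# '&' in the replacement re-inserts the whole matched line.
sed 's/^-\{0,1\}session[[:space:]].*pam_systemd\.so/#&/' /tmp/password-auth.sample
```

After the edit the pam_keyinit line is unchanged and the pam_systemd line begins with '#', so PAM skips it entirely and login sessions stay in the Slurm-managed cgroup.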
Thank you so much! Exactly, the '-' was the intention, but I didn't realize the line needs to be commented out. That did the trick. Appreciate your help.

Regards,
Singh
Hello,

I'm happy to help. Closing this bug as infogiven. Do not hesitate to contact us if you encounter more issues.

Regards.