| Summary: | '--gres gpu:<N>' is not handled upon separate SSH Session | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Prabhjyot Saluja <prabhjyot_saluja> |
| Component: | GPU | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 20.02.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Brown Univ | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | /etc/pam.d/password-auth, cgroup.conf, /etc/pam.d/sshd, slurm.conf | | |
Description

Prabhjyot Saluja 2022-06-14 07:13:54 MDT

Comment (support):
Hi! I think you forgot to attach the sshd file. I am also going to ask for slurm.conf, cgroup.conf, and the output of `cat /proc/self/cgroup` after you have logged in with SSH into the node while a job is running.

Comment (Prabhjyot Saluja):
Created attachment 25508 [details] cgroup.conf
Created attachment 25509 [details] /etc/pam.d/sshd
Created attachment 25510 [details] slurm.conf

Hi,

Here is the output of /proc/self/cgroup.

From the interact allocation:

```
[ccvdemo@gpu717 ~]$ cat /proc/self/cgroup
11:cpuset:/slurm/uid_140447539/job_5367608/step_0
10:perf_event:/
9:memory:/slurm/uid_140447539/job_5367608/step_0
8:blkio:/system.slice/slurmd.service
7:net_prio,net_cls:/
6:pids:/system.slice/slurmd.service
5:hugetlb:/
4:cpuacct,cpu:/system.slice/slurmd.service
3:devices:/slurm/uid_140447539/job_5367608/step_0
2:freezer:/slurm/uid_140447539/job_5367608/step_0
1:name=systemd:/system.slice/slurmd.service
```

From the SSH session:

```
cat /proc/self/cgroup
11:cpuset:/slurm/uid_140447539/job_5367608/step_extern
10:perf_event:/
9:memory:/user.slice
8:blkio:/user.slice
7:net_prio,net_cls:/
6:pids:/user.slice
5:hugetlb:/
4:cpuacct,cpu:/user.slice
3:devices:/user.slice
2:freezer:/slurm/uid_140447539/job_5367608/step_extern
1:name=systemd:/user.slice/user-140447539.slice/session-15153.scope
```

Comment (support):
Hello Prabhjyot,

I see that your password-auth file contains the following line:

```
-session optional pam_systemd.so
```

This loads pam_systemd.so, which "steals" the processes from the Slurm cgroup into systemd ones. Those cgroups have no devices limitation and therefore will not enforce the GPU restriction. What is needed is to completely remove (or comment out) all the pam_systemd lines in the PAM files. I guess that including the `-` character in front of the line was your intention here, but all that character does is keep PAM from failing if the module is not found.

Greetings.

Comment (Prabhjyot Saluja):
Thank you so much! That was exactly the intention of the `-`, but I didn't realize the line needs to be commented out. That did the trick. Appreciate your help.

Regards,
Singh

Comment (support):
Hello,

I'm happy to help; closing this bug as infogiven. Do not hesitate to contact us if you encounter more issues.

Regards.
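The fix described above (commenting out every pam_systemd line in the PAM stacks) can be sketched as a small shell session. This is a minimal illustration, not the exact commands used at the site: `/tmp/pam_demo` is a hypothetical stand-in for files like `/etc/pam.d/sshd` and `/etc/pam.d/password-auth`, which you should back up before editing on a real system.

```shell
# Create a hypothetical PAM fragment for demonstration purposes only.
cat > /tmp/pam_demo <<'EOF'
session    required     pam_limits.so
-session   optional     pam_systemd.so
EOF

# Prefix any uncommented line that loads pam_systemd.so with '#'.
# The same sed expression can be applied to the real PAM files.
sed -i 's/^\([^#].*pam_systemd\.so.*\)$/# \1/' /tmp/pam_demo

# The pam_systemd line is now commented out; other lines are untouched.
cat /tmp/pam_demo
```

After commenting the lines out, a fresh SSH login into a node with a running job should show the `devices` controller pointing back at the Slurm job cgroup in `/proc/self/cgroup`, rather than at `/user.slice`.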