Created attachment 9017 [details]
Generic resource management configuration

Hi,

We have recently added 18 new GPU nodes (2 CPU sockets of 14 cores each, 4 Nvidia V100 cards) to our `iris` cluster and aligned both the generic resource management configuration and the general configuration. For example:

```
$> grep -i GRES /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=iris-[169-186] CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=772614 Feature=skylake,volta Gres=gpu:volta:4 State=UNKNOWN

$> cat /etc/slurm/gres.conf
# COMPUTE NODES WITH 4xVOLTA V100 32GB SXM2
NodeName=iris-[169-186] Name=gpu Type=volta File=/dev/nvidia[0-3]
```

Things are running fine except when attempting to join/connect to a running job involving GRES gpu reservations. For instance, with an interactive job running on a GPU node and reserving at least one of the GPU cards (two in the example below), we can see that the number of allocated cards is indeed restricted:

```
(access) $> srun -p gpu -N 1 --ntasks-per-node 2 -c 14 --gres gpu:2 --pty bash -i
(gpunode)(249438 1N/2T/28CN) $> echo $CUDA_VISIBLE_DEVICES
0,1
(gpunode)(249438 1N/2T/28CN) $> nvidia-smi
Sun Jan 27 22:42:52 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0    45W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Now, when trying to access the reserved node from a separate terminal using either `srun --jobid [...]` or a direct SSH, these reservation constraints are no longer effective or even usable. More specifically:

* Direct SSH does not restrict GPU access -- could it be a problem with slurm_pam_adopt?

```
(access) $> squeue -u $USER
 JOBID PARTITION NAME     USER ST TIME    NODES NODELIST(REASON)
249438       gpu bash svarrett  R 0:08 1:59:52  1 iris-183

(access) $> ssh iris-183
(iris-183) $> nvidia-smi     # shows access to ALL 4 GPUs ????
Sun Jan 27 22:43:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0    45W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:1D:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:1E:00.0 Off |                    0 |
| N/A   38C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

* Access with `sjoin $JOBID` (see the definition below, or `srun --jobid $JOBID --pty bash -i`) to initiate a job step under an already allocated job with job id $JOBID does not work at all (unlike for any job reserved without a generic resource):

```
(access) $> srun -v --jobid 249438 --pty bash -i
[...]
srun: remote command        : `bash -i'
srun: jobid 249438: nodes(1):`iris-183', cpu counts: 28(x1)     # No GRES mention?
srun: Job 249438 step creation temporarily disabled, retrying   # message appears after some timeout
srun: Job 249438 step creation still disabled, retrying
srun: Job 249438 step creation still disabled, retrying
srun: Job 249438 step creation still disabled, retrying
srun: error: Unable to create step for job 249438: Job/step already completing or completed
```

Could it be linked to the absence of `/dev/nvidia*` in `cgroup_allowed_devices_file.conf` (we have set `ConstrainDevices=yes` in `/etc/slurm/slurm.conf`)?

We have also noticed a potential inconsistency that might be related to this problem: the allocated GRES stored in the Slurm database (see `AllocGRES` below vs. `ReqGRES`) is not consistent -- it might relate to Bug #6366 (AllocGRES id recorded as 7696487). For instance, for the above-mentioned job:

```
$> sacct -j 249438 --format User,JobID,Jobname,partition,state,time,start,elapsed,ReqGRES,AllocGRES
     User        JobID    JobName  Partition      State  Timelimit               Start    Elapsed      ReqGRES    AllocGRES
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------ ------------
svarrette       249438       bash        gpu    RUNNING   02:00:00 2019-01-27T22:41:58   00:07:07        gpu:2    7696487:2
          249438.exte+     extern               RUNNING             2019-01-27T22:41:58   00:07:07        gpu:2    7696487:2
              249438.0       bash               RUNNING             2019-01-27T22:41:58   00:07:07        gpu:2    7696487:2
```

_Note_: we have the following definition for the `sjoin` utility mentioned above:

```
sjoin(){
   if [[ -z $1 ]]; then
      echo "Job ID not given."
   else
      JOBID=$1
      [[ -n $2 ]] && NODE="-w $2"
      srun --jobid $JOBID $NODE --pty bash -i
   fi
}
```
Created attachment 9018 [details] cgroup support configuration file
Created attachment 9019 [details] cgroup allowed devices
Hi,

slurm_pam_adopt doesn't set CUDA_VISIBLE_DEVICES. In any case, we recommend enforcement based on cgroups; relying only on environment variables is not safe.

Could you send me the content of:

```
/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<job_id>/devices.list
/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<job_id>/step_0/devices.list
/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<job_id>/step_extern/devices.list
```
Here is the content for the job with `--gres gpu:2`:

```
[root@iris-183 ~]# cat /sys/fs/cgroup/devices/slurm/uid_${uid}/job_${job_id}/devices.list
a *:* rwm
[root@iris-183 ~]# cat /sys/fs/cgroup/devices/slurm/uid_${uid}/job_${job_id}/step_0/devices.list
a *:* rwm
[root@iris-183 ~]# cat /sys/fs/cgroup/devices/slurm/uid_${uid}/job_${job_id}/step_extern/devices.list
a *:* rwm
```
Hi,

Could you check which cgroup your process is in after using ssh to the node with an existing job?

```
cat /proc/self/cgroup
```

My tests of cgroup devices and pam_slurm show that both work fine, even if devices.list contains "a *:* rwm" (after this commit the values inside this file can be incomplete: https://git.sphere.ly/santhosh/kernel_cyanogen_msm8916/commit/ad676077a2ae4af4bb6627486ce19ccce04f1efe).

Dominik
From within a separate SSH connection to the reserved node:

```
$> cat /proc/self/cgroup
22:hugetlb:/
21:memory:/
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252130/step_extern
18:perf_event:/
17:blkio:/
16:pids:/user.slice
15:cpuacct,cpu:/
14:devices:/
13:cpuset:/slurm/uid_5000/job_252130/step_extern
1:name=systemd:/user.slice/user-5000.slice/session-23452.scope
```

For the sshd parent process specifically:

```
# parent pid (third field) extracted from the PID of the current shell (`$$`)
$> cat /proc/$(ps --no-headers -fp $$ | awk '{print $3}')/cgroup
22:hugetlb:/
21:memory:/
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252130/step_extern
18:perf_event:/
17:blkio:/
16:pids:/user.slice
15:cpuacct,cpu:/
14:devices:/
13:cpuset:/slurm/uid_5000/job_252130/step_extern
1:name=systemd:/user.slice/user-5000.slice/session-23452.scope
```

For comparison, here is the content of /proc/self/cgroup from within the native job:

```
$> cat /proc/self/cgroup
22:hugetlb:/
21:memory:/slurm/uid_5000/job_252130/step_0/task_0
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252130/step_0
18:perf_event:/
17:blkio:/
16:pids:/system.slice/slurmd.service
15:cpuacct,cpu:/slurm/uid_5000/job_252130/step_0/task_0
14:devices:/slurm/uid_5000/job_252130/step_0
13:cpuset:/slurm/uid_5000/job_252130/step_0
1:name=systemd:/system.slice/slurmd.service
```
Hi Could you send me /etc/pam.d/sshd and all included files? Maybe the problem is somewhere in the pam_adopt config. Dominik
Sure.

```
$> pdsh -g gpu "rpm -qa | grep slurm-pam" | dshbak -c
----------------
iris-[169-186]
----------------
slurm-pam_slurm-17.11.12-1.el7.x86_64
```

```
$> pdsh -g gpu "rpm -ql slurm-pam_slurm" | dshbak -c
----------------
iris-[169-186]
----------------
/lib64/security/pam_slurm.so
/lib64/security/pam_slurm_adopt.so
```

```
$> pdsh -g gpu "cat /etc/pam.d/sshd" | dshbak -c
----------------
iris-[169-186]
----------------
account    required     pam_slurm_adopt.so action_adopt_failure=deny action_generic_failure=deny
account    sufficient   pam_access.so
#%PAM-1.0
auth       required     pam_sepermit.so
auth       substack     password-auth
auth       include      postlogin
# Used with polkit to reauthorize users in remote sessions
-auth      optional     pam_reauthorize.so prepare
account    required     pam_nologin.so
account    include      password-auth
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    include      password-auth
session    include      postlogin
# Used with polkit to reauthorize users in remote sessions
-session   optional     pam_reauthorize.so prepare
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# END AUTOGENERATED SECTION -- DO NOT REMOVE
```

Is there any file you want specifically?
Hi,

/etc/pam.d/password-auth will be enough.
```
$> pdsh -g gpu "cat /etc/pam.d/password-auth" | dshbak -c
----------------
iris-[169-186]
----------------
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so nullok try_first_pass
auth        requisite     pam_succeed_if.so uid >= 1000 quiet_success
auth        sufficient    pam_ldap.so use_first_pass
auth        required      pam_deny.so

account     required      pam_unix.so broken_shadow
account     sufficient    pam_localuser.so
account     sufficient    pam_succeed_if.so uid < 1000 quiet
account     [default=bad success=ok user_unknown=ignore] pam_ldap.so
account     required      pam_permit.so

password    requisite     pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type=
password    sufficient    pam_unix.so sha512 shadow nullok try_first_pass use_authtok
password    sufficient    pam_ldap.so use_authtok
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
-session    optional      pam_systemd.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so
session     optional      pam_ldap.so
```
Hi,

Could you try stopping systemd-logind and check whether, after that, the ssh process is properly attached to the cgroups?

```
systemctl stop systemd-logind
systemctl mask systemd-logind
```

Dominik
So, assuming I did it in the correct order:

1. Create a reservation restricted to a single node for the test:

```
$> scontrol create res Reservation=slurmbug6411 StartTime=2019-01-31T18:55:00 Duration=1:00:00 Flags=Maint,Ignore_Jobs PartitionName=gpu Accounts=ulhpc Nodes=iris-181
```

2. Start an interactive job with `--gres gpu:2` as before (job id: 252576):

```
$> srun --reservation slurmbug6411 -p gpu --qos qos-gpu -N 1 --ntasks-per-node 2 -c 14 --gres gpu:2 --pty bash -i
```

3. Check, from a separate ssh session, the cgroup:

```
$> ssh iris-181
$> cat /proc/$(ps --no-headers -fp $$ | awk '{print $3}')/cgroup
22:hugetlb:/
21:memory:/
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252576/step_extern
18:perf_event:/
17:blkio:/
16:pids:/user.slice
15:cpuacct,cpu:/
14:devices:/
13:cpuset:/slurm/uid_5000/job_252576/step_extern
1:name=systemd:/user.slice/user-5000.slice/session-24400.scope
```

4. Log out from the separate SSH session.

5. Log in as root to the node and stop `systemd-logind` as indicated:

```
$> ssh iris-181
$> systemctl stop systemd-logind.service
$> systemctl mask systemd-logind.service
Created symlink from /etc/systemd/system/systemd-logind.service to /dev/null.
```

6. Log in again as a regular user from the frontend with a separate ssh to the reserved node and check the cgroup.
Step 6 is depicted below and seems OK:

```
$> ssh iris-181
~> cat /proc/$(ps --no-headers -fp $$ | awk '{print $3}')/cgroup
22:hugetlb:/
21:memory:/slurm/uid_5000/job_252576/step_extern/task_0
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252576/step_extern
18:perf_event:/
17:blkio:/
16:pids:/system.slice/sshd.service
15:cpuacct,cpu:/slurm/uid_5000/job_252576/step_extern/task_0
14:devices:/slurm/uid_5000/job_252576/step_extern
13:cpuset:/slurm/uid_5000/job_252576/step_extern
1:name=systemd:/system.slice/sshd.service

~> systemctl status systemd-logind.service
● systemd-logind.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead) since Thu 2019-01-31 18:57:18 CET; 6min ago
 Main PID: 41010 (code=killed, signal=TERM)
   Status: "Processing requests..."
```

The same happens if (while leaving the `systemd-logind` service masked) I repeat the procedure from step 2 after having killed job 252576.
Hi,

You can also comment out this line in /etc/pam.d/password-auth:

```
-session    optional      pam_systemd.so
```

I assume that after this nvidia-smi should work fine in the ssh session. I will check whether it is possible to make the pam_slurm_adopt plugin set the CUDA_VISIBLE_DEVICES environment variable. But as I mentioned before, relying only on environment variables is not safe.

Dominik
Dear Dominik,

We can confirm this seems to solve the separate SSH case (the nvidia-smi command is restricted to the cards sharing the same bus IDs as in the initial reservation). Note that the sjoin case remains unsolved.

However, this fix looks like it could have a huge side effect, judging from the description of the service: https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html

Are you sure this is recommended? How do other centers deal with this issue?
Hi,

Yes, this is recommended and required for pam_slurm_adopt to work properly. Check the documentation: https://slurm.schedmd.com/pam_slurm_adopt.html

Some functions of logind are handled by Slurm; others, like polkit, are not commonly used on clusters.

Dominik
Indeed, thanks for the reference; it looks like we missed these guidelines. We will deploy it across the cluster. That just leaves the sjoin issue.
Hi,

By default, a job step tries to allocate all of the generic resources that have been allocated to the job; because you already have one step holding all the GRES, the next step waits for resources. You can add "--gres=gpu:0" or "--gres=none"; then this step should start, but you won't have access to the GPUs.

If you describe your needs in more detail, maybe I will be able to give you some solution.

Dominik
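To illustrate, joining the running job from the earlier example (job id 249438) without requesting any GRES would look like the transcript below; the new step should start immediately since it no longer waits for the GPUs held by step 0:

```
(access) $> srun --jobid 249438 --gres=none --pty bash -i
```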
Traditionally on our site (and hopefully on others), people use `sjoin $jobid` / `srun --jobid [...]` to:

1. connect to the job with the correct SLURM_* variables set, as in the job itself (they are not set in a regular ssh session);
2. monitor the running job (with `htop`, `nvidia-smi`, `nvtop`, etc.) and eventually perform complementary tests based on the SLURM_* variables;
3. (in very few cases) run an additional job step within the existing job allocation.

It is quite important to keep that workflow for GPU nodes as well.
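Combining this workflow with the `--gres=none` workaround mentioned earlier in the thread, a hardened variant of the `sjoin` helper might look like the following sketch (bash; the joining step still sees no GPUs, as noted):

```shell
# Variant of the `sjoin` helper from the original report (a sketch):
# passes --gres=none so the joining step does not block on GPUs already
# held by the job's first step, per the workaround suggested above.
sjoin() {
    if [[ -z $1 ]]; then
        echo "Usage: sjoin <jobid> [node]" >&2
        return 1
    fi
    local jobid=$1
    local node=()
    # Optional second argument selects a specific node of the allocation.
    [[ -n $2 ]] && node=(-w "$2")
    srun --jobid "$jobid" --gres=none "${node[@]}" --pty bash -i
}
```

Using an array for the optional `-w <node>` argument avoids the word-splitting pitfalls of the original `$NODE` variable.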
Hi,

I think there is a side effect to disabling pam_systemd.so: it is what creates the $XDG_RUNTIME_DIR directory (/run/user/$UID). After disabling pam_systemd.so, the directory is no longer created, but the variable is still set (I think it is passed on from the cluster frontend server), which causes some user applications to fail (unless XDG_RUNTIME_DIR is unset explicitly).

Have you also noticed this issue? Do you have a proper solution?

Thank you
Hi,

You can unset the XDG_* variables using a task prolog (https://slurm.schedmd.com/prolog_epilog.html).

Answering the previous questions: currently there is no way to overallocate GRES, which means you can't create (e.g. via sjoin) a step with access to GPUs that are bound to another step. To keep sjoin working with a GRES job you can add "--gres=none".

Currently I am working on adding some Slurm environment variables to the ssh session, but I think this will be available in 19.05.

Dominik
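As an illustration of the task-prolog approach, a minimal sketch follows (the script path and the exact XDG_* variables cleared are assumptions; per the prolog_epilog documentation, an `unset NAME` line printed by the TaskProlog removes NAME from the task environment):

```
#!/bin/sh
# Hypothetical /etc/slurm/task_prolog.sh, referenced by TaskProlog= in
# slurm.conf. slurmd interprets the script's stdout: an "unset NAME"
# line removes NAME from the spawned task's environment.
echo "unset XDG_RUNTIME_DIR"
echo "unset XDG_SESSION_ID"
```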
Hi,

I want to inform you that we are handling the XDG_* problem in a separate bug. In bug 5920 we are considering splitting the pam_adopt module into two contexts. This will allow us to set the right cgroups after pam_systemd and avoid overwriting the cgroup.

Dominik
*** Ticket 6538 has been marked as a duplicate of this ticket. ***
Hi,

I will change this bug to an enhancement. We are working on adding some environment variables to the ssh session, but this will not be done before 19.05.

Dominik
Any update for the planned 19.05 release?
Hi,

Sorry I didn't inform you earlier. This commit injects the SLURM_JOB_ID environment variable into adopted processes: https://github.com/SchedMD/slurm/commit/65fb9dfa10a8763d7

This patch is included in all 19.05 and 20.02 releases. Currently we don't plan to add more environment variables to pam, but SLURM_JOB_ID should be sufficient for properly handling srun after ssh to a node.

Dominik
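With SLURM_JOB_ID available in adopted ssh sessions (assuming a 19.05+ setup with pam_slurm_adopt), the monitoring workflow discussed in this thread can be driven from a plain ssh login; an illustrative transcript, reusing the job from the earlier examples:

```
(access)   $> ssh iris-183
(iris-183) $> srun --jobid $SLURM_JOB_ID --gres=none --pty htop
```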
Several users wish to access at least the nvidia-smi utility when joining their running job. Is there any further way to authorize it?
More specifically, ssh grants access and visibility to all GPU cards. Is there any way to limit it?
Hi,

Currently we have no plan to change the way we handle ssh connections. pam_slurm_adopt already binds the ssh connection to the external step's cgroup. If you use "ConstrainDevices=yes", this should limit access to the GPUs.

Dominik
Now we have configured the high-priority QOS with priority 500, and slurm.conf with the following:

```
SchedulerType=sched/backfill
UnkillableStepTimeout=180
PriorityType=priority/basic
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=180
```

QOS details:

```
      Name    Preempt PreemptMode   Priority
---------- ---------- ----------- ----------
    normal            gang,suspe+        100
datascien+                cluster        100
highprior+     normal gang,suspe+        500
```

Current issue: when submitting a job from the normal QoS and then a job from the highpriority QoS, the two jobs end up time-slicing. However, we need the highpriority job to run to completion first, and only then should the normal job resume. This configuration currently only works for CPU-based jobs.

Requirement: how can we extend this configuration to GPU-based jobs, ensuring the same behavior (high-priority jobs preempt normal jobs, which resume after completion)?
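For reference, QOS settings like those in the table above would typically be applied with `sacctmgr`; a sketch of the relevant commands (the full QOS name "highpriority" is taken from the prose above, since the table output is truncated):

```
$> sacctmgr modify qos highpriority set Priority=500 Preempt=normal
$> sacctmgr modify qos normal set Priority=100
```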