Sorry for the barrage of submissions today, but as you probably guessed, we've upgraded to 17.11, and the transition is not as smooth as we expected, unfortunately. :\

It looks like ConstrainDevices doesn't work anymore in 17.11. Using the exact same (functional) config as we used in 17.02, we can't seem to make device access restrictions work. For instance, when requesting a single GPU, I can see all the GPUs on the node:

$ srun --pty --gres gpu:1 -w sh-112-07 bash
[kilian@sh-112-07 ~]$ nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-27db8534-9b2b-8b1a-5889-9c77c0c7be4e)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-b84691c7-e6e4-33c1-f367-1831e42cf4c6)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-0d9a859c-ce19-78f3-2f87-aade11d14bae)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-59405c9d-6554-d857-1cbd-8b9487464684)

The devices cgroups seem to have been created properly:

[kilian@sh-112-07 ~]$ echo $SLURM_JOBID
4542325
[kilian@sh-112-07 ~]$ lscgroup | grep devices
devices:/
devices:/slurm
devices:/slurm/uid_215845/job_4542325
devices:/slurm/uid_215845/job_4542325/step_0
devices:/slurm/uid_215845/job_4542325/step_extern
[kilian@sh-112-07 ~]$ tree /sys/fs/cgroup/devices/slurm/uid_215845/job_4542065/
/sys/fs/cgroup/devices/slurm/uid_215845/job_4542065/
├── cgroup.clone_children
├── cgroup.event_control
├── cgroup.procs
├── devices.allow
├── devices.deny
├── devices.list
├── notify_on_release
├── step_extern
│   ├── cgroup.clone_children
│   ├── cgroup.event_control
│   ├── cgroup.procs
│   ├── devices.allow
│   ├── devices.deny
│   ├── devices.list
│   ├── notify_on_release
│   └── tasks
└── tasks

The GPUs are correctly detected on the node:

slurmd[109644]: Gres Name=gpu Type=(null) Count=2
slurmd[109644]: Gres Name=gpu Type=(null) Count=2
slurmd[109644]: gpu device number 0(/dev/nvidia0):c 195:0 rwm
slurmd[109644]: gpu device number 1(/dev/nvidia1):c 195:1 rwm
slurmd[109644]: gpu device number 2(/dev/nvidia2):c 195:2 rwm
slurmd[109644]: gpu device number 3(/dev/nvidia3):c 195:3 rwm

The cgroup config is loaded, including the devices part:

slurmd[109644]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd[109644]: debug: CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
slurmd[109644]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd[109644]: debug: task/cgroup: now constraining jobs allocated cores
slurmd[109644]: debug: task/cgroup/memory: total:515705M allowed:100%(enforced), swap:0%(enforced), max:100%(515705M) max+swap:100%(1031410M) min:30M kmem:100%(515705M permissive) min:30M swappiness:18446744073709551614(set)
slurmd[109644]: debug: task/cgroup: now constraining jobs allocated memory
slurmd[109644]: debug: task/cgroup: now constraining jobs allocated devices
slurmd[109644]: debug: task/cgroup: loaded

Yet, there's no mention of GRES devices when the job starts:

slurmd[109653]: debug: Checking credential with 448 bytes of sig data
slurmd[109653]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd[109653]: debug: Calling /usr/sbin/slurmstepd spank prolog
slurmd[109653]: debug: [job 4542325] attempting to run prolog [/etc/slurm/scripts/prolog.sh]
slurmd[109653]: _run_prolog: run job script took usec=38237
slurmd[109653]: _run_prolog: prolog with lock for job 4542325 ran for 0 seconds
slurmstepd[109874]: task/cgroup: /slurm/uid_215845/job_4542325: alloc=12800MB mem.limit=12800MB memsw.limit=12800MB
slurmstepd[109874]: task/cgroup: /slurm/uid_215845/job_4542325/step_extern: alloc=12800MB mem.limit=12800MB memsw.limit=12800MB
slurmd[109653]: launch task 4542325.0 request from 215845.32264@10.10.0.61 (port 54987)
slurmd[109653]: debug: Checking credential with 492 bytes of sig data
slurmd[109653]: debug: Waiting for job 4542325's prolog to complete
slurmd[109653]: debug: Finished wait for job 4542325's prolog to complete
slurmstepd[109881]: task/cgroup: /slurm/uid_215845/job_4542325: alloc=12800MB mem.limit=12800MB memsw.limit=12800MB
slurmstepd[109881]: task/cgroup: /slurm/uid_215845/job_4542325/step_0: alloc=12800MB mem.limit=12800MB memsw.limit=12800MB
slurmstepd[109881]: in _window_manager

In previous versions, there was a part that said:

slurmstepd[6248]: Allowing access to device c 195:0 rwm
slurmstepd[6248]: Allowing access to device c 195:1 rwm
slurmstepd[6248]: Not allowing access to device c 195:2 rwm
slurmstepd[6248]: Not allowing access to device c 195:3 rwm

which doesn't appear in the logs anymore.

Is there any additional configuration required in 17.11 to use the devices cgroup constraints?

Thanks!
-- Kilian
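For context, device constraining is driven by a couple of cgroup.conf parameters. A minimal sketch of the relevant portion of such a configuration (the exact values and the AllowedDevicesFile path are assumptions for illustration, not the actual site config):

```
###
# /etc/slurm/cgroup.conf -- minimal sketch, values assumed
###
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# The parameter at issue: restrict jobs to the GRES devices they were allocated
ConstrainDevices=yes
# Devices every job may access regardless of allocation (path assumed)
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```

With ConstrainDevices=yes and the devices subsystem mounted, slurmstepd is expected to write allow/deny rules for each GRES device into the job's devices cgroup, which is exactly the step missing from the logs above.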
Hey Kilian,

This was found and fixed in Bug 4455, specifically in commits:

ee68721350dc46d62bebc64e86378b06fd95f4a5
0ed03cda5bcf4e0bd5ef8117d4d5ce7fa84a71e3
434acb17c8526bc209626084587303cd5c5b79fa

I'm going to mark this closed as a duplicate of Bug 4455. If these patches don't fix it for you, please reopen the bug.

Thanks,
Brian

*** This ticket has been marked as a duplicate of ticket 4455 ***
Hi Brian,

Thanks for pointing this out, I'll give the patches a try.

Cheers,
-- Kilian
I confirm that the mentioned commits indeed fix the issue. Thanks!

-- Kilian
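For anyone verifying a fix like this, the kernel's view can be checked directly by reading the job's devices.list under the devices cgroup. The sketch below demonstrates the check against sample file content; the cgroup path in the comment and the devices.list content in the variable are assumptions for illustration (on a live node you would read the real file from inside the job):

```shell
#!/bin/sh
# Inside a real job you would read the live cgroup file, e.g.:
#   cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/devices.list
# Sample content (assumed) for a job allocated one GPU on a 4-GPU node:
devices_list='c 1:3 rwm
c 1:5 rwm
c 195:0 rwm'

# NVIDIA GPUs are character devices with major number 195,
# so count the "c 195:*" entries the cgroup allows.
allowed=$(printf '%s\n' "$devices_list" | grep -c '^c 195:')
echo "allowed GPUs: $allowed"
# -> allowed GPUs: 1
```

On a correctly constrained node, the count should match the --gres request (here 1), and `nvidia-smi -L` inside the job should list only that device.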