| Summary: | cgroup devices whitelisting not enforced | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | foufou33 |
| Component: | Other | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 17.02.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | cgroup.conf | | |
Before this ticket can be sent to the support team we need to put a support contract in place for your site. Please let me know if you would like to discuss Slurm support. |
Created attachment 3430 [details] cgroup.conf

Hi,

I'm trying to set up a cluster of GPU nodes with Slurm and I've hit a snag; I don't know if I'm missing something. The operating system is Debian jessie, and I built the 16.05 dist from source (using Sid's source package). Everything works just fine except device isolation with cgroups. On the surface it seems to be working, but nothing is actually enforced, i.e. it can easily be bypassed by setting CUDA_VISIBLE_DEVICES:

```
supp@bart4:~$ srun --gres=gpu:1 --pty bash
supp@bart4:~$ ./cuda/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | grep "^Device"
Device 0: "GeForce GTX TITAN Black"
supp@bart4:~$ echo $CUDA_VISIBLE_DEVICES
0
supp@bart4:~$ export CUDA_VISIBLE_DEVICES=0,1
supp@bart4:~$ ./cuda/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | grep "^Device"
Device 0: "GeForce GTX TITAN Black"
Device 1: "GeForce GTX TITAN Black"
```

The log shows that slurmd is allowing access to only one GPU. Looking through the cgroup file system, it seems that devices.list always contains "a *:* rwm", no matter the level.

Going behind Slurm's back and doing the following in the step's cgroup device directory (/sys/fs/cgroup/devices/uid_XXX/job_XXX/step_XX/devices.*):

```
echo a > devices.deny
echo "c XXX:XXX rwm" > devices.allow    # for all devices I want to allow
```

works just fine, and rewriting CUDA_VISIBLE_DEVICES doesn't grant access to the second GPU.

The kernel is 3.16.x. I tried echoing "x XXX:XXX rwm" into devices.{allow|deny} on a 2.26 kernel, and it behaves differently: the minute something is written to devices.deny, the "a *:* rwm" entry disappears from devices.list (which is not the case on 3.16).

Am I doing something wrong?

Thanks.
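For reference, the manual deny-then-allow sequence described above can be collected into a small helper. This is only a sketch of the workaround, not anything Slurm provides: the function name, the cgroup path, and the device major:minor numbers passed to it are placeholders to be replaced with your site's actual values.

```shell
# Sketch of the manual workaround: deny all devices in the step's device
# cgroup, then re-allow only the entries the step should see.
# Usage (paths/device numbers are placeholders):
#   reallow_devices /sys/fs/cgroup/devices/uid_XXX/job_XXX/step_XX \
#       "c 195:0 rwm" "c 1:3 rwm"
reallow_devices() {
    cgdir="$1"; shift

    # Drop the blanket "a *:* rwm" entry by denying everything first.
    echo a > "$cgdir/devices.deny"

    # Re-allow each requested device (e.g. the granted GPU, /dev/null, ...).
    for dev in "$@"; do
        echo "$dev" > "$cgdir/devices.allow"
    done
}
```

Note that in the real cgroup filesystem, devices.deny and devices.allow are kernel interfaces: each write modifies the whitelist, so successive echoes into devices.allow accumulate entries rather than truncating a regular file.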