Ticket 3015 - cgroup devices white listing not enforced
Summary: cgroup devices white listing not enforced
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.x
Hardware: Linux
: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-08-23 16:52 MDT by foufou33
Modified: 2017-11-03 14:59 MDT (History)
0 users

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
cgroup.conf (243 bytes, text/x-matlab)
2016-08-23 16:52 MDT, foufou33

Description foufou33 2016-08-23 16:52:45 MDT
Created attachment 3430 [details]
cgroup.conf

Hi, 
I'm trying to set up a cluster of GPU nodes with Slurm. I've hit a snag, and I don't really know if I'm missing something.
The operating system is Debian jessie; I built the 16.05 dist from sources (using Sid's source package). Everything works just fine except device isolation with cgroups.

On the surface it seems to be working, but nothing is enforced, i.e., it can easily be bypassed by setting CUDA_VISIBLE_DEVICES:
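For context, device constraining in Slurm is driven by cgroup.conf; a minimal sketch along these lines is what one would expect here (the file path is illustrative, and the attached cgroup.conf may differ):

```
CgroupAutomount=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```

This also assumes TaskPlugin=task/cgroup is set in slurm.conf, since the task/cgroup plugin is what applies the device constraints to each step.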

supp@bart4:~$ srun --gres=gpu:1 --pty bash 
supp@bart4:~$ ./cuda/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | grep "^Device"
Device 0: "GeForce GTX TITAN Black"
supp@bart4:~$ echo $CUDA_VISIBLE_DEVICES 
0
supp@bart4:~$ export  CUDA_VISIBLE_DEVICES=0,1
supp@bart4:~$ ./cuda/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | grep "^Device"
Device 0: "GeForce GTX TITAN Black"
Device 1: "GeForce GTX TITAN Black"


The logs show that slurmd is allowing access to only one GPU.

Looking through the cgroup file system, it seems that devices.list always contains "a *:* rwm", no matter the level.
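A quick way to reproduce this observation is to dump every devices.list under the hierarchy (a sketch; the exact mount point and slurm cgroup layout vary by distro):

```shell
# Print each cgroup's device whitelist under the given root; on an affected
# node, every level shows the permissive "a *:* rwm" entry.
list_device_whitelists() {
    find "$1" -name devices.list | while read -r f; do
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# e.g.: list_device_whitelists /sys/fs/cgroup/devices
```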

Going behind Slurm's back and doing the following in the step's cgroup device dir (/sys/fs/cgroup/devices/uid_XXX/job_XXX/step_XX/devices.*):
echo a > devices.deny
echo "c XXX:XXX rwm" > devices.allow (for all devices I want to allow)
works just fine, and rewriting CUDA_VISIBLE_DEVICES doesn't grant access to the second GPU.
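The manual workaround above can be sketched as a small helper (the cgroup path and the NVIDIA major:minor numbers in the example are assumptions, not values from this ticket; check the node with `ls -l /dev/nvidia*`):

```shell
# deny_all_but: revoke the blanket device rule, then re-allow specific devices.
# $1 = step cgroup dir, remaining args = "major:minor" char devices to allow.
# Requires root when run against a real cgroup directory.
deny_all_but() {
    dir=$1; shift
    echo a > "$dir/devices.deny"                 # drop the "a *:* rwm" whitelist entry
    for dev in "$@"; do
        echo "c $dev rwm" > "$dir/devices.allow" # re-allow one character device
    done
}

# Example (195:0 is typically /dev/nvidia0, 195:255 /dev/nvidiactl):
# deny_all_but /sys/fs/cgroup/devices/uid_1000/job_42/step_0 195:0 195:255
```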


The kernel is 3.16.x.

I tried echoing "x XXX:XXX rwm" into devices.{allow|deny} on a 2.26 kernel, and it behaves differently: the minute something is written to devices.deny, the "a *:* rwm" entry disappears from devices.list (which is not the case on 3.16).

Am I doing something wrong?

Thanks.
Comment 1 Jacob Jenson 2017-11-03 14:59:03 MDT
Before this ticket can be sent to the support team we need to put a support contract in place for your site. Please let me know if you would like to discuss Slurm support.