| Summary: | cgroup devices whitelisting not enforced | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | foufou33 |
| Component: | Other | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 17.02.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | cgroup.conf | | |
Before this ticket can be sent to the support team we need to put a support contract in place for your site. Please let me know if you would like to discuss Slurm support. |
Created attachment 3430 [details] cgroup.conf

Hi,

I'm trying to set up a cluster of GPU nodes with Slurm and I've hit a snag; I don't know if I'm missing something. The operating system is Debian jessie, and I built the 16.05 dist from source (using Sid's source package). Everything works just fine except device isolation with cgroups. On the surface it seems to be working, but nothing is actually enforced, i.e. it can easily be bypassed by setting CUDA_VISIBLE_DEVICES:

```
supp@bart4:~$ srun --gres=gpu:1 --pty bash
supp@bart4:~$ ./cuda/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | grep "^Device"
Device 0: "GeForce GTX TITAN Black"
supp@bart4:~$ echo $CUDA_VISIBLE_DEVICES
0
supp@bart4:~$ export CUDA_VISIBLE_DEVICES=0,1
supp@bart4:~$ ./cuda/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | grep "^Device"
Device 0: "GeForce GTX TITAN Black"
Device 1: "GeForce GTX TITAN Black"
```

The log shows that slurmd is allowing access to only one GPU. Looking through the cgroup file system, it seems that devices.list always contains "a *:* rwm", no matter the level.

Going behind Slurm's back and doing the following in the step's cgroup device directory (/sys/fs/cgroup/devices/uid_XXX/job_XXX/step_XX/devices.*):

```
echo a > devices.deny
echo "c XXX:XXX rwm" > devices.allow    # for all devices I want to allow
```

works just fine, and rewriting CUDA_VISIBLE_DEVICES doesn't grant access to the second GPU.

The kernel is 3.16.x. I tried echoing "x XXX:XXX rwm" into devices.{allow|deny} on a 2.26 kernel, and it behaves differently: the minute something is written to devices.deny, the "a *:* rwm" entry disappears from devices.list (which is not the case on 3.16).

Am I doing something wrong?

Thanks.
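For reference, the manual deny-then-allow sequence described above can be collected into a small helper. This is only a sketch of the workaround, not anything Slurm provides: the function name, the cgroup path, and the device major:minor numbers passed to it are placeholders to be replaced with your site's actual values.

```shell
# Sketch of the manual workaround: deny all devices in the step's device
# cgroup, then re-allow only the entries the step should see.
# Usage (paths/device numbers are placeholders):
#   reallow_devices /sys/fs/cgroup/devices/uid_XXX/job_XXX/step_XX \
#       "c 195:0 rwm" "c 1:3 rwm"
reallow_devices() {
    cgdir="$1"; shift

    # Drop the blanket "a *:* rwm" entry by denying everything first.
    echo a > "$cgdir/devices.deny"

    # Re-allow each requested device (e.g. the granted GPU, /dev/null, ...).
    for dev in "$@"; do
        echo "$dev" > "$cgdir/devices.allow"
    done
}
```

Note that in the real cgroup filesystem, devices.deny and devices.allow are kernel interfaces: each write modifies the whitelist, so successive echoes into devices.allow accumulate entries rather than truncating a regular file.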