Created attachment 8672: slurm.conf

Pretty much the same as bug #3015. We just upgraded to 17.11.8 (Bright 8.1) and CentOS 7.6 (3.10.0-957.1.3.el7) and turned on ConstrainDevices. (We were already using CPU and memory cgroups without problems.) I see that it is correctly parsing the default device list and gres devices and adding/removing them:

...
[2018-12-17T14:31:31.521] [167647.extern] debug2: Default access allowed to device c 202:0 rwm for job
[2018-12-17T14:31:31.521] [167647.extern] debug3: xcgroup_set_param: parameter 'devices.allow' set to 'c 202:0 rwm' for '/sys/fs/cgroup/devices/slurm/uid_1135/job_167647'
[2018-12-17T14:31:31.521] [167647.extern] debug2: Default access allowed to device c 5:2 rwm for job
[2018-12-17T14:31:31.521] [167647.extern] debug3: xcgroup_set_param: parameter 'devices.allow' set to 'c 5:2 rwm' for '/sys/fs/cgroup/devices/slurm/uid_1135/job_167647'
[2018-12-17T14:31:31.521] [167647.extern] debug: Allowing access to device c 195:0 rwm(/dev/nvidia0) for job
[2018-12-17T14:31:31.521] [167647.extern] debug3: xcgroup_set_param: parameter 'devices.allow' set to 'c 195:0 rwm' for '/sys/fs/cgroup/devices/slurm/uid_1135/job_167647'
[2018-12-17T14:31:31.521] [167647.extern] debug: Not allowing access to device c 195:1 rwm(/dev/nvidia1) for job
[2018-12-17T14:31:31.521] [167647.extern] debug3: xcgroup_set_param: parameter 'devices.deny' set to 'c 195:1 rwm' for '/sys/fs/cgroup/devices/slurm/uid_1135/job_167647'

However, devices.list still shows "a *:* rwm", which is the setting it inherits from the root devices cgroup, so all devices are still whitelisted. Manually writing "a" to devices.deny to clear the whitelist seems to work.

Is there somewhere this default deny-all setting needs to be put, either in Slurm or in the system (systemd?), to make this work? Does this need bug #5361 to be fixed first? I feel like I'm just missing some obvious setting.
Created attachment 8673: cgroup.conf
On further testing, the gres deny rules for unallocated /dev/nvidia devices may be working, but the default whitelist is definitely not -- unlisted devices can still be used.
At some point, the way constraining devices in cgroups works changed. You'll find some helpful background here:

https://bugs.schedmd.com/show_bug.cgi?id=5361
https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt

The second paragraph of the kernel document:

"The root device cgroup starts with rwm to 'all'. A child device cgroup gets a copy of the parent. Administrators can then remove devices from the whitelist or add new entries. A child cgroup can never receive a device access which is denied by its parent."

The last paragraph:

"device cgroups is implemented internally using a behavior (ALLOW, DENY) and a list of exceptions. The internal state is controlled using the same user interface to preserve compatibility with the previous whitelist-only implementation. Removal or addition of exceptions that will reduce the access to devices will be propagated down the hierarchy. For every propagated exception, the effective rules will be re-evaluated based on current parent's access rules."

It sounds like there was a previous behavior ("previous whitelist-only implementation"). But with the new behavior, ***the cgroup_allowed_devices file doesn't do anything anymore.***

We haven't yet included the contribution in bug 5361. There is additional work that needs to be done to clean up the task/cgroup plugin.

Currently the plugin does three things:

1. It whitelists everything in the cgroup_allowed_devices file (which isn't needed, since everything is already whitelisted).
2. It whitelists any GRES the job has in its allocation (also not needed).
3. It blacklists every GRES the job does not have in its allocation (this is how devices are constrained).

So, unless you specify a device in your gres.conf, it will remain available to jobs.

Does that answer your question?
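To make the behavior above concrete, here is a small Python model of the "behavior plus exceptions" semantics quoted from the kernel document. It is only a sketch under simplifying assumptions: entries are matched as exact strings such as "c 195:1", wildcards and the rwm access mask are ignored, and the class and function names (DeviceCgroup, replay_slurm_log_rules) are made up for illustration. It is not Slurm or kernel code.

```python
class DeviceCgroup:
    """Toy model of a cgroup v1 devices controller (exact-match entries only)."""

    def __init__(self):
        # A new cgroup copies its parent; here we assume the parent is the
        # root cgroup, which starts out allowing everything ("a *:* rwm").
        self.default_allow = True
        self.exceptions = set()

    def allow(self, dev):
        """Model a write to devices.allow."""
        if dev == "a":
            self.default_allow = True
            self.exceptions.clear()
        elif self.default_allow:
            # Already allowed by default: only removes a matching deny exception.
            self.exceptions.discard(dev)
        else:
            self.exceptions.add(dev)

    def deny(self, dev):
        """Model a write to devices.deny."""
        if dev == "a":
            self.default_allow = False
            self.exceptions.clear()
        elif self.default_allow:
            self.exceptions.add(dev)
        else:
            self.exceptions.discard(dev)

    def can_access(self, dev):
        # An exception inverts the default behavior for that device.
        return self.default_allow != (dev in self.exceptions)


def replay_slurm_log_rules(cg):
    """Apply the same writes that appear in the slurmd log in this report."""
    cg.allow("c 202:0")   # default device list
    cg.allow("c 5:2")     # default device list
    cg.allow("c 195:0")   # allocated GRES (/dev/nvidia0)
    cg.deny("c 195:1")    # unallocated GRES (/dev/nvidia1)


# Default allow-all: only the explicitly denied GRES is blocked.
job = DeviceCgroup()
replay_slurm_log_rules(job)
unlisted = "c 10:200"  # hypothetical device that is in neither config file
print(job.can_access("c 195:1"), job.can_access(unlisted))

# Same writes after clearing the whitelist first (the manual workaround).
strict = DeviceCgroup()
strict.deny("a")
replay_slurm_log_rules(strict)
print(strict.can_access("c 195:0"), strict.can_access(unlisted))
```

In this model the three allow writes change nothing while the default is allow-all, which matches the observation that only the GRES deny rule has any effect. Once "a" has been written to devices.deny, the same allow writes become the whole whitelist.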
That makes sense and sounds like what we're seeing, yes -- only explicit gres devices are removed from the whitelist. We can explicitly set the whitelist in some parent cgroup, or this may be improved in some future release. Thanks.
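For anyone wanting to try the parent-cgroup workaround, the sequence of writes might look like the Python sketch below. This is an untested assumption on my part, not something Slurm provides: the function names are hypothetical, the parent path and device entries are simply taken from the log in this report, and the parent whitelist has to include every GRES device that jobs may be granted, since a child cgroup can never receive access that its parent denies.

```python
import os


def plan_whitelist_writes(cgroup_dir, allowed_entries):
    """Return the ordered (file, value) writes for a deny-all-then-allow setup."""
    # Writing "a" to devices.deny clears the inherited allow-all default.
    writes = [(os.path.join(cgroup_dir, "devices.deny"), "a")]
    # Each allowed device is then added back with its own write.
    for entry in allowed_entries:
        writes.append((os.path.join(cgroup_dir, "devices.allow"), entry))
    return writes


def apply_writes(writes):
    """Perform the writes; needs root and a mounted cgroup v1 devices hierarchy."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)


# Assumed parent cgroup and entries, copied from the log in this report.
plan = plan_whitelist_writes(
    "/sys/fs/cgroup/devices/slurm",
    ["c 202:0 rwm", "c 5:2 rwm", "c 195:0 rwm", "c 195:1 rwm"],
)
for path, value in plan:
    print(path, "<-", value)
# apply_writes(plan)  # left commented out: only run deliberately, as root
```

According to the kernel document linked above, writing "a" to devices.allow or devices.deny is not possible once the cgroup has children, so this would presumably have to run before any job cgroups exist under that parent.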
You're welcome. Closing