Created attachment 7195
git patch against current git master branch

Background Info
===============

The devices cgroup implementation for which the task/cgroup plugin was written is now very old. E.g. as of CentOS 6.5 (kernel 2.6.32-431.el6) the modern implementation of the devices cgroup had been backported and replaced the "simple whitelist" implementation that had existed prior to that. Through the life of EL6, Red Hat cited the devices cgroup as a technology preview.

The older implementation in the kernel had an "allow all" rule present by default, which would be inherited by child cgroups. Any entry written to devices.allow would be added to the whitelist (or augment a matching existing entry); any entry written to devices.deny would _remove a matching rule from the whitelist_. So the two files/behaviors would have been better called "devices.add" and "devices.remove" rather than allow versus deny.

Thus, on e.g. RHEL kernels prior to 2.6.32-431.el6 the task/cgroup devices functionality would not have worked at all: to effect any change in behavior, the "allow all" rule would first have to be removed

    xcgroup_set_param(&job_devices_cg, "devices.deny", "a *:* rwm");

before rules written to devices.allow would actually limit access. Even with this change, GRES-based rules written to devices.deny would be NOOPs since a matching whitelist rule would not have existed in the cgroup.

The modern implementation of the devices cgroup has adopted a "default disposition" augmented by a list of exceptions. The hierarchy defaults to a disposition of "allow all" with an empty exception list; writing rules to devices.allow has no effect, while writing rules to devices.deny adds exceptions. On the other hand, writing "a *:* rwm" to devices.deny clears the exception list and flips the disposition to "deny all", which also flips the effect of the two files: rules written to devices.allow now add exceptions and rules written to devices.deny have no effect.
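The modern semantics described above can be pictured with a small toy model in Python. This is only a sketch under stated assumptions, not the kernel's implementation: the class name DevicesCgroup and its methods are hypothetical, rules are compared as exact strings (no wildcard or per-permission matching), and two behaviors are assumed rather than taken from the description above: that writing an "a" rule to devices.allow resets the cgroup to "allow all", and that a rule agreeing with the current disposition drops a matching exception.

```python
from dataclasses import dataclass, field


@dataclass
class DevicesCgroup:
    """Toy model: a default disposition plus a set of exceptions."""

    default_allow: bool = True
    exceptions: set = field(default_factory=set)

    def write_allow(self, rule: str) -> None:
        """Model a write to devices.allow."""
        self._write(rule, allow=True)

    def write_deny(self, rule: str) -> None:
        """Model a write to devices.deny."""
        self._write(rule, allow=False)

    def _write(self, rule: str, allow: bool) -> None:
        if rule.split()[0] == "a":
            # An "a *:* rwm" rule clears the exception list and sets the
            # disposition (the devices.allow direction is an assumption).
            self.exceptions.clear()
            self.default_allow = allow
        elif allow == self.default_allow:
            # The rule agrees with the disposition, so there is nothing to
            # add. Assumption: a matching exception, if any, is dropped.
            self.exceptions.discard(rule)
        else:
            # The rule contradicts the disposition: record an exception.
            self.exceptions.add(rule)

    def permits(self, rule: str) -> bool:
        """A device is permitted if the disposition allows it, unless excepted."""
        return self.default_allow != (rule in self.exceptions)
```

In this model, calling write_allow("c 1:3 rwm") on a fresh cgroup changes nothing, which mirrors why allow rules alone cannot restrict access under an "allow all" disposition. After write_deny("a *:* rwm"), the same call adds an exception and becomes the only thing that grants access.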
The devices cgroup has always allowed finer-grained control over _which_ permissions are allowed/denied: the "rwm" suffix shown above indicates [r]ead, [w]rite, or [m]knod permissions to be allowed or denied. To date, the task/cgroup plugin has assumed allow/deny of all three permissions.

Extended Grammar
================

This patch augments the existing grammar of the cgroup_allowed_devices_file.conf. To date, the plugin:

  - blindly attempted a glob() on _every_ line in the file
  - from that file produced an in-memory list of strings to be written to devices.allow, with a fixed limit of PATH_MAX entries in a statically-allocated pointer array of that size
  - explicitly augmented the device list on both the job cgroup _and_ the step cgroup (which is a child of the job cgroup and thus inherits its disposition and exception list)
  - added all devices from cgroup_allowed_devices_file.conf with "rwm" permissions

The patch ignores blank lines and comment lines. Comment lines use the hash (#) character à la Bash et al.

The patch continues to accept lines containing paths/path patterns with implied "allow, rwm" disposition and permission.

Lines containing paths/path patterns can be prefixed with a plus (+) or minus (-) to indicate the disposition of the exception -- basically whether the rule is written to devices.allow or devices.deny, respectively. An optional permissions mask can follow the plus or minus using the characters [r]ead, [w]rite, or [m]knod. When a prefix is used, whitespace should appear between the disposition/permissions specification and the path/path pattern.

The special path "all" or "*" is used for "all devices" rules.
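The grammar described above can be sketched as a small line parser. This is an illustration in Python, not the patch's C code: DeviceRule and parse_line are hypothetical names, and the sketch assumes that a prefix without the following whitespace is simply treated as part of a bare path.

```python
import re
from typing import NamedTuple, Optional


class DeviceRule(NamedTuple):
    target: str   # "devices.allow" or "devices.deny"
    perms: str    # subset of "rwm"
    pattern: str  # path, glob pattern, or "*" for all devices


# Optional "+"/"-" disposition, optional permission mask, then whitespace
# and the path or path pattern.
_PREFIX = re.compile(r"^([+-])([rwm]*)\s+(\S.*)$")


def parse_line(line: str) -> Optional[DeviceRule]:
    """Parse one configuration line; return None for blanks and comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None

    # Compatibility default: a bare path means "allow, rwm".
    target, perms, pattern = "devices.allow", "rwm", text

    match = _PREFIX.match(text)
    if match:
        target = "devices.allow" if match.group(1) == "+" else "devices.deny"
        perms = match.group(2) or "rwm"  # omitted mask implies rwm
        pattern = match.group(3)

    # "all" and "*" both denote the "all devices" rule.
    if pattern == "all":
        pattern = "*"
    return DeviceRule(target, perms, pattern)
```

Under these assumptions, parse_line("-rwm all") yields a devices.deny rule for "*" with permissions "rwm", while a legacy line such as "/dev/null" yields a devices.allow rule with "rwm".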
For example, a cgroup_allowed_devices_file.conf using the extended grammar could contain:

    #
    # Comment lines are ignored, as are blank lines
    #

    # Change default behavior to deny:
    -rwm all

    # Add back devices needed by all jobs:
    +rwm /dev/null
    +rwm /dev/urandom
    +rwm /dev/zero
    +rwm /dev/sda*
    +rwm /dev/cpu/*/*
    +rwm /dev/pts/*

    # For nVidia UVM and query capabilities:
    +rwm /dev/nvidia-uvm*
    +rwm /dev/nvidiactl

(The use of "rwm" is illustrative and could have been omitted. On the allow lines, the prefix "+rwm" or "+" could also be omitted completely, since both equate to the default [compatibility] behavior of the plugin.)

On both the older and modern devices cgroup implementations, the configuration above would whitelist that specific set of devices and -- when allocated to the job -- additional GRES devices would also be whitelisted; any GRES devices not allocated to the job would be unusable (denied).

On an OS with the modern implementation of the devices cgroup (e.g. CentOS 6.5 and up, CentOS 7), since the default disposition is to allow, the cgroup_allowed_devices_file.conf file can remain empty; any GRES devices not allocated to the job would be written to devices.deny as exceptions to the "allow all" disposition and be unusable.

Additional Changes
==================

The patch removes the production of in-memory copies of the devices and whitelist rules. It instead applies rules to the job cgroup as they are parsed from the cgroup_allowed_devices_file.conf file. Since the step (child) cgroup is created after the job cgroup has been configured, it will inherit an appropriate baseline device access list anyway.

The patch also includes more error checking as the plugin writes rules to devices.allow and devices.deny. Failure to fully augment the job cgroup's disposition and exception list results in plugin failure.
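To illustrate applying a rule as soon as it is parsed, with errors treated as fatal, here is a minimal Python sketch. It assumes a device path is translated into a "type major:minor perms" entry by inspecting the device node; the function names device_entry and apply_rule and the cgroup_dir argument are hypothetical and do not come from the patch.

```python
import os
import stat


def device_entry(path: str, perms: str = "rwm") -> str:
    """Build a devices cgroup entry such as "c 1:3 rwm" for a device node."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "c"  # character device
    elif stat.S_ISBLK(st.st_mode):
        kind = "b"  # block device
    else:
        raise ValueError(f"{path} is not a device node")
    return f"{kind} {os.major(st.st_rdev)}:{os.minor(st.st_rdev)} {perms}"


def apply_rule(cgroup_dir: str, target: str, entry: str) -> None:
    """Write one entry to devices.allow or devices.deny under cgroup_dir.

    Any OSError propagates to the caller, so a rule that cannot be
    written stops processing instead of being silently skipped.
    """
    with open(os.path.join(cgroup_dir, target), "w") as handle:
        handle.write(entry)
```

A caller would expand each path pattern with glob, call device_entry on every match, and pass the result straight to apply_rule, so no intermediate list of rules needs to be held in memory.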
Updating metadata. This won't be in 18.08 but something of this nature may be usable in 19.05.

I believe we did - although without necessarily documenting it correctly - change it so that this file is no longer required.

There is further cleanup of these cgroup subsystems planned at some future date, and this is best considered in light of that later refactoring.

- Tim
Hi Jeff,

I know it's been a long time since you proposed this enhancement. We are currently working on this part of the code and were evaluating whether it is worth keeping cgroup_allowed_devices_file.conf. At the moment the file does nothing: the default policy is to allow everything and we never flip that policy, so adding "allow" exceptions when all devices are already allowed is useless.

Your patch is interesting and it would really work, but thinking further about possible use cases, I cannot find one where we would want such a file. Do you have any example where it would be useful?

If the sysadmin wants to deny some devices by default, there are other ways to do so that do not rely on Slurm. Denying devices only in the scheduler is possible from gres.conf, and if the device is outside gres.conf then there is no point in always imposing the restriction through the scheduler, since it is a static one.

At the moment I have no strong reason to keep this file, but before dropping it I would like your opinion, since you did an interesting refactor of the logic in this patch.

Thanks!
Hi Jeff,

Unfortunately we haven't seen any relevant use case that would benefit from adding this kind of device-handling logic. We encourage people to use gres.conf for managing access to devices, and if a device must be denied at the system level, to use system facilities rather than the job scheduler.

A deprecation process has started and the file will no longer have any effect starting with 22.05:

commit ddad85fbed84428bb0257b67af863b2aa0744c2e

Best regards.