| Summary: | Cgroup enhancements | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Moe Jette <jette> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | | |
| Priority: | --- | CC: | da, kilian, ryan_cox |
| Version: | 15.08.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | BYU - Brigham Young University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.5 16.05.0-pre1 | Target Release: | 16.05 |
| DevPrio: | 3 - High | Emory-Cloud Sites: | --- |
Attachments:
- pam_slurm_adopt_2.diff
- pam_slurm_adopt_3.diff
Description
Moe Jette
2013-10-25 11:32:24 MDT
The Slurm team at Bull has been planning to implement a PAM cgroups feature for some time. I'm not sure if this is consistent with your proposal above, but here is a description of what we plan to implement. We're not planning to do anything with sshrc. Our initial goal is to use cgroups with PAM to restrict compute node login access to users with Slurm resources (CPUs) allocated on that node, and to restrict login sessions to the set of CPUs allocated to the user. This is similar to the functionality provided by the code in contribs/pam. Any new use of cgroups in Slurm needs to be compatible with the existing cgroups code in the proctrack, task and jobacct_gather plugins.

The new feature will work as follows. There will be a new plugin, PAMPlugin=pam/cgroup, loaded by slurmd. The plugin API defines the following functions:

- pam_g_add_user_resources()
- pam_g_delete_user_resources()

pam_g_add_user_resources() will be called when a job is allocated resources, on each node in the allocation, and will do the following:

- Create cpuset cgroups for the job and user (the user cgroup may already exist). By default, the path will be /cgroup/cpuset/slurm/user_%uid/job_%jobid.
- Add the set of CPUs allocated to the job on this node to cpuset.cpus in the user and job cgroups.

pam_g_delete_user_resources() will be called when a job terminates, and will do the following:

- Delete the job cgroup and update cpuset.cpus in the user cgroup from the remaining jobs for this user, if any.
- If there are no remaining jobs, kill any tasks attached to the user cgroup (i.e. active logins for this user) and delete the user cgroup.

The plugin will also contain an implementation of pam_sm_acct_mgmt(), which will do the following:

- Get the user from the PAM handle.
- If the user does not have a user cpuset cgroup on this node, return a denied status; otherwise attach the PID to the user cpuset cgroup and return an allowed status.

There will be code to build a new PAM Slurm library (pam_slurm_cgroup.so) containing this implementation of pam_sm_acct_mgmt. Admins can then include this library in the PAM configuration file under /etc/pam.d for the appropriate services (e.g. sshd).
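For illustration only, here is a rough shell sketch of the cgroup v1 filesystem operations the proposed plugin would perform. The paths follow the defaults described above; uidnumber, jobid and alloc_cpus are placeholders for values slurmd would supply, and the real plugin would of course do this in C rather than shell.

```sh
#!/bin/sh
# Sketch only (not the actual plugin code): the cgroup v1 filesystem
# operations behind the proposed pam_g_add_user_resources() and
# pam_sm_acct_mgmt(), using the default path layout described above.
# uidnumber, jobid and alloc_cpus are placeholders for values slurmd
# would supply; PPID stands in for the PID of the incoming login.
cgroot=/cgroup/cpuset/slurm
user_cg=$cgroot/user_$uidnumber
job_cg=$user_cg/job_$jobid

# pam_g_add_user_resources(): create the user and job cpuset cgroups and
# record the CPUs allocated to the job on this node. cpuset also requires
# cpuset.mems to be populated before any task can be attached.
mkdir -p "$job_cg"
cat "$cgroot/cpuset.mems" > "$user_cg/cpuset.mems"
cat "$cgroot/cpuset.mems" > "$job_cg/cpuset.mems"
echo "$alloc_cpus" > "$user_cg/cpuset.cpus"   # e.g. "0-7"
echo "$alloc_cpus" > "$job_cg/cpuset.cpus"

# pam_sm_acct_mgmt(): deny the login if the user has no cgroup on this
# node, otherwise attach the incoming session to the user cgroup.
[ -d "$user_cg" ] || exit 1          # denied
echo $PPID > "$user_cg/tasks"
exit 0                               # allowed
```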
Martin,

Your plan does implement a subset of my proposal. However, I'm wondering if there is a reason why only the cpuset cgroup is planned? If you're creating a per-user cgroup, why not also restrict the memory? Where it differs is that it doesn't allow for accounting as normal. "Adopting" a process through a call from sshrc would allow for that, since sshrc does have access to the SLURM_* variables if the ssh{,d}_config files allow for it. At that point you know what job the process belongs to.

Contents of recent relevant emails:

Andy Wettstein <wettstein@uchicago.edu> writes:

> I use the pam_exec module to blindly put an ssh session into the user's most recent task cgroup. This does seem to work fine for users that just log in to a node where their job is running, but multinode jobs have severe limitations. The cgroup isn't created unless a slurm task has actually been started. For programs that launch with ssh, this doesn't work at all, really. If multinode jobs worked correctly, then it would obviously be better to somehow detect the slurm job id and use that.

A fully functional 'pam_slurm' module would do just that (it has to check for an active allocation on the node anyway), and would create the uid cgroup as needed if one does not already exist. This is how the old pam_slurm_cpuset[1] operated, and eventually the cgroups code was supposed to work similarly. (You also have to make sure resources are added to and subtracted from the UID cgroups as jobs are created and destroyed.)

[1] https://code.google.com/p/slurm-spank-plugins/wiki/CPUSET

mark

> Here is what I currently use for pam_exec:

```sh
#!/bin/sh
[ "$PAM_USER" = "root" ] && exit 0
[ "$PAM_TYPE" = "open_session" ] || exit 0

. /etc/sysconfig/slurm
squeue=$BINDIR/squeue
if [ ! -x $squeue ]; then
    exit 0
fi

uidnumber=$(id -u $PAM_USER)
host=$(hostname -s)

# last job the user started is where these tasks will go
jobid=$($squeue --noheader --format=%i --user=$PAM_USER --node=localhost | tail -1)
[ -z "$jobid" ] && exit 0

for system in freezer cpuset; do
    cgdir=/cgroup/$system/slurm_$host
    # if the cgdir doesn't exist skip it
    [ -d $cgdir ] || continue
    # first job step is where we'll put these tasks
    cgtasks=$(find $cgdir/uid_$uidnumber/job_$jobid -mindepth 2 -type f -name tasks -print -quit)
    [ -f $cgtasks ] && echo $PPID > $cgtasks
done

exit 0
```

> On Wed, Jan 01, 2014 at 09:43:55PM -0800, Christopher Samuel wrote:
>
> > At SC13 a few of us were talking about the issue of what to do when you have to allow users to SSH into nodes where their jobs are running. Currently SLURM doesn't put SSH sessions permitted by the current pam_slurm module into any control groups; it would be nice if it would at least put them into the top-level /cgroup/cpuset/slurm/uid_$UID for the user so they can only affect their own jobs.
> >
> > The problem is that with SSH's privilege separation the current SLURM PAM module cannot do this, as it will run as an unprivileged process (not the user) prior to authentication and so cannot have the permissions to move processes around. However, it appears that something like a pam_slurm *session* library would (I believe) run as the user in question, as long as it can learn the PID of the shell being spawned. The problem then is how to allow that to securely put itself into the user's top-level cgroup. If slurmd just made the tasks file for that top-level cgroup owned by the user then the process could do it, though there would be a small risk that the user could then move processes from lower, insulated containers into the top level, though they'd only be affecting their own stuff.
> >
> > Thoughts?
> >
> > All the best,
> > Chris
> >
> > --
> > Christopher Samuel, Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: samuel@unimelb.edu.au  Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/  http://twitter.com/vlsci

> --
> andy wettstein
> hpc system administrator
> research computing center
> university of chicago
> 773.702.1104

For related work, see bug 1593: http://bugs.schedmd.com/show_bug.cgi?id=1593

Created attachment 2330 [details]
pam_slurm_adopt_2.diff

Here is the updated code for pam_slurm_adopt.c that was submitted in bug 1593. Once tested and merged, this should resolve this bug report. I have tested it and it Works For Me(TM). The README shows all the options for the pam module.

Since this will likely be the most debated topic, I will clarify some things about the "action_unknown" option. This is what happens when 1) the user has more than one job on the node and 2) the RPC call cannot identify which job the process belongs to. This is almost exclusively a problem when the user tries to connect directly from a login node to a compute node on which that user has multiple jobs running.

- any* = Pick a job in a (somewhat) random fashion.
  The user can ssh in but may be adopted into a job that exits earlier than the job they intended to check on. The ssh connection will at least be subject to appropriate limits, and the user can be informed of better ways to accomplish their objectives if this becomes a problem.
- user = Use the /slurm/uid_$UID cgroups. Not all cgroups set appropriate limits at this level, so this may not be very effective. Additionally, job accounting at this level is impossible, as is automatic cleanup of stray processes when the job exits.
- allow = Let the connection through without adoption.
- deny = Deny the connection.

"any" seems to be the most reasonable default for now. It ensures that the user's process is limited by cgroups and can be cleaned up by Slurm. Its usage is also accounted for. It does have the downside of unpredictability, to some degree, but IMO it's no worse than denying the connection. From a user's perspective, they won't know if they have a single job or multiple jobs on a node unless they explicitly check for it. If you deny the connection, they will "randomly" not be able to log into some nodes and there's nothing they can do about it. If "any" is set, they will be able to get in but may "randomly" get kicked out when the job exits. To me, it seems better to be randomly kicked off than randomly denied access. Again, they're almost certainly logging in directly from a login node to run something like top or strace, so it shouldn't take that long, hopefully.

I did implement Matthieu's idea to adopt into the /slurm/uid_$UID cgroups (action_unknown=user), but that has the limitation that the memory limits aren't aggregated (unless we have something misconfigured here), so it's almost worse than nothing at the moment, IMO. I do like the idea of choosing the most recently started job or the job with the longest remaining time, but that seems harder to implement since I don't think the slurmd has easy access to that information without querying the ctld, correct? If so, that seems pretty expensive, but at least it should be a rare thing.

Also, the extern step seems to stay around, and any ssh-launched processes live on when a job exits. The extern cgroups do appear to be cleaned up after the processes exit. At some point I may also make an HTML page for documentation.

(In reply to Ryan Cox from comment #5)
> I do like the idea of choosing the most recently started job or the job with
> the longest remaining time but that seems harder to implement since I don't
> think the slurmd has easy access to that information without querying the
> ctld, correct? If so, that seems pretty expensive but at least it should be
> a rare thing.

We don't have the expected job end times available in slurmd, but those would just be guesses anyway. We could stat() the cgroup directories to see what was most recently created; that seems like the best option using readily available information. In any case, it comes down to guesswork as to what will be "best".
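To illustrate the stat()-based idea, here is a rough sketch (not the module's actual implementation) that picks the user's most recently modified job cgroup by directory mtime; the uid_/job_ path layout is assumed to match the pam_exec script quoted earlier.

```sh
#!/bin/sh
# Sketch only: select the user's most recently modified job cgroup by
# mtime, roughly the "pick the newest job" heuristic discussed above.
# The uid_/job_ layout is assumed to match the paths quoted earlier.
uidnumber=$(id -u "$PAM_USER")
newest=$(ls -1dt /cgroup/cpuset/slurm/uid_"$uidnumber"/job_* 2>/dev/null | head -1)
[ -n "$newest" ] || exit 1   # no job cgroups found for this user
echo "newest job cgroup: $newest"
```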
Created attachment 2336 [details]
pam_slurm_adopt_3.diff
That was a good idea. I replaced "any" with "newest". If for some reason "any" is desirable, it would be easy to add back. It compares the cgroup mtimes to pick the newest job.
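For reference, wiring the module into a PAM stack would look something like the line below. The module name and the action_unknown option come from the discussion above; the exact service file (e.g. /etc/pam.d/sshd), control flag, and default option values are site- and version-dependent, so check the module's README.

```
account    required    pam_slurm_adopt.so action_unknown=newest
```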
One thing I have noticed is that processes in the step_extern cgroups are not cleaned up automatically. I do not know if this is intentional or not. I could easily add something to the epilog to kill all tasks in step_extern/tasks if needed, but it would be nice if they were cleaned up automatically; otherwise stray tasks can be left behind. Additionally, I'm not seeing any accounting for the extern step through sacct or in the mysql database. Bug 1593 comment 9 makes me think it is meant to be there. Should I file separate bugs for those? That aside, pam_slurm_adopt has handled our testing very well and we are now running it in production. We'll see if users find any bugs.

I ended up filing separate bugs: bug 2096 and bug 2097. Also, pam_slurm_adopt has run in production for a week now with no known issues so far. We have seen a fair amount of ssh traffic in various scenarios that has exercised different parts of the code.

Hey Ryan, I have committed this to the 15.08 branch. I haven't been able to test it as much as I wanted, but hope to do that soon. I'll check out the new bugs as well.

Setting DevPrio and TargetRelease flags.

I think this is fixed; please reopen if you feel otherwise.
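As an interim workaround for the step_extern cleanup issue mentioned above, an epilog fragment along these lines could signal any leftover tasks. This is a rough sketch: the cgroup mount point and uid_/job_ layout are assumed to match the paths discussed earlier in the thread, and it assumes the epilog environment provides SLURM_JOB_UID and SLURM_JOB_ID.

```sh
#!/bin/sh
# Sketch only: epilog fragment that kills any processes still attached to
# the job's step_extern cgroup, per the cleanup idea mentioned above.
# The freezer mount point and uid_/job_ layout are assumed to match the
# cgroup paths discussed earlier in this thread.
tasks=/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/step_extern/tasks
if [ -f "$tasks" ]; then
    for pid in $(cat "$tasks"); do
        kill -9 "$pid" 2>/dev/null
    done
fi
exit 0
```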