I am turning this email into a trouble ticket before it gets accidentally deleted. Here is some information about how we do cgroups on our login nodes and my thoughts on how to set up cgroups better in SLURM.

This is on our login nodes:

root@m7int02:~# cat /etc/ssh/sshrc
# must do this for X forwarding purposes
# reads from stdin so be careful what you place before it
# see sshd manpage for details under SSHRC
if read proto cookie && [ -n "$DISPLAY" ]; then
    if [ `echo $DISPLAY | cut -c1-10` = 'localhost:' ]; then
        # X11UseLocalhost=yes
        echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie
    else
        # X11UseLocalhost=no
        echo add $DISPLAY $proto $cookie
    fi | xauth -q -
fi

/usr/local/sbin/interactive_cgroups_assign_process $PPID

root@m7int02:~# cat /usr/local/sbin/interactive_cgroups_assign_process
#!/bin/bash
pid=$1
username=$(/usr/bin/stat -c %U /proc/$pid)
(("$UID"==0)) || (/bin/echo $pid > /cgroup/cpu/users/user_$username/tasks) >/dev/null 2>&1
(("$UID"==0)) || (/bin/echo $pid > /cgroup/memory/users/user_$username/tasks) >/dev/null 2>&1

root@m7int02:~# ls -l /cgroup/memory/users/user_ryancox/tasks
--w-rw---- 1 ryancox root 0 Aug 13 16:54 /cgroup/memory/users/user_ryancox/tasks

root@m7int02:~# grep SLURM /etc/ssh/*config
/etc/ssh/ssh_config:    SendEnv SLURM_*
/etc/ssh/sshd_config:AcceptEnv SLURM_*

Basically, we have a script that creates cgroups for all users in /etc/passwd every few minutes. It sets their "tasks" file to be owned by the user and write-only by the user. It also sets a memory and cpu.shares limit for the user. Users can't pull in processes from other users due to kernel restrictions, though they could reassign their own processes to a different cgroup that they have write access to. That's not really a huge concern for us on our login nodes since there is only one cgroup per subsystem that they have access to at all. /usr/local/sbin/interactive_cgroups_assign_process was designed so we can also cron something to go catch anything that didn't get assigned to a cgroup for whatever reason, so it has extra features like looking up the owner of the process rather than relying on $USER.

What I would like to see in SLURM is the following:

A PAM module creates the appropriate cgroup for the user exactly as it does now when launched under slurmd or slurmstepd. The PAM module would need to ask SLURM what the user has been allocated on that node and create the appropriate cgroup, just like it would through slurmstepd. I was unsure if PAM has access to environment variables passed via ssh, but my testing shows that it does not. If it does not have access to them, simply assign the ssh-launched process to slurm/uid_$uid. I think this is your current plan.

The slurm/uid_$uid cgroups should have the aggregated job allocation limits per user (memory, cpus, etc.) if the user has multiple jobs on the same node. So job1 (1GB memory) + job2 (2GB memory) results in a slurm/uid_$uid memory.limit_in_bytes of 3GB. When job1 exits, reduce it to 2GB.

I would like the pam module and the slurm code to create the cgroups and then run the following commands per tasks file (uid, job, step, task):

chown $USER $tasksfile
chmod 200 $tasksfile   # or something else, as long as the user can write

If you do this, you can use an sshrc file to assign the task to the appropriate cgroup (job, step, or task) as long as AcceptEnv and SendEnv are configured correctly, just like the examples above. This would work in the following scenarios:

1) Job is launched via sbatch or salloc.
From that script or shell, the user/code uses ssh to connect to node2 in the list. $SLURM_* variables are sent due to ssh SendEnv. PAM on node2 sees that the user has a job allocated on the node and creates slurm/uid_$uid/job_$job (and maybe step and task information?). PAM assigns the ssh process to the slurm/uid_$uid cgroups since it doesn't have access to the $SLURM_* variables. After the PAM stack is done, ssh calls sshrc. sshrc reassigns the user's processes (starting at $PPID) from the slurm/uid_$uid cgroups to the appropriate slurm/uid_$uid/job_$job cgroups based on $SLURM_JOB_ID.

2) Job is launched via sbatch or salloc. From a different shell completely, the user connects directly to node2 via ssh. Since $SLURM_JOB_ID, etc. were not set, sshrc is unable to assign the process to the correct slurm/uid_$uid/job_$job. However, PAM still assigns the process to slurm/uid_$uid. Even though we lose the per-job accounting, the user is still subject to aggregate job allocation limits on that node.

3) Job is launched via sbatch or salloc. sshrc is not set up and/or AcceptEnv and SendEnv are not set. Since $SLURM_JOB_ID, etc. were not set, sshrc is unable to assign the process to the correct slurm/uid_$uid/job_$job. However, PAM still assigns the process to slurm/uid_$uid. Even though we lose the per-job accounting, the user is still subject to aggregate job allocation limits on that node.

4) Job is launched via sbatch or salloc. The user maliciously sets $SLURM_JOB_ID to an incorrect value either in that shell or from a different host entirely, then uses ssh to connect to node2. In this case, PAM still assigns that process to slurm/uid_$uid. When sshrc runs, it tries to move that process to an incorrect slurm/uid_$uid/job_$job. It will fail since the targeted slurm/uid_$uid/job_$job won't allow the user to write there, since sshrc runs as the user.

We're still trying to determine whether ssh should pass anything more than $SLURM_JOB_ID or if it should send $SLURM_*. We are actually about to use the cgroup release mechanism to gather stats from the memory and cpuacct cgroups so we can store per job per node data, but that's a different conversation :)

Hopefully this makes sense.
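For illustration, a minimal sketch of the sshrc fragment that scenario 1 assumes might look like the following. This is not existing Slurm code; the cgroup path layout and the user-writable tasks files are assumptions taken from the proposal above.

# Hypothetical /etc/ssh/sshrc fragment, assuming the proposed layout
# /cgroup/<subsystem>/slurm/uid_<uid>/job_<jobid> and tasks files that
# slurmd has chowned to the user (chmod 200) as proposed above.
if [ -n "$SLURM_JOB_ID" ]; then
    uid=$(id -u)
    for subsys in cpu memory cpuset; do
        tasks=/cgroup/$subsys/slurm/uid_${uid}/job_${SLURM_JOB_ID}/tasks
        # A forged SLURM_JOB_ID (scenario 4) fails here because the
        # targeted tasks file is not writable by this user.
        [ -w "$tasks" ] && echo $PPID > "$tasks"
    done
fi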
The Slurm team at Bull has been planning to implement a PAM cgroups feature for some time. I'm not sure if this is consistent with your proposal above, but here is a description of what we plan to implement. We're not planning to do anything with sshrc.

Our initial goal is to use cgroups with PAM to restrict compute node login access to users with Slurm resources (cpus) allocated on that node, and to restrict login sessions to the set of cpus allocated to the user. This is similar to the functionality provided by the code in contribs/pam. Any new use of cgroups in Slurm needs to be compatible with the existing cgroups code in the proctrack, task and jobacct_gather plugins.

The new feature will work as follows:

There will be a new plugin, PAMPlugin=pam/cgroup. The plugin will be loaded by slurmd. The plugin API defines the following functions:

pam_g_add_user_resources()
pam_g_delete_user_resources()

pam_g_add_user_resources() will be called when a job is allocated resources, on each node in the allocation, and will do the following:
- Create cpuset cgroups for the job and user (the user cgroup may already exist). By default, the path will be /cgroup/cpuset/slurm/user_%uid/job_%jobid.
- Add the set of cpus allocated to the job on this node to cpuset.cpus in the user and job cgroups.

pam_g_delete_user_resources() will be called when a job terminates, and will do the following:
- Delete the job cgroup and update cpuset.cpus in the user cgroup from the remaining jobs for this user, if any.
- If there are no remaining jobs, kill any tasks attached to the user cgroup (i.e. active logins for this user) and delete the user cgroup.

The plugin will also contain an implementation of pam_sm_acct_mgmt(). This implementation will do the following:

Get user from PAM handle
If the user does not have a user cpuset cgroup on this node
    return denied status
else
    attach PID to user cpuset cgroup
    return allowed status

There will be code to build a new PAM slurm library (pam_slurm_cgroup.so) containing this implementation of pam_sm_acct_mgmt. Admins can then include this library in the PAM configuration file under /etc/pam.d for the appropriate services (e.g. sshd).
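In cgroup filesystem terms, the pam_g_add_user_resources() steps above would amount to roughly the following shell sketch. This is illustration only; the real plugin would do this in C inside slurmd, and the placeholder variables $JOB_UID, $JOB_ID and $JOB_CPUS are assumptions.

# Create the user and job cpuset cgroups for a job allocated cpus
# $JOB_CPUS (e.g. "0-3") on this node, using the default path above.
userdir=/cgroup/cpuset/slurm/user_${JOB_UID}
jobdir=$userdir/job_${JOB_ID}
mkdir -p "$jobdir"                       # user cgroup may already exist
# cpusets require cpuset.mems before tasks can be attached; inherit
# from the parent slurm cgroup (assumed to already be configured)
cat /cgroup/cpuset/slurm/cpuset.mems > "$userdir/cpuset.mems"
cat "$userdir/cpuset.mems" > "$jobdir/cpuset.mems"
# the user cgroup holds the union of cpus across the user's jobs on this
# node (a single job is shown here); the job cgroup must be a subset of it
echo "$JOB_CPUS" > "$userdir/cpuset.cpus"
echo "$JOB_CPUS" > "$jobdir/cpuset.cpus"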
Martin,

Your plan does implement a subset of my proposal. However, I'm wondering if there is a reason why only the cpuset cgroup is planned? If you're creating a per-user cgroup, why not also restrict the memory?

Where your plan differs from my proposal is that it doesn't allow for the usual per-job accounting. "Adopting" a process through a call from sshrc would allow for that, since sshrc does have access to the SLURM_* variables if the ssh{,d}_config files allow for it. At that point you know which job the process belongs to.
Contents of recent relevant emails:

Andy Wettstein <wettstein@uchicago.edu> writes:

I use the pam_exec module to blindly put an ssh session into the user's most recent task cgroup. This does seem to work fine for users that just log in to a node where their job is running, but multinode jobs have severe limitations. The cgroup isn't created unless a slurm task has actually been started. For programs that launch with ssh, this doesn't work at all really. If multinode jobs worked correctly, then it would obviously be better to somehow detect the slurm job id and use that.

A fully functional 'pam_slurm' module would do just that (it has to check for an active allocation on the node anyway), and would create the uid cgroup as needed if one does not already exist. This is how the old pam_slurm_cpuset [1] operated, and eventually the cgroups code was supposed to work similarly. (You also have to make sure resources are added and subtracted from the UID cgroups as jobs are created and destroyed.)

[1] https://code.google.com/p/slurm-spank-plugins/wiki/CPUSET

mark

Here is what I currently use for pam_exec:

#!/bin/sh

[ "$PAM_USER" = "root" ] && exit 0
[ "$PAM_TYPE" = "open_session" ] || exit 0

. /etc/sysconfig/slurm
squeue=$BINDIR/squeue

if [ ! -x $squeue ]; then
    exit 0
fi

uidnumber=$(id -u $PAM_USER)
host=$(hostname -s)

# last job the user started is where these tasks will go
jobid=$($squeue --noheader --format=%i --user=$PAM_USER --node=localhost | tail -1)
[ -z "$jobid" ] && exit 0

for system in freezer cpuset; do
    cgdir=/cgroup/$system/slurm_$host
    # if the cgdir doesn't exist skip it
    [ -d $cgdir ] || continue
    # first job step is where we'll put these tasks
    cgtasks=$(find $cgdir/uid_$uidnumber/job_$jobid -mindepth 2 -type f -name tasks -print -quit)
    [ -f $cgtasks ] && echo $PPID > $cgtasks
done

exit 0

On Wed, Jan 01, 2014 at 09:43:55PM -0800, Christopher Samuel wrote:

At SC13 a few of us were talking about the issue of what to do when you have to allow users to SSH into nodes where their jobs are running. Currently SLURM doesn't put SSH sessions permitted by the current pam_slurm module into any control groups; it would be nice if it would at least put them into the top-level /cgroup/cpuset/slurm/uid_$UID for the user so they can only affect their own jobs.

The problem is that with SSH's privilege separation the current SLURM PAM module cannot do this, as it will run as an unprivileged process (not the user) prior to authentication and so cannot have the permissions to move processes around. However, it appears that something like a pam_slurm *session* library would (I believe) run as the user in question, as long as it can learn the PID of the shell being spawned. The problem then is how to allow that to securely put itself into the user's top-level cgroup. If slurmd just made the tasks file for that top-level cgroup owned by the user then the process could do it, though there would be a small risk that the user could then move processes from lower, insulated containers into the top level, though they'd only be affecting their own stuff.

Thoughts?

All the best,
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci

--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
For related work, see bug 1593 http://bugs.schedmd.com/show_bug.cgi?id=1593
Created attachment 2330
pam_slurm_adopt_2.diff

Here is the updated code for pam_slurm_adopt.c that was submitted in bug 1593. Once tested and merged, this should resolve this bug report. I have tested it and it Works For Me(TM). The README shows all the options for the pam module.

Since this will likely be the most debated topic, I will clarify some things about the "action_unknown" option. This is what happens when 1) the user has more than one job on the node and 2) the RPC call cannot identify what job the process belongs to. This is almost exclusively a problem when the user tries to directly connect from a login node to a compute node (on which that user has multiple jobs running).

any* = Pick a job in a (somewhat) random fashion. The user can ssh in but may be adopted into a job that exits earlier than the job they intended to check on. The ssh connection will at least be subject to appropriate limits, and the user can be informed of better ways to accomplish their objectives if this becomes a problem.
user = Use the /slurm/uid_$UID cgroups. Not all cgroups set appropriate limits at this level so this may not be very effective. Additionally, job accounting at this level is impossible, as is automatic cleanup of stray processes when the job exits.
allow = Let the connection through without adoption.
deny = Deny the connection.

"any" seems to be the most reasonable default for now. It ensures that the user's process is limited by cgroups and can be cleaned up by Slurm. Its usage is also accounted for. It does have the downside of unpredictability, to some degree, but IMO it's no worse than denying the connection. From a user's perspective, they won't know if they have a single job or multiple jobs on a node unless they explicitly check for it. If you deny the connection, they will "randomly" not be able to log into some nodes and there's nothing they can do about it. If "any" is set, they will be able to get in but may "randomly" get kicked out when the job exits. To me, it seems better to be randomly kicked off than randomly denied access. Again, they're almost certainly logging in directly from a login node to run something like top or strace, so it shouldn't be that long, hopefully.

I did implement Matthieu's idea to adopt into the /slurm/uid_$UID cgroups (action_unknown=user), but that has the limitation that the memory limits aren't aggregated (unless we have something misconfigured here), so it's almost worse than nothing at the moment, IMO. I do like the idea of choosing the most recently started job or the job with the longest remaining time, but that seems harder to implement since I don't think the slurmd has easy access to that information without querying the ctld, correct? If so, that seems pretty expensive, but at least it should be a rare thing.

Also, the extern step seems to stay around and any ssh-launched processes live on when a job exits. The extern cgroups do appear to be cleaned up after the processes exit.

At some point I may also make an HTML page for documentation.
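For context, wiring the module into sshd would look something like the fragment below. This is a sketch only: it assumes the module builds as pam_slurm_adopt.so and is used as an account module like the existing pam_slurm; the README in the patch is the authoritative reference for the option names and defaults.

# /etc/pam.d/sshd (fragment) -- assumed module name and option syntax
account    required    pam_slurm_adopt.so action_unknown=any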
(In reply to Ryan Cox from comment #5)
> I do like the idea of choosing the most recently started job or the job with
> the longest remaining time but that seems harder to implement since I don't
> think the slurmd has easy access to that information without querying the
> ctld, correct? If so, that seems pretty expensive but at least it should be
> a rare thing.

We don't have the expected job end times available in slurmd, but those would just be guesses anyway. We could stat() the cgroup directories to see what was most recently created; that seems like the best option using readily available information. In any case, it comes down to guesswork about what will be "best".
Created attachment 2336
pam_slurm_adopt_3.diff

That was a good idea. I replaced "any" with "newest". If for some reason "any" is desirable, it would be easy to add back. It compares the cgroup mtimes to pick the newest job.
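For illustration, the "newest" heuristic is roughly the shell equivalent of the following (the real comparison is done in C inside the module via stat(); the memory subsystem, the mount point, and the use of $PAM_USER here are assumptions):

# Pick the user's most recently created job cgroup on this node by
# comparing directory mtimes; that job is the one the process is adopted into.
uid=$(id -u "$PAM_USER")
newest=$(ls -td /cgroup/memory/slurm/uid_${uid}/job_* 2>/dev/null | head -1)
jobid=${newest##*job_}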
One thing I have noticed is that processes in the step_extern cgroups are not cleaned up automatically. I do not know if this is intentional or not. I could easily add something to the epilog to kill all tasks in step_extern/tasks, if needed, but it would be nice if it were cleaned up automatically. Otherwise stray tasks can be left behind.

Additionally, I'm not seeing any accounting for the extern step through sacct or in the mysql database. Bug 1593 comment 9 makes me think it is meant to be there. Should I file separate bugs for those?

That aside, pam_slurm_adopt has handled our testing very well and we are now running it in production. We'll see if users find any bugs.
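If the epilog workaround mentioned above becomes necessary, a minimal sketch could look like this (assuming the freezer hierarchy, the default cgroup mount point, and the step_extern directory name; SLURM_JOB_UID and SLURM_JOB_ID come from the epilog environment):

#!/bin/sh
# Hypothetical epilog fragment: kill anything still attached to the job's
# extern step cgroup so ssh-launched processes do not outlive the job.
tasks=/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/step_extern/tasks
if [ -f "$tasks" ]; then
    for pid in $(cat "$tasks"); do
        kill -9 "$pid" 2>/dev/null
    done
fi
exit 0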
I ended up filing separate bugs: bug 2096 and bug 2097. Also, pam_slurm_adopt has run in production for a week now with no known issues so far. We have seen a fair amount of ssh traffic in various scenarios that have exercised different parts of the code.
Hey Ryan,

I have committed this to the 15.08 branch. I haven't been able to test it as much as I wanted, but hope to do that soon. I'll check out the new bugs as well.
Setting DevPrio and TargetRelease flags.
I think this is fixed, please reopen if you feel otherwise.