| Summary: | callerid RPC / pam_slurm_adopt | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | Priority: | --- |
| Version: | 15.08.x | CC: | brian, da |
| Hardware: | Linux | OS: | Linux |
| Site: | BYU - Brigham Young University | Version Fixed: | 15.08.1-pre5 |
| Attachments: | callerid_initial.diff | | |
Description: Ryan Cox, 2015-04-09 08:45:22 MDT
Moe Jette:

Hi Ryan,

> It works though I'm sure it has some rough edges that you'll find.

This looks better than the vast majority of patches that we get. There were a number of minor changes that I needed to make for a clean build on my system, but only a couple of things that you would probably want to change if you are actually using it now.

The first patch below adds a log message when a caller_id RPC is issued by a non-authorized user. If someone were malicious, the information might be misused. The second patch initializes some variables to avoid issuing calls to free memory from random addresses off the stack, which could result in a SEGV. With the "free" calls that I added, there is also no need to explicitly test for non-zero values.

```diff
diff --git a/src/slurmd/slurmd/req.c b/src/slurmd/slurmd/req.c
index bc0b1fe..ceb6948 100644
--- a/src/slurmd/slurmd/req.c
+++ b/src/slurmd/slurmd/req.c
@@ -2850,7 +2850,9 @@ _rpc_network_callerid(slurm_msg_t *msg)
 	if (job_uid != req_uid) {
 		/* RPC call sent by non-root user who does not
 		 * own this job. Do not send them the job ID. */
-		job_id = (uint32_t)NO_VAL;
+		error("Security violation, REQUEST_NETWORK_CALLERID from uid=%d",
+		      req_uid);
+		job_id = NO_VAL;
 		rc = ESLURM_INVALID_JOB_ID;
 	}
 }
```

```diff
diff --git a/contribs/pam_slurm_adopt/pam_slurm_adopt.c b/contribs/pam_slurm_adopt/pam_slurm_adopt.c
index 8b48a3c..8af62b7 100644
--- a/contribs/pam_slurm_adopt/pam_slurm_adopt.c
+++ b/contribs/pam_slurm_adopt/pam_slurm_adopt.c
@@ -365,9 +365,9 @@ PAM_EXTERN int pam_sm_acct_mgmt(pam_handle_t *pamh, int flags
 	struct callerid_conn conn;
 	uint32_t job_id;
 	char ip_src_str[INET6_ADDRSTRLEN];
-	List steps;
+	List steps = NULL;
 	struct passwd pwd, *pwd_result;
-	char *buf;
+	char *buf = NULL;
 	int bufsize;

 	_init_opts();
@@ -524,12 +524,10 @@ PAM_EXTERN int pam_sm_acct_mgmt(pam_handle_t *pamh, int flags
 		debug("_indeterminate_multiple failed to find a job to adopt this into");
 	}

- cleanup:
-	if (steps)
-		list_destroy(steps);
-	if (buf)
-		xfree(buf);
-	return rc;
+cleanup:
+	FREE_NULL_LIST(steps);
+	xfree(buf);
+	return rc;
 }
```

Your commit is here: https://github.com/SchedMD/slurm/commit/3153612e6907e736158d5df3fc43409c7b2395eb

My commit, with the above changes and the cosmetic changes, is here: https://github.com/SchedMD/slurm/commit/f5c0c7d36cd7a0bbd8c607252fb4a49ebf81e622

FYI: I made a few more fixes related to warnings when building on a 32-bit system and errors reported by the CLANG tool (variables stored into but never read, etc.). All were minor and unlikely to ever cause a problem in practice.
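For context on the second patch: the bug class it fixes (an early jump to a cleanup label that frees pointers which were never assigned) can be reproduced with a few lines of standard C. The sketch below is illustrative only; `SAFE_FREE` is a hypothetical stand-in for the NULL-safe cleanup helpers (`xfree()`, `FREE_NULL_LIST`) used in the patch, and `lookup_user` is not a real Slurm or PAM function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a NULL-safe "free and reset" helper, playing
 * the role that xfree()/FREE_NULL_LIST fill in the patch above. */
#define SAFE_FREE(p) do { free(p); (p) = NULL; } while (0)

/* Copy "name" into *out; returns 0 on success, -1 on error. */
static int lookup_user(const char *name, char **out)
{
	/* BAD:  char *buf;        -- freed uninitialized on the early-exit path
	 * GOOD: char *buf = NULL; -- free(NULL) is defined to do nothing */
	char *buf = NULL;
	int rc = -1;

	if (name == NULL || *name == '\0')
		goto cleanup;		/* early exit before buf is ever assigned */

	buf = strdup(name);
	if (buf == NULL)
		goto cleanup;

	*out = strdup(buf);
	rc = (*out != NULL) ? 0 : -1;

cleanup:
	SAFE_FREE(buf);			/* no "if (buf)" test needed */
	return rc;
}

int main(void)
{
	char *out = NULL;

	(void) lookup_user("", &out);	/* exercises the early-exit path safely */
	(void) lookup_user("ryan", &out);
	SAFE_FREE(out);
	return 0;
}
```

Because free(NULL) is a no-op, the unconditional cleanup call is safe on both the error and success paths, which is why the explicit NULL tests could be dropped in the patch.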
Ryan Cox (comment #3):

Thanks, Moe. Danny mentioned a while back that you were thinking an "allocation step" is the best approach so that the pam module has a place to adopt processes into. Is that still the plan?

Moe Jette:

(In reply to Ryan Cox from comment #3)
> Thanks, Moe. Danny mentioned a while back that you were thinking an
> "allocation step" is the best approach so that the pam module has a place
> to adopt processes into. Is that still the plan?

Yes, although I can't say when that might happen.

Moe Jette (comment #5):

(In reply to Ryan Cox from comment #3)
> Thanks, Moe. Danny mentioned a while back that you were thinking an
> "allocation step" is the best approach so that the pam module has a place
> to adopt processes into. Is that still the plan?

I want to run a couple of ideas past you for this. My thought is that the creation of the "allocation step" would be triggered by a configuration of "PrologFlags=alloc". An administrator would need to configure Slurm in order to support this functionality.

It would be easiest to apply only to salloc and sbatch jobs (not srun). Would you also want this job step for srun jobs in order to identify resource use by something outside of the Slurm-spawned job? Or would having any job step available on the node be sufficient?

Would it be valuable to see resource use in this allocation step for accounting purposes? That certainly creates a lot more work than just having a cgroup on the node through the job's duration.

Ryan Cox (comment #6):

(In reply to Moe Jette from comment #5)
> My thought is that the creation of the "allocation step" would be triggered
> by a configuration of "PrologFlags=alloc". An administrator would need to
> configure Slurm in order to support this functionality.

I think that is probably a good idea. We don't currently use a prolog ourselves (we do use an epilog), so I don't have any experience with that flag. The only question I would have is whether that may adversely affect other sites' use of that flag. If you don't think it will, this seems like a good approach.

> It would be easiest to apply only to salloc and sbatch jobs (not srun).
> Would you also want this job step for srun jobs in order to identify
> resource use by something outside of the Slurm-spawned job? Or would having
> any job step available on the node be sufficient?

I was slightly confused about the differences in behavior of all of those options as they relate to batch steps, but I think I get what you mean now. In case I'm wrong, I'll state my understanding so that you can correct me if needed. IIRC, sbatch and salloc both create a batch step. When run within that batch step, srun or a compatible MPI will create a new step. If srun is run directly from a login node with no associated job, a new job is created and only one step is created (step 0 or something) which directly runs the user-specified program.

Due to the difference in behavior, you're thinking of creating an allocation step when using sbatch and salloc, but not when directly creating a job with srun. That approach should be fine. The adoption code could first try adopting processes into a specific allocation step and then fall back to using any available step for that batch job. That should be simple enough and would handle both sbatch/salloc and direct srun.

> Would it be valuable to see resource use in this allocation step for
> accounting purposes? That certainly creates a lot more work than just
> having a cgroup on the node through the job's duration.

It would be valuable to have the accounting information. We strongly prefer it, but would be okay without accounting if it's the difference between making it into 15.08 or not :)

Moe Jette:

(In reply to Ryan Cox from comment #6)
> (In reply to Moe Jette from comment #5)
> > My thought is that the creation of the "allocation step" would be
> > triggered by a configuration of "PrologFlags=alloc". An administrator
> > would need to configure Slurm in order to support this functionality.
>
> I think that is probably a good idea. We don't currently use a prolog
> ourselves (we do use an epilog), so I don't have any experience with that
> flag. The only question I would have is whether that may adversely affect
> other sites' use of that flag. If you don't think it will, this seems like
> a good approach.

You would not need to have a Prolog, but I want to re-use some existing logic from there. I'll probably have some more questions as time goes on, but I'm hoping to have this done within a couple of weeks.

Ryan Cox:

Great. Thanks!
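The fallback order Ryan describes (adopt into the job's dedicated allocation step when one exists, otherwise into any step the job already has on the node) is sketched below. The helper functions (`find_allocation_step`, `find_any_step`, `adopt_into_step`) are hypothetical stubs for illustration, not the pam_slurm_adopt or Slurm API; only the control flow is the point.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

typedef struct {
	uint32_t job_id;
	uint32_t step_id;
} step_ref_t;

/* Hypothetical stubs; a real module would query the local slurmd/stepds. */
static int find_allocation_step(uint32_t job_id, step_ref_t *out)
{
	(void) job_id; (void) out;
	return -1;			/* pretend no allocation step exists here */
}

static int find_any_step(uint32_t job_id, step_ref_t *out)
{
	out->job_id = job_id;
	out->step_id = 0;		/* pretend step 0 is running on this node */
	return 0;
}

static int adopt_into_step(pid_t pid, const step_ref_t *step)
{
	printf("adopting pid %d into %u.%u\n", (int) pid,
	       (unsigned) step->job_id, (unsigned) step->step_id);
	return 0;
}

static int adopt_process(pid_t pid, uint32_t job_id)
{
	step_ref_t step;

	/* Prefer the dedicated allocation step created at job start
	 * (salloc/sbatch jobs with the feature enabled). */
	if (find_allocation_step(job_id, &step) == 0)
		return adopt_into_step(pid, &step);

	/* Otherwise fall back to any step the job has on this node,
	 * which also covers jobs launched directly by srun. */
	if (find_any_step(job_id, &step) == 0)
		return adopt_into_step(pid, &step);

	return -1;			/* nothing to adopt into */
}

int main(void)
{
	return adopt_process(12345, 1001) == 0 ? 0 : 1;
}
```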
Moe Jette (comment #9):

I have code in the master branch to create a job container at job allocation time and account for resource usage in that container. You can enable it with the configuration option "PrologFlags=contain". You do not need a Prolog script to use this feature, but one will run if so configured. In the accounting records, the job appears as "#.extern" (as in "external": resource use outside of direct Slurm control). The cgroup will have a path that includes "step_extern", following the same model as "step_batch" and avoiding a huge numeric value. The numeric value is "INFINITE" or 0xFFFFFFFF, but I do not believe that you will need to deal with that numeric value anywhere. I'm still working on a few configuration-dependent issues in the regression tests, but the Slurm logic should be working fine. Let me know if you encounter any problems.

I'm going to close this bug and suggest that we continue discussions using bug 489, Cgroup enhancements: http://bugs.schedmd.com/show_bug.cgi?id=489

Ryan Cox:

(In reply to Moe Jette from comment #9)
> I have code in the master branch to create a job container at job
> allocation time and account for resource usage in that container. You can
> enable it with the configuration option "PrologFlags=contain". You do not
> need a Prolog script to use this feature, but one will run if so
> configured. In the accounting records, the job appears as "#.extern" (as in
> "external": resource use outside of direct Slurm control). The cgroup will
> have a path that includes "step_extern", following the same model as
> "step_batch" and avoiding a huge numeric value. The numeric value is
> "INFINITE" or 0xFFFFFFFF, but I do not believe that you will need to deal
> with that numeric value anywhere. I'm still working on a few
> configuration-dependent issues in the regression tests, but the Slurm logic
> should be working fine. Let me know if you encounter any problems.

Great! I'll try to test this early next week.

> I'm going to close this bug and suggest that we continue discussions using
> bug 489, Cgroup enhancements:
> http://bugs.schedmd.com/show_bug.cgi?id=489

Good idea. Thanks.
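A minimal sketch of the naming described above, assuming only what Moe states: with "PrologFlags=contain" set in slurm.conf, the allocation-time container appears in accounting as "#.extern", its cgroup directory is named "step_extern" rather than a numeric value, and the numeric step id behind it is INFINITE (0xFFFFFFFF). The mapping helper below is hypothetical and is not Slurm's cgroup plugin code.

```c
#include <stdint.h>
#include <stdio.h>

/* The numeric id mentioned above for the extern step (INFINITE). */
#define STEP_ID_EXTERN 0xFFFFFFFFu

/* Hypothetical helper: map a step id to its cgroup directory name. */
static void step_cgroup_name(uint32_t step_id, char *buf, size_t len)
{
	if (step_id == STEP_ID_EXTERN)
		snprintf(buf, len, "step_extern");	/* allocation container */
	else
		snprintf(buf, len, "step_%u", (unsigned) step_id);
}

int main(void)
{
	char name[32];

	step_cgroup_name(STEP_ID_EXTERN, name, sizeof(name));
	printf("%s\n", name);		/* prints: step_extern */

	step_cgroup_name(0, name, sizeof(name));
	printf("%s\n", name);		/* prints: step_0 */
	return 0;
}
```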