| Summary: | callerid RPC / pam_slurm_adopt | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | Priority: | --- |
| Version: | 15.08.x | CC: | brian, da |
| Hardware: | Linux | OS: | Linux |
| Site: | BYU - Brigham Young University | Version Fixed: | 15.08.1-pre5 |
| Attachments: | callerid_initial.diff | | |
Description: Ryan Cox, 2015-04-09 08:45:22 MDT
Moe Jette:

Hi Ryan,

> It works though I'm sure it has some rough edges that you'll find.

This looks better than the vast majority of patches that we get. There were a number of minor changes that I needed to make for a clean build on my system, but only a couple of things that you would probably want to change if you are actually using it now.

The first patch below adds a log message when a caller_id RPC is issued by a non-authorized user. If someone were malicious, the information might be misused. The second patch initializes some variables to avoid issuing calls to free memory from random addresses off the stack, which could result in a SEGV. With the "free" calls that I added, there is also no need to explicitly test for non-zero values.

```diff
diff --git a/src/slurmd/slurmd/req.c b/src/slurmd/slurmd/req.c
index bc0b1fe..ceb6948 100644
--- a/src/slurmd/slurmd/req.c
+++ b/src/slurmd/slurmd/req.c
@@ -2850,7 +2850,9 @@ _rpc_network_callerid(slurm_msg_t *msg)
 	if (job_uid != req_uid) {
 		/* RPC call sent by non-root user who does not
 		 * own this job. Do not send them the job ID. */
-		job_id = (uint32_t)NO_VAL;
+		error("Security violation, REQUEST_NETWORK_CALLERID from uid=%d",
+		      req_uid);
+		job_id = NO_VAL;
 		rc = ESLURM_INVALID_JOB_ID;
 	}
 }
```

```diff
diff --git a/contribs/pam_slurm_adopt/pam_slurm_adopt.c b/contribs/pam_slurm_adopt/pam_slurm_adopt.c
index 8b48a3c..8af62b7 100644
--- a/contribs/pam_slurm_adopt/pam_slurm_adopt.c
+++ b/contribs/pam_slurm_adopt/pam_slurm_adopt.c
@@ -365,9 +365,9 @@ PAM_EXTERN int pam_sm_acct_mgmt(pam_handle_t *pamh, int flags
 	struct callerid_conn conn;
 	uint32_t job_id;
 	char ip_src_str[INET6_ADDRSTRLEN];
-	List steps;
+	List steps = NULL;
 	struct passwd pwd, *pwd_result;
-	char *buf;
+	char *buf = NULL;
 	int bufsize;

 	_init_opts();
@@ -524,12 +524,10 @@ PAM_EXTERN int pam_sm_acct_mgmt(pam_handle_t *pamh, int flags
 		debug("_indeterminate_multiple failed to find a job to adopt this into");
 	}

- cleanup:
-	if (steps)
-		list_destroy(steps);
-	if (buf)
-		xfree(buf);
-	return rc;
+cleanup:
+	FREE_NULL_LIST(steps);
+	xfree(buf);
+	return rc;
 }
```

Your commit is here: https://github.com/SchedMD/slurm/commit/3153612e6907e736158d5df3fc43409c7b2395eb

My commit, with the above changes and the cosmetic changes, is here: https://github.com/SchedMD/slurm/commit/f5c0c7d36cd7a0bbd8c607252fb4a49ebf81e622

FYI: I made a few more fixes related to warnings when building on a 32-bit system and errors reported by the CLANG tool (variables stored into but never read, etc.). All were minor and unlikely to ever cause a problem in practice.
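For context on the second patch: the bug class it fixes (an early jump to a cleanup label that frees pointers which were never assigned) can be reproduced with a few lines of standard C. The sketch below is illustrative only; `SAFE_FREE` is a hypothetical stand-in for the NULL-safe cleanup helpers (`xfree()`, `FREE_NULL_LIST`) used in the patch, and `lookup_user` is not a real Slurm or PAM function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a NULL-safe "free and reset" helper, playing
 * the role that xfree()/FREE_NULL_LIST fill in the patch above. */
#define SAFE_FREE(p) do { free(p); (p) = NULL; } while (0)

/* Copy "name" into *out; returns 0 on success, -1 on error. */
static int lookup_user(const char *name, char **out)
{
	/* BAD:  char *buf;        -- freed uninitialized on the early-exit path
	 * GOOD: char *buf = NULL; -- free(NULL) is defined to do nothing */
	char *buf = NULL;
	int rc = -1;

	if (name == NULL || *name == '\0')
		goto cleanup;		/* early exit before buf is ever assigned */

	buf = strdup(name);
	if (buf == NULL)
		goto cleanup;

	*out = strdup(buf);
	rc = (*out != NULL) ? 0 : -1;

cleanup:
	SAFE_FREE(buf);			/* no "if (buf)" test needed */
	return rc;
}

int main(void)
{
	char *out = NULL;

	(void) lookup_user("", &out);	/* exercises the early-exit path safely */
	(void) lookup_user("ryan", &out);
	SAFE_FREE(out);
	return 0;
}
```

Because free(NULL) is a no-op, the unconditional cleanup call is safe on both the error and success paths, which is why the explicit NULL tests could be dropped in the patch.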
Ryan Cox (comment #3):

Thanks, Moe. Danny mentioned a while back that you were thinking an "allocation step" is the best approach so that the pam module has a place to adopt processes into. Is that still the plan?

Moe Jette:

(In reply to Ryan Cox from comment #3)
> Thanks, Moe. Danny mentioned a while back that you were thinking an
> "allocation step" is the best approach so that the pam module has a place
> to adopt processes into. Is that still the plan?

Yes, although I can't say when that might happen.

Moe Jette (comment #5):

(In reply to Ryan Cox from comment #3)
> Thanks, Moe. Danny mentioned a while back that you were thinking an
> "allocation step" is the best approach so that the pam module has a place
> to adopt processes into. Is that still the plan?

I want to run a couple of ideas past you for this. My thought is that the creation of the "allocation step" would be triggered by a configuration of "PrologFlags=alloc". An administrator would need to configure Slurm in order to support this functionality.

It would be easiest to apply only to salloc and sbatch jobs (not srun). Would you also want this job step for srun jobs in order to identify resource use by something outside of the Slurm-spawned job? Or would having any job step available on the node be sufficient?

Would it be valuable to see resource use in this allocation step for accounting purposes? That certainly creates a lot more work than just having a cgroup on the node through the job's duration.

Ryan Cox (comment #6):

(In reply to Moe Jette from comment #5)
> My thought is that the creation of the "allocation step" would be triggered
> by a configuration of "PrologFlags=alloc". An administrator would need to
> configure Slurm in order to support this functionality.

I think that is probably a good idea. We don't currently use a prolog ourselves (we do use an epilog), so I don't have any experience with that flag. The only question I would have is whether that may adversely affect other sites' use of that flag. If you don't think it will, this seems like a good approach.

> It would be easiest to apply only to salloc and sbatch jobs (not srun).
> Would you also want this job step for srun jobs in order to identify
> resource use by something outside of the Slurm-spawned job? Or would having
> any job step available on the node be sufficient?

I was slightly confused about the differences in behavior of all of those options as they relate to batch steps, but I think I get what you mean now. In case I'm wrong, I'll state my understanding so that you can correct me if needed. IIRC, sbatch and salloc both create a batch step. When run within that batch step, srun or a compatible MPI will create a new step. If srun is run directly from a login node with no associated job, a new job is created and only one step is created (step 0 or something) which directly runs the user-specified program.

Due to the difference in behavior, you're thinking of creating an allocation step when using sbatch and salloc, but not when directly creating a job with srun. That approach should be fine. The adoption code could first try adopting processes into a specific allocation step and then fall back to using any available step for that batch job. That should be simple enough and would handle both sbatch/salloc and direct srun.

> Would it be valuable to see resource use in this allocation step for
> accounting purposes? That certainly creates a lot more work than just
> having a cgroup on the node through the job's duration.

It would be valuable to have the accounting information. We strongly prefer it, but would be okay without accounting if it's the difference between making it into 15.08 or not :)

Moe Jette:

(In reply to Ryan Cox from comment #6)
> (In reply to Moe Jette from comment #5)
> > My thought is that the creation of the "allocation step" would be
> > triggered by a configuration of "PrologFlags=alloc". An administrator
> > would need to configure Slurm in order to support this functionality.
>
> I think that is probably a good idea. We don't currently use a prolog
> ourselves (we do use an epilog), so I don't have any experience with that
> flag. The only question I would have is whether that may adversely affect
> other sites' use of that flag. If you don't think it will, this seems like
> a good approach.

You would not need to have a Prolog, but I want to re-use some existing logic from there. I'll probably have some more questions as time goes on, but I'm hoping to have this done within a couple of weeks.

Ryan Cox:

Great. Thanks!
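The fallback order Ryan describes (adopt into the job's dedicated allocation step when one exists, otherwise into any step the job already has on the node) is sketched below. The helper functions (`find_allocation_step`, `find_any_step`, `adopt_into_step`) are hypothetical stubs for illustration, not the pam_slurm_adopt or Slurm API; only the control flow is the point.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

typedef struct {
	uint32_t job_id;
	uint32_t step_id;
} step_ref_t;

/* Hypothetical stubs; a real module would query the local slurmd/stepds. */
static int find_allocation_step(uint32_t job_id, step_ref_t *out)
{
	(void) job_id; (void) out;
	return -1;			/* pretend no allocation step exists here */
}

static int find_any_step(uint32_t job_id, step_ref_t *out)
{
	out->job_id = job_id;
	out->step_id = 0;		/* pretend step 0 is running on this node */
	return 0;
}

static int adopt_into_step(pid_t pid, const step_ref_t *step)
{
	printf("adopting pid %d into %u.%u\n", (int) pid,
	       (unsigned) step->job_id, (unsigned) step->step_id);
	return 0;
}

static int adopt_process(pid_t pid, uint32_t job_id)
{
	step_ref_t step;

	/* Prefer the dedicated allocation step created at job start
	 * (salloc/sbatch jobs with the feature enabled). */
	if (find_allocation_step(job_id, &step) == 0)
		return adopt_into_step(pid, &step);

	/* Otherwise fall back to any step the job has on this node,
	 * which also covers jobs launched directly by srun. */
	if (find_any_step(job_id, &step) == 0)
		return adopt_into_step(pid, &step);

	return -1;			/* nothing to adopt into */
}

int main(void)
{
	return adopt_process(12345, 1001) == 0 ? 0 : 1;
}
```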
Moe Jette (comment #9):

I have code in the master branch to create a job container at job allocation time and account for resource usage in that container. You can enable it with the configuration option "PrologFlags=contain". You do not need a Prolog script to use this feature, but one will run if so configured. In the accounting records, the job appears as "#.extern" (as in "external": resource use outside of direct Slurm control). The cgroup will have a path that includes "step_extern", following the same model as "step_batch" and avoiding a huge numeric value. The numeric value is "INFINITE" or 0xFFFFFFFF, but I do not believe that you will need to deal with that numeric value anywhere. I'm still working on a few configuration-dependent issues in the regression tests, but the Slurm logic should be working fine. Let me know if you encounter any problems.

I'm going to close this bug and suggest that we continue discussions using bug 489, Cgroup enhancements: http://bugs.schedmd.com/show_bug.cgi?id=489

Ryan Cox:

(In reply to Moe Jette from comment #9)
> I have code in the master branch to create a job container at job
> allocation time and account for resource usage in that container. You can
> enable it with the configuration option "PrologFlags=contain". You do not
> need a Prolog script to use this feature, but one will run if so
> configured. In the accounting records, the job appears as "#.extern" (as in
> "external": resource use outside of direct Slurm control). The cgroup will
> have a path that includes "step_extern", following the same model as
> "step_batch" and avoiding a huge numeric value. The numeric value is
> "INFINITE" or 0xFFFFFFFF, but I do not believe that you will need to deal
> with that numeric value anywhere. I'm still working on a few
> configuration-dependent issues in the regression tests, but the Slurm logic
> should be working fine. Let me know if you encounter any problems.

Great! I'll try to test this early next week.

> I'm going to close this bug and suggest that we continue discussions using
> bug 489, Cgroup enhancements:
> http://bugs.schedmd.com/show_bug.cgi?id=489

Good idea. Thanks.
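A minimal sketch of the naming described above, assuming only what Moe states: with "PrologFlags=contain" set in slurm.conf, the allocation-time container appears in accounting as "#.extern", its cgroup directory is named "step_extern" rather than a numeric value, and the numeric step id behind it is INFINITE (0xFFFFFFFF). The mapping helper below is hypothetical and is not Slurm's cgroup plugin code.

```c
#include <stdint.h>
#include <stdio.h>

/* The numeric id mentioned above for the extern step (INFINITE). */
#define STEP_ID_EXTERN 0xFFFFFFFFu

/* Hypothetical helper: map a step id to its cgroup directory name. */
static void step_cgroup_name(uint32_t step_id, char *buf, size_t len)
{
	if (step_id == STEP_ID_EXTERN)
		snprintf(buf, len, "step_extern");	/* allocation container */
	else
		snprintf(buf, len, "step_%u", (unsigned) step_id);
}

int main(void)
{
	char name[32];

	step_cgroup_name(STEP_ID_EXTERN, name, sizeof(name));
	printf("%s\n", name);		/* prints: step_extern */

	step_cgroup_name(0, name, sizeof(name));
	printf("%s\n", name);		/* prints: step_0 */
	return 0;
}
```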