Hello, I'm testing the master branch on alva to try some different methods of dealing with hetjobs on Cray. In any case, I noticed that the current master branch seems to not allow Cray systems to use PrologFlags=contain -- which we rather need. Please assure that PrologFlags=contain will also work with ProcTrackType=proctrack/cray. Thanks, Doug
Tim indicated I should mention that the real dependency is on task/cgroup, and that I should mention bug 5836 / f293a76350 to help smooth this through.
Hi Doug, Yep, you're right. I remember doing that. I'll get that fixed.
Bug 5836 comment 4: "It looks like x11 is just incompatible with proctrack/linuxproc. linuxproc doesn't want to kill processes that don't belong to that user - and since the slurmstepd (extern step) that handles x11 forwarding belongs to root, it isn't getting killed. https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/proctrack/linuxproc/kill_tree.c#L278-L288 That is where this message gets printed: [26.extern] debug2: 30654 (slurmstepd) is not a user command. Skipped sending signal 9 Eventually it times out (after UnkillableStepTimeout seconds), and the extern step is SIGKILL'ed and the node drained. I can document this and also print add error messages that indicate that X11 won't work with proctrack/linuxproc." So this isn't quite as straightforward. Also, in the private comments in bug 5836, I was under the impression that pam_slurm_adopt didn't work without task/cgroup AND proctrack/cgroup. So, this needs further study. I'll let you know what I find out. Regardless, we will make sure that PrologFlags=contain + proctrack/cray is a valid combination.
Hi Doug, We've committed the following to the master branch and will be included in 19.05rc1. commit 1e2bc37a7a62770643f05e195525f86ec22841dc Author: Marshall Garey <marshall@schedmd.com> AuthorDate: Tue Apr 16 15:15:57 2019 -0600 Clarify pam_slurm_adopt ProcTrackType requirement. Bug 6824. commit c6e6089f0c85b4b1f0f8a84bcea2f1ee79d6ec5c Author: Marshall Garey <marshall@schedmd.com> AuthorDate: Tue Apr 16 15:08:56 2019 -0600 Enforce valid combinations of ProcTrackType and PrologFlags PrologFlags=x11 is incompatible with proctracktype=proctrack/linuxproc because linuxproc cannot kill processes not owned by the user, and root owns processes spawned for handling x11 forwarding. When using PrologFlags=contain, it is recommended to use proctrack/cgroup or proctrack/cray because contain is mostly useful for pam_slurm_adopt which requires proctrack/cgroup or proctrack/cray. Bug 6824. commit e9e277a435bc8877e62989990df2607900bf2cdb Author: Marshall Garey <marshall@schedmd.com> AuthorDate: Tue Apr 16 14:55:46 2019 -0600 Revert regression caused by f293a76350 Commit f293a76350 caused slurmctld to fatal when proctrack/cray and PrologFlags=Contain are both in slurm.conf. proctrack/cray should be allowed with PrologFlags=contain. Bug 6824. Thanks for submitting the bug report.
Thanks Marshall!
Hi Marshall, We're working on preping our test systems for 19.05 on Cray CLE7 and I've run into this bug. I'm re-opening as I've found an issue as it is checking for proctrack/cray which no longer exists in 19.05, it needs to check for proctrack/cray_aries. Viz: Jun 28 23:03:24 ctlnet1 slurmctld[28329]: error: WARNING: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray is required. ctlnet1:~ # grep proctrack /etc/slurm/slurm.conf ProctrackType=proctrack/cray_aries The code that checks: if (xstrcmp(conf->proctrack_type, "proctrack/cgroup") && xstrcmp(conf->proctrack_type, "proctrack/cray")) error("WARNING: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray is required."); } I suspect it's a trivial change of making "proctrack/cray" be "proctrack/cray_aries". All the best, Chris
(In reply to Chris Samuel (NERSC) from comment #19) > I suspect it's a trivial change of making "proctrack/cray" be > "proctrack/cray_aries". Our local patch making this change resolves this issue.
Thanks for reporting this - sorry about that oversight. I'll poke Tim directly to see if he can add this to 19.05.1 before it is tagged.
Hey Marshall, (In reply to Marshall Garey from comment #21) > Thanks for reporting this - sorry about that oversight. I'll poke Tim > directly to see if he can add this to 19.05.1 before it is tagged. Not a problem, with all that renaming going on it's not unexpected! All the best, Chris
We have a patch in the queue to be reviewed. At first I thought the slurmctld fatal'd again, but since it's just an erroneous error message, it's not being rushed, so may or may not make it into 19.05.1. I'll let you know when it's been committed/pushed upstream.
*** Ticket 7431 has been marked as a duplicate of this ticket. ***
Created attachment 11017 [details] slurm_19.05.0_bug6824_missing_aries.patch Just for the record I've attached our local patch that resolved this issue.
Hi there, Any news on fixing this last little bug please? All the best, Chris
Hi Chris, It's been sitting in our internal review/QA queue, so the fix isn't upstream yet. Unfortunately our main developers/reviewers have been extremely busy, so small issues like this one sometimes take awhile to get reviewed and committed upstream. I'll ping the reviewers about it. - Marshall
This is fixed in commit 821a261b51b6a, along with a bit more clarification in the error message. It will be in 19.05.3
Hi Marshall, Thanks for that, much appreciated! All the best, Chris
*** Ticket 7618 has been marked as a duplicate of this ticket. ***
*** Ticket 7720 has been marked as a duplicate of this ticket. ***