Ticket 6824 - slurmctld: fatal: PrologFlags=Contain requires ProcTrackType=proctrack/cgroup
Summary: slurmctld: fatal: PrologFlags=Contain requires ProcTrackType=proctrack/cgroup
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 19.05.0
Hardware: Linux Linux
: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
: 7431 7618 7720 (view as ticket list)
Depends on:
Blocks:
 
Reported: 2019-04-09 23:05 MDT by Doug Jacobsen
Modified: 2020-02-10 15:10 MST (History)
9 users (show)

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05rc1 19.05.3 / 20.02.0rc1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm_19.05.0_bug6824_missing_aries.patch (1.13 KB, patch)
2019-07-25 18:46 MDT, Chris Samuel (NERSC)
Details | Diff

Note You need to log in before you can comment on or make changes to this ticket.
Description Doug Jacobsen 2019-04-09 23:05:49 MDT
Hello,

I'm testing the master branch on alva to try some different methods of dealing with hetjobs on Cray.  In any case, I noticed that the current master branch seems to not allow Cray systems to use PrologFlags=contain -- which we rather need.  Please assure that PrologFlags=contain will also work with ProcTrackType=proctrack/cray.

Thanks,
Doug
Comment 1 Doug Jacobsen 2019-04-10 08:26:05 MDT
Tim indicated I should mention that the real dependency is on task/cgroup, and that I should mention bug 5836 / f293a76350 to help smooth this through.
Comment 2 Marshall Garey 2019-04-10 09:34:07 MDT
Hi Doug,

Yep, you're right. I remember doing that. I'll get that fixed.
Comment 3 Marshall Garey 2019-04-10 10:03:01 MDT
Bug 5836 comment 4:

"It looks like x11 is just incompatible with proctrack/linuxproc. linuxproc doesn't want to kill processes that don't belong to that user - and since the slurmstepd (extern step) that handles x11 forwarding belongs to root, it isn't getting killed.

https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/proctrack/linuxproc/kill_tree.c#L278-L288

That is where this message gets printed:

[26.extern] debug2: 30654 (slurmstepd) is not a user command.  Skipped sending signal 9

Eventually it times out (after UnkillableStepTimeout seconds), and the extern step is SIGKILL'ed and the node drained.


I can document this and also print add error messages that indicate that X11 won't work with proctrack/linuxproc."

So this isn't quite as straightforward. Also, in the private comments in bug 5836, I was under the impression that pam_slurm_adopt didn't work without task/cgroup AND proctrack/cgroup. So, this needs further study. I'll let you know what I find out.

Regardless, we will make sure that PrologFlags=contain + proctrack/cray is a valid combination.
Comment 17 Marshall Garey 2019-04-29 14:20:19 MDT
Hi Doug,

We've committed the following to the master branch and will be included in 19.05rc1.

commit 1e2bc37a7a62770643f05e195525f86ec22841dc
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Apr 16 15:15:57 2019 -0600

    Clarify pam_slurm_adopt ProcTrackType requirement.
    
    Bug 6824.

commit c6e6089f0c85b4b1f0f8a84bcea2f1ee79d6ec5c
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Apr 16 15:08:56 2019 -0600

    Enforce valid combinations of ProcTrackType and PrologFlags
    
    PrologFlags=x11 is incompatible with proctracktype=proctrack/linuxproc
    because linuxproc cannot kill processes not owned by the user, and root
    owns processes spawned for handling x11 forwarding.
    
    When using PrologFlags=contain, it is recommended to use
    proctrack/cgroup or proctrack/cray because contain is mostly useful for
    pam_slurm_adopt which requires proctrack/cgroup or proctrack/cray.
    
    Bug 6824.

commit e9e277a435bc8877e62989990df2607900bf2cdb
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Apr 16 14:55:46 2019 -0600

    Revert regression caused by f293a76350
    
    Commit f293a76350 caused slurmctld to fatal when proctrack/cray and
    PrologFlags=Contain are both in slurm.conf. proctrack/cray should be
    allowed with PrologFlags=contain.
    
    Bug 6824.

Thanks for submitting the bug report.
Comment 18 Chris Samuel (NERSC) 2019-04-29 15:10:35 MDT
Thanks Marshall!
Comment 19 Chris Samuel (NERSC) 2019-06-29 00:13:40 MDT
Hi Marshall,

We're working on preping our test systems for 19.05 on Cray CLE7 and I've run into this bug.

I'm re-opening as I've found an issue as it is checking for proctrack/cray which no longer exists in 19.05, it needs to check for proctrack/cray_aries.

Viz:

Jun 28 23:03:24 ctlnet1 slurmctld[28329]: error: WARNING: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray is required.

ctlnet1:~ # grep proctrack /etc/slurm/slurm.conf
ProctrackType=proctrack/cray_aries

The code that checks:

                        if (xstrcmp(conf->proctrack_type, "proctrack/cgroup") &&
                            xstrcmp(conf->proctrack_type, "proctrack/cray"))
                                error("WARNING: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray is required.");
                }

I suspect it's a trivial change of making "proctrack/cray" be "proctrack/cray_aries".

All the best,
Chris
Comment 20 Chris Samuel (NERSC) 2019-06-29 01:49:37 MDT
(In reply to Chris Samuel (NERSC) from comment #19)

> I suspect it's a trivial change of making "proctrack/cray" be
> "proctrack/cray_aries".

Our local patch making this change resolves this issue.
Comment 21 Marshall Garey 2019-07-01 09:09:07 MDT
Thanks for reporting this - sorry about that oversight. I'll poke Tim directly to see if he can add this to 19.05.1 before it is tagged.
Comment 22 Chris Samuel (NERSC) 2019-07-01 09:27:09 MDT
Hey Marshall,

(In reply to Marshall Garey from comment #21)

> Thanks for reporting this - sorry about that oversight. I'll poke Tim
> directly to see if he can add this to 19.05.1 before it is tagged.

Not a problem, with all that renaming going on it's not unexpected!

All the best,
Chris
Comment 25 Marshall Garey 2019-07-05 10:25:27 MDT
We have a patch in the queue to be reviewed. At first I thought the slurmctld fatal'd again, but since it's just an erroneous error message, it's not being rushed, so may or may not make it into 19.05.1. I'll let you know when it's been committed/pushed upstream.
Comment 26 Albert Gil 2019-07-18 08:48:26 MDT
*** Ticket 7431 has been marked as a duplicate of this ticket. ***
Comment 28 Chris Samuel (NERSC) 2019-07-25 18:46:41 MDT
Created attachment 11017 [details]
slurm_19.05.0_bug6824_missing_aries.patch

Just for the record I've attached our local patch that resolved this issue.
Comment 29 Chris Samuel (NERSC) 2019-08-13 22:00:05 MDT
Hi there,

Any news on fixing this last little bug please?

All the best,
Chris
Comment 30 Marshall Garey 2019-08-14 08:45:09 MDT
Hi Chris,

It's been sitting in our internal review/QA queue, so the fix isn't upstream yet. Unfortunately our main developers/reviewers have been extremely busy, so small issues like this one sometimes take awhile to get reviewed and committed upstream. I'll ping the reviewers about it.

- Marshall
Comment 33 Marshall Garey 2019-08-14 09:50:55 MDT
This is fixed in commit 821a261b51b6a, along with a bit more clarification in the error message. It will be in 19.05.3
Comment 34 Chris Samuel (NERSC) 2019-08-14 11:36:07 MDT
Hi Marshall,

Thanks for that, much appreciated!

All the best,
Chris
Comment 35 Jason Booth 2019-08-22 12:56:43 MDT
*** Ticket 7618 has been marked as a duplicate of this ticket. ***
Comment 36 Jason Booth 2019-08-22 14:32:21 MDT
*** Ticket 7618 has been marked as a duplicate of this ticket. ***
Comment 37 Marshall Garey 2019-09-10 10:49:39 MDT
*** Ticket 7720 has been marked as a duplicate of this ticket. ***