| Summary: | Kill task failed & UnkillableStepTimeout = 180s | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmstepd | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5262 | | |
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | ? | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd log, slurmctld log, cgroup.conf | | |
Description

GSK-ONYX-SLURM 2018-07-26 02:51:54 MDT

Hi Mark, is it still exactly the same situation described in bug 5262, happening only on one node with a python job inside a Slurm sbatch? Did you finally get any slurmd debug2 logs? In any case, send me back the slurmd and slurmctld logs.

Hi. Sorry for the delay. This is a completely different situation from our previous occurrences: different application, different Slurm version. The server was, however, previously in a 17.02.7 environment where we experienced the kill task issue with python jobs and a step timeout of 60s. The server has migrated to a 17.11.7 environment where we are testing another application. See attached logs. Thanks. Mark.

Created attachment 7492 [details]: slurmd log

Created attachment 7493 [details]: slurmctld log

Hi Mark, please tell me what OS this is happening on and your kernel version. Send me also your cgroup.conf. Thanks.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)

# uname -r
3.10.0-862.3.3.el7.x86_64

(In reply to GSK-EIS-SLURM from comment #6)
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>
> # uname -r
> 3.10.0-862.3.3.el7.x86_64

Hi, please attach also cgroup.conf as requested in comment 5. There's possibly a known bug in your RHEL distribution.

Created attachment 7526 [details]: cgroup.conf

Attached cgroup.conf. Sorry, I overlooked it.
Hi Mark, please set this in your cgroup.conf:

ConstrainKmemSpace=no

Your issue is probably related to a kernel bug affecting applications that set a kmem limit. Setting a kmem limit in a cgroup causes a number of slab caches to be created that do not go away when the cgroup is removed, which eventually fills an internal 64K cache and ends up producing the error you see in the Slurm logs. The following patches that would fix the issue appear not to have been committed to the RHEL kernels because of incompatibilities with OpenShift software:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Actual kernel patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6e0b7fa11862433773d986b5f995ffdf47ce672

There's currently an open bug in RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=1507149

Bug 5082 seems to be a possible duplicate of your issue, if you are curious about a more extensive explanation.

The workaround is simply to set "ConstrainKmemSpace=No" in cgroup.conf and then reboot the affected nodes. This parameter has been set to "no" by default in 18.08 (commit 32fabc5e006b8f41). Tell me whether applying this change fixes the issue for you.

Regards,
Felip

Hi Felip. We were also advised to set ConstrainKmemSpace=No in bug 5497 and we did implement that, which resolved our cgroup issues. Unfortunately that change was lost when cgroup.conf was overwritten as part of another change, so the cgroup.conf I then sent you no longer had the fix. The timeline is:

25 July: I log this bug, 5485
30 July: We implement ConstrainKmemSpace=No as a test
07 Aug: cgroup.conf gets overwritten and loses the ConstrainKmemSpace fix
07 Aug: I send you the old cgroup.conf

I have reinstated ConstrainKmemSpace=No and we'll continue to monitor. We did not see any further recurrence of the kill task failed issue while the ConstrainKmemSpace fix was in place between 30 July and 07 Aug. Please go ahead and close this bug; if the issue recurs we'll log another bug or reopen this one. Thanks. Mark.

OK Mark, I don't expect you to see this error again if your cgroup.conf is correctly set. In any case, as you said: reopen this one or open a new bug. Best regards, Felip M
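For reference, a minimal cgroup.conf sketch with the workaround in place. Only the ConstrainKmemSpace=no line comes from this report; the other constraint settings are common options shown for context and may not match the site's actual attached file.

```
###
# cgroup.conf — illustrative sketch, not GSK's attached configuration.
# Only ConstrainKmemSpace is taken from this report.
###
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# Workaround for the RHEL kmem slab-cache leak discussed above;
# affected nodes must be rebooted for it to take effect.
ConstrainKmemSpace=no
```

On 18.08 and later the explicit line is unnecessary, since ConstrainKmemSpace already defaults to "no" there.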