| Summary: | Kill task failed & UnkillableStepTimeout = 180s | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmstepd | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5262 | | |
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | ? | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd log, slurmctld log, cgroup.conf | | |
Description

GSK-ONYX-SLURM 2018-07-26 02:51:54 MDT

Hi Mark, is it still exactly the same situation described in bug 5262, happening only on one node with a python job inside a Slurm sbatch? Did you finally get any slurmd debug2 logs? In any case, send me back the slurmd and slurmctld logs.

Hi. Sorry for the delay. This is a completely different situation from our previous occurrences: different application, different Slurm version. The server was, however, previously in a 17.02.7 environment where we experienced the kill task issue with python jobs and a step timeout of 60s. The server has migrated to a 17.11.7 environment where we are testing another application. See attached logs. Thanks. Mark.

Created attachment 7492 [details]: slurmd log

Created attachment 7493 [details]: slurmctld log

Hi Mark, please tell me what OS this is happening on and your kernel version. Send me also your cgroup.conf. Thanks.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)

# uname -r
3.10.0-862.3.3.el7.x86_64

(In reply to GSK-EIS-SLURM from comment #6)
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>
> # uname -r
> 3.10.0-862.3.3.el7.x86_64

Hi, please attach also cgroup.conf as requested in comment 5. There's possibly a known bug in your RHEL distribution.

Created attachment 7526 [details]: cgroup.conf

Attached cgroup.conf. Sorry, I overlooked it.
Hi Mark, please set this in your cgroup.conf:

ConstrainKmemSpace=no

Your issue is probably related to a kernel bug affecting applications that set a kmem limit. Setting a kmem limit in a cgroup causes a number of slab caches to be created that do not go away when the cgroup is removed, which eventually fills an internal 64K cache and ends up producing the error you see in the Slurm logs. The following patches that would fix the issue appear not to have been committed to the RHEL kernels because of incompatibilities with OpenShift software:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Actual kernel patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6e0b7fa11862433773d986b5f995ffdf47ce672

There's currently an open bug in RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=1507149

Bug 5082 seems to be a possible duplicate of your issue, if you are curious about a more extensive explanation.

The workaround is simply to set "ConstrainKmemSpace=No" in cgroup.conf and then reboot the affected nodes. This parameter has been set to "no" by default in 18.08 (commit 32fabc5e006b8f41). Tell me whether applying this change fixes the issue for you.

Regards,
Felip

Hi Felip. We were also advised to set ConstrainKmemSpace=No in bug 5497 and we did implement that, which resolved our cgroup issues. Unfortunately that change was lost when cgroup.conf was overwritten as part of another change, so the cgroup.conf I then sent you no longer had the fix. The timeline is:

25 July: I log this bug, 5485
30 July: We implement ConstrainKmemSpace=No as a test
07 Aug: cgroup.conf gets overwritten and loses the ConstrainKmemSpace fix
07 Aug: I send you the old cgroup.conf

I have reinstated ConstrainKmemSpace=No and we'll continue to monitor. We did not see any further recurrence of the kill task failed issue while the ConstrainKmemSpace fix was in place between 30 July and 07 Aug. Please go ahead and close this bug; if the issue recurs we'll log another bug or reopen this one. Thanks. Mark.

OK Mark, I don't expect you to see this error again if your cgroup.conf is correctly set. In any case, as you said: reopen this one or open a new bug. Best regards, Felip M
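For reference, a minimal cgroup.conf sketch with the workaround in place. Only the ConstrainKmemSpace=no line comes from this report; the other constraint settings are common options shown for context and may not match the site's actual attached file.

```
###
# cgroup.conf — illustrative sketch, not GSK's attached configuration.
# Only ConstrainKmemSpace is taken from this report.
###
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# Workaround for the RHEL kmem slab-cache leak discussed above;
# affected nodes must be rebooted for it to take effect.
ConstrainKmemSpace=no
```

On 18.08 and later the explicit line is unnecessary, since ConstrainKmemSpace already defaults to "no" there.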