Hi. We had previous issues with "kill task failed" draining servers when the UnkillableStepTimeout parameter was set at the default 60s. We increased this across both our dev/test and production clusters. We now seem to have a repeat in our 17.11.7 dev/test environment with UnkillableStepTimeout set at 180s.

uk1sxlx00091 (The Lion): sinfo -Nl
Thu Jul 26 09:39:32 2018
NODELIST      NODES PARTITION        STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
uk1salx00553      1 uk_columbus_tst  drained    48 48:1:1 257526     2036      1 (null)   Kill task failed
uk1salx00553      1 uk_test_hpc*     drained    48 48:1:1 257526     2036      1 (null)   Kill task failed
uk1sxlx00095      1 uk_test_hpc*     idle        1 1:1:1    1800     2036      1 (null)   none

uk1sxlx00091 (The Lion): scontrol show config | grep -i UnkillableStepTimeout
UnkillableStepTimeout   = 180 sec

uk1sxlx00091 (The Lion): scontrol show node=uk1salx00553
NodeName=uk1salx00553 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=48 CPULoad=2.62
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8
   NodeAddr=uk1salx00553 NodeHostName=uk1salx00553 Version=17.11
   OS=Linux 3.10.0-862.3.3.el7.x86_64 #1 SMP Wed Jun 13 05:44:23 EDT 2018
   RealMemory=257526 AllocMem=0 FreeMem=827 Sockets=48 Boards=1
   MemSpecLimit=1024
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=2036 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=uk_test_hpc,uk_columbus_tst
   BootTime=2018-07-21T14:00:25 SlurmdStartTime=2018-07-25T17:49:26
   CfgTRES=cpu=48,mem=257526M,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Kill task failed [root@2018-07-25T21:10:04]

I cannot see anything in the logs like the previous instances, where the completion process was reported as taking in excess of 60s. What further information do you need?

Thanks.
Mark.
Hi Mark,

Is it still exactly the same situation described in 5262, only happening on one node with a Python job inside a Slurm sbatch? Did you ever get any slurmd debug2 logs? In any case, please send me back the slurmd and slurmctld logs.
Hi. Sorry for the delay. This is a completely different situation from our previous occurrences: different application, different Slurm version. The server was, however, previously in a 17.02.7 environment where we experienced the kill task issue with Python jobs and the step timeout at 60s. The server has since migrated to a 17.11.7 environment where we are testing another application. See attached logs.

Thanks.
Mark.
Created attachment 7492 [details] slurmd log
Created attachment 7493 [details] slurmctld log
Hi Mark,

Please tell me which OS this is happening on and your kernel version. Please also send me your cgroup.conf.

Thanks
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)

# uname -r
3.10.0-862.3.3.el7.x86_64
(In reply to GSK-EIS-SLURM from comment #6)
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>
> # uname -r
> 3.10.0-862.3.3.el7.x86_64

Hi, please also attach cgroup.conf as requested in comment 5. There is possibly a known bug in your RHEL distribution.
Created attachment 7526 [details] cgroup.conf

Attached cgroup.conf. Sorry, I overlooked it.
Hi Mark,

Please set this in your cgroup.conf:

ConstrainKmemSpace=no

Your issue is probably related to a kernel bug triggered by applications that set a kmem limit. Setting a kmem limit in a cgroup causes a number of slab caches to be created that do not go away when the cgroup is removed, eventually filling up an internal 64K cache and producing the error you see in the Slurm logs.

The following patches that would fix the issue do not appear to have been committed to RHEL kernels because of incompatibilities with OpenShift software:
https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

Actual kernel patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6e0b7fa11862433773d986b5f995ffdf47ce672

There is currently an open bug in RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=1507149

Bug 5082 seems to be a possible duplicate of your issue, if you are curious about a more extensive explanation.

The workaround is simply to set "ConstrainKmemSpace=no" in cgroup.conf; you must then reboot the affected nodes. This parameter is set to "no" by default in 18.08 (commit 32fabc5e006b8f41).

Tell me if applying this change fixes the issue for you.

Regards,
Felip
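For reference, a minimal cgroup.conf with the workaround in place might look like the sketch below. Only the ConstrainKmemSpace=no line is the recommended change; the other entries are illustrative placeholders and should stay as whatever you already have in your attached cgroup.conf:

# cgroup.conf -- illustrative sketch, not your actual file
CgroupAutomount=yes
# keep your existing Constrain* settings as they are; these are placeholders
ConstrainCores=yes
ConstrainRAMSpace=yes
# workaround for the kmem slab-cache kernel bug described above
ConstrainKmemSpace=no

After deploying the change, remember that the affected nodes need a reboot; the leaked slab caches live in the kernel, so restarting slurmd alone is not expected to clear them.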
Hi Felip,

We were also given advice on setting ConstrainKmemSpace=no in bug 5497, and we implemented it, which resolved our cgroup issues. Unfortunately that change was lost when cgroup.conf was overwritten as part of another change, and I then sent you a cgroup.conf that no longer had the fix. So the timeline is:

25 July - I log this bug, 5485
30 July - We implement ConstrainKmemSpace=no as a test
07 Aug  - cgroup.conf gets overwritten and loses the ConstrainKmemSpace fix
07 Aug  - I send you the old cgroup.conf

I have re-instated ConstrainKmemSpace=no and we will continue to monitor. We did not see any recurrence of the kill task failed issue while the ConstrainKmemSpace fix was in place between 30 July and 07 Aug.

Please go ahead and close this bug. If the issue re-occurs we will log another bug or re-open this one.

Thanks.
Mark.
OK Mark,

I don't expect you to see this error again if your cgroup configuration is set correctly. In any case, as you said: reopen this bug or open a new one!

Best regards,
Felip M