| Summary: | Server drains after kill task failed - JOB NOT ENDING WITH SIGNALS | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmstepd | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | csc-slurm-tickets, felip.moll |
| Version: | 17.02.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5068, https://bugs.schedmd.com/show_bug.cgi?id=5485 | | |
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | Unknown | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description    GSK-ONYX-SLURM    2018-06-05 08:29:16 MDT
Comment 1

Hey Mark,

Are you able to isolate this to a single node, just to make it easier to diagnose? In either case, when you see the situation again, can you go to the node and check whether any stray user processes are still running?

If you can isolate the issue, the proctrack/cgroup plugin (which I believe you are using) also prints messages at debug2 showing which processes the stepd is repeatedly trying to kill, e.g.

debug2("killing process %d (%s) with signal %d", pids[i],
       (slurm_task==1)?"slurm_task":"inherited_task", signal);

This will help show whether the PIDs that the stepd is trying to kill are still around.

Let me know if you have any questions.

Thanks,
Brian

Comment 2

Hi Brian.

Yes, this does seem to be affecting a single compute server at present. I've enabled debug2 for SlurmdDebug just for that server and done a reconfigure. I'm starting to see debug2 messages in the slurmd log file, so we'll wait and see if we get a hit tomorrow (today), which I'm sure we will.

Something else odd I noticed. I set

SlurmdDebug=debug2

in slurm.conf and then did

scontrol reconfigure

But when I do

scontrol show config | grep SlurmdDebug

it still shows...

SlurmdDebug = info

Weird.

Thanks.
Mark.

Comment 3

(In reply to GSK-EIS-SLURM from comment #2)
> Yes this does seem to be affecting a single compute server at present. I've
> enabled debug2 for SlurmdDebug just for that server and done a reconfigure.
> I'm starting to see debug2 messages in the slurmd log file. So we'll wait
> and see if we get a hit tomorrow (today) which I'm sure we will.
>
> Something else odd I noticed. I set
>
> SlurmdDebug=debug2
>
> in slurm.conf and then did
>
> scontrol reconfigure
>
> But when I do
>
> scontrol show config | grep SlurmdDebug
>
> it still shows...
>
> SlurmdDebug = info

My guess: since you only changed slurm.conf on your compute node, this is probably showing the slurm.conf on your slurmctld. You might also see messages indicating your configuration files don't match.

Anyway, let us know if you see an unkillable task again, and check what the unkillable processes are, per comment 1. Thanks.

- Marshall

Comment 4

Ok, understood. Still monitoring the situation. Since we set debug2 there's been no repeat. I'll have to see what we can do to reproduce.

Regarding the reference to bug 5068 you sent me:

scontrol show config | grep -i MemLimitEnforce
MemLimitEnforce = Yes

scontrol show config | grep -i JobAcctGatherParams
JobAcctGatherParams = NoOverMemoryKill

grep ConstrainRAMSpace /etc/slurm/cgroup.conf
ConstrainRAMSpace=yes

So that seems to be saying we may have a conflict. I will look into that.

Looking further into bug 5068, I see we have

grep -i TaskPlugin /etc/slurm/slurm.conf
TaskPlugin=task/cgroup,task/affinity

grep ConstrainCores /etc/slurm/cgroup.conf
ConstrainCores=no    # if TaskAffinity=no in cgroup.conf do we need to set ConstrainCores=no explicitly?

grep TaskAffinity /etc/slurm/cgroup.conf
TaskAffinity=no

What's the likely impact of not having ConstrainCores set to yes?

And I will also look more at increasing UnkillableStepTimeout from our default of 60s, although I would like to get the debug info with the setting at 60s first if I can.

Comment 5

(In reply to GSK-EIS-SLURM from comment #4)
> Ok, understood. Still monitoring the situation. Since we set debug2
> there's been no repeat. I'll have to see what we can do to reproduce.
>
> Regarding the reference to bug 5068 you sent me:
>
> scontrol show config | grep -i MemLimitEnforce
> MemLimitEnforce = Yes
>
> scontrol show config | grep -i JobAcctGatherParams
> JobAcctGatherParams = NoOverMemoryKill
>
> grep ConstrainRAMSpace /etc/slurm/cgroup.conf
> ConstrainRAMSpace=yes
>
> So that seems to be saying we may have a conflict. I will look into that.

If you want cgroup memory enforcement, then yes, it's best to disable memory enforcement by the jobacctgather plugin and let only the cgroup plugin take care of it:

MemLimitEnforce=No
JobAcctGatherParams=NoOverMemoryKill

> Looking further into bug 5068 I see we have
>
> grep -i TaskPlugin /etc/slurm/slurm.conf
> TaskPlugin=task/cgroup,task/affinity

This is what we recommend.

> grep ConstrainCores /etc/slurm/cgroup.conf
> ConstrainCores=no    # if TaskAffinity=no in cgroup.conf do we need to
> set ConstrainCores=no explicitly?

No, since ConstrainCores=no is the default (though setting it explicitly doesn't harm anything).

> grep TaskAffinity /etc/slurm/cgroup.conf
> TaskAffinity=no

This is what we recommend when using the task/affinity plugin (which you are).

> What's the likely impact of not having ConstrainCores set to yes?

It just means that tasks won't be constrained to specific cores with the cgroup cpuset subsystem. They will still be bound to specific CPUs by the task/affinity plugin using sched_setaffinity(), but that doesn't stop a program from calling sched_setaffinity() itself to bind to CPUs it shouldn't have access to. ConstrainCores=yes enforces the CPU binding: tasks can't use sched_setaffinity() to access CPUs that aren't part of the job allocation. So we recommend ConstrainCores=yes.

> And I will also look more at increasing UnkillableStepTimeout from our
> default of 60s, although I would like to get the debug info with the setting
> at 60s first if I can.

Sounds good.
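The recommendations in comments 4 and 5 can be collected into two small configuration fragments. The sketch below is assembled only from the parameters named in this ticket; the file paths match the greps shown above, and the inline comments are editorial rather than taken from the site's actual files.

    # /etc/slurm/slurm.conf (fragment)
    TaskPlugin=task/cgroup,task/affinity    # recommended plugin stack
    MemLimitEnforce=No                      # let the cgroup plugin own memory enforcement
    JobAcctGatherParams=NoOverMemoryKill    # no memory-limit kills from jobacct_gather

    # /etc/slurm/cgroup.conf (fragment)
    ConstrainRAMSpace=yes                   # cgroup memory limits (already set at this site)
    ConstrainCores=yes                      # enforce CPU binding via the cpuset subsystem
    TaskAffinity=no                         # task/affinity already handles binding

Changes to slurm.conf can usually be picked up with scontrol reconfigure, while cgroup.conf changes are typically applied by restarting slurmd on the compute nodes. Either way, scontrol show config reports the controller's copy of slurm.conf, which is the mismatch seen in comments 2 and 3.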
Comment 6

Hey Mark,

Are you still seeing the draining issue? Were you able to reproduce it?

Thanks,
Brian

Comment 7

Hi. Since enabling debug we haven't seen a repeat of the issue, but we got hit by other issues which may have masked this.

We have implemented UnkillableStepTimeout at 180s in our test/dev clusters and will be implementing it in our production clusters tomorrow, 10 July. Once we've implemented this successfully in production, I will confirm that we can close the call. If the problem re-occurs with the timeout at 180s, then I'll log another ticket.

I'll confirm in the next two days that this bug can be closed.

Thanks.
Mark.

Comment 8

Thanks Mark. Keep us posted.

Comment 9

Hi. Please go ahead and close this bug. Since increasing the timeout to 180s we've not seen any repeats. If we do, I'll reopen this bug or log a new one.

Thanks for your help.

Cheers.
Mark.

Comment 10

Sounds good. Thanks.
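For completeness, the change that resolved the draining (comments 7 and 9) is a single slurm.conf parameter. A minimal sketch follows, assuming the same /etc/slurm/slurm.conf path used in the greps earlier; the 180-second value is the one the site reported, and the verification commands are the generic scontrol calls already used in this ticket.

    # /etc/slurm/slurm.conf (fragment)
    # Give slurmstepd longer to clean up job processes before the node is drained
    # with "kill task failed"; the site's previous setting was the 60s default.
    UnkillableStepTimeout=180

    # Push the change out and confirm what the controller sees:
    scontrol reconfigure
    scontrol show config | grep UnkillableStepTimeout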