Ticket 5836

Summary:	Nodes drained with "kill task failed" when --x11 is used
Product:	Slurm	Reporter:	Michael Gutteridge <mrg>
Component:	slurmd	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	18.08.1
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=6824
Site:	FHCRC - Fred Hutchinson Cancer Research Center	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	18.08.4 19.05.0pre2
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	slurm.conf slurmd log snippet at debug2

Description Michael Gutteridge 2018-10-10 16:52:57 MDT

Created attachment 7996 [details]
slurm.conf

I've just built Slurm 18.08.1 with X11 support enabled and am now having trouble with nodes ending up drained with the reason "Kill task failed".  This only seems to happen when the job is run with `--x11` and happens whether or not an X client is started.  

The X session seems to be created just fine (I see the client), but when I shut down the client or when the job completes we see:

- the job remains in state "CG" for an unusual amount of time

gizmod6[~]: squeue
          JOBID           JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(REASON) PRIORITY
             13              13       uid   (null)     debug (null)   hostname           CG       0:01      1 1    1    gizmod2 4294901749

- on the daemon node there is a lingering slurmstepd process:

   uid      20521     1  0 15:41 ?        00:00:00 slurmstepd: [13.extern]

The daemon's logs have this information:

slurmd-gizmod2: debug:  Checking credential with 320 bytes of sig data
slurmd-gizmod2: _run_prolog: run job script took usec=7
slurmd-gizmod2: _run_prolog: prolog with lock for job 13 ran for 0 seconds
slurmd-gizmod2: launch task 13.0 request from UID:34152 GID:34152 HOST:140.107.217.124 PORT:22723
slurmd-gizmod2: debug:  Checking credential with 320 bytes of sig data
slurmd-gizmod2: debug:  Leaving stepd_get_x11_display
slurmd-gizmod2: debug:  task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd-gizmod2: debug:  binding tasks:1 to nodes:0 sockets:1:0 cores:1:0 threads:1
slurmd-gizmod2: lllp_distribution jobid [13] implicit auto binding: sockets,one_thread, dist 8192
slurmd-gizmod2: _task_layout_lllp_cyclic 
slurmd-gizmod2: _lllp_generate_cpu_bind jobid [13]: mask_cpu,one_thread, 0x001
slurmd-gizmod2: debug:  task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x001)
slurmd-gizmod2: debug:  Waiting for job 13's prolog to complete
slurmd-gizmod2: debug:  Finished wait for job 13's prolog to complete
slurmd-gizmod2: debug:  task_p_slurmd_reserve_resources: 13
slurmd-gizmod2: debug:  _rpc_terminate_job, uid = 6281
slurmd-gizmod2: debug:  task_p_slurmd_release_resources: affinity jobid 13
slurmd-gizmod2: debug:  credential for job 13 revoked

   .... this is when the job (hostname) finished ....

slurmd-gizmod2: debug:  _rpc_terminate_job, uid = 6281
slurmd-gizmod2: debug:  task_p_slurmd_release_resources: affinity jobid 13

   .... the job is now in CG until ....

slurmd-gizmod2: debug:  Waiting for job 13's prolog to complete
slurmd-gizmod2: debug:  Finished wait for job 13's prolog to complete
slurmd-gizmod2: debug:  completed epilog for jobid 13
slurmd-gizmod2: debug:  Job 13: sent epilog complete msg: rc = 0

I've seen a number of bugs indicating that X11UseLocalhost=yes is required, however I've been unable to make X11 work without setting it to "no".  The display shows up fine, but perhaps it's breaking something else.

I'll attach the slurm.conf I'm using on this cluster.  Thanks for your help

Michael

Comment 1 Michael Gutteridge 2018-10-11 10:51:26 MDT

Some updates after a little more work:

- when I switched process tracking to "pgid" I no longer get this behavior.  It does seem to be related to the "linuxproc" tracker.

- I'll upload a snippet of the slurmd log (debug2) for one of the hung-up jobs.  I think this message is pretty interesting:

[2018-10-11T09:21:32.705] [26.extern] debug2: 30654 (slurmstepd) is not a user command.  Skipped sending signal 18

That is the process that's stuck and ultimately causes slurmd to give up on killing it.  From appearances, this process is one half of the X11 keepalive.

So- I've been wanting to move to cgroups anyway (though I prefer to do that as a strategy, not bugfix).  I'm happy to work the problem some more though.

Thanks

Michael

Comment 2 Michael Gutteridge 2018-10-11 10:51:54 MDT

Created attachment 8008 [details]
slurmd log snippet at debug2

Comment 4 Marshall Garey 2018-10-15 11:36:05 MDT

It looks like x11 is just incompatible with proctrack/linuxproc. linuxproc doesn't want to kill processes that don't belong to that user - and since the slurmstepd (extern step) that handles x11 forwarding belongs to root, it isn't getting killed.

https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/proctrack/linuxproc/kill_tree.c#L278-L288

That is where this message gets printed:

[26.extern] debug2: 30654 (slurmstepd) is not a user command.  Skipped sending signal 9

Eventually it times out (after UnkillableStepTimeout seconds), and the extern step is SIGKILL'ed and the node drained.


I can document this and also print add error messages that indicate that X11 won't work with proctrack/linuxproc.

Comment 5 Michael Gutteridge 2018-10-15 11:41:00 MDT

Good enough for me.  I can (now) see how it could be problematic with linuxproc.

I've already made the change here and with cgroup process tracking it's working well.

Thanks for the info.

Michael

Comment 13 Marshall Garey 2018-11-05 09:39:37 MST

We decided that we'll probably fatal() on startup if the Contain flag is not used with proctrack/cgroup, since Contain only works with cgroups.

I'll let you know when we've committed a patch.

Comment 16 Marshall Garey 2018-12-05 16:56:19 MST

On the master branch, we've added a fatal() call if using PrologFlags=Contain and you don't have ProctrackType=proctrack/cgroup. Commit f293a76350. We've documented this new behavior in commit 065d6554a1, which is included in 18.08.

I'm closing this ticket. Thanks for reporting it.