| Summary: | Nodes drained with "kill task failed" when --x11 is used | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6824 | | |
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | | |
| Version Fixed: | 18.08.4, 19.05.0pre2 | | |
| Attachments: | slurm.conf, slurmd log snippet at debug2 | | |
Some updates after a little more work:

- When I switched process tracking to "pgid" I no longer get this behavior, so it does seem to be related to the "linuxproc" tracker.
- I'll upload a snippet of the slurmd log (debug2) for one of the hung-up jobs. I think this message is pretty interesting:

```
[2018-10-11T09:21:32.705] [26.extern] debug2: 30654 (slurmstepd) is not a user command. Skipped sending signal 18
```

That is the process that's stuck and ultimately causes slurmd to give up on killing it. From appearances, this process is one half of the X11 keepalive.

So: I've been wanting to move to cgroups anyway (though I prefer to do that as a strategy, not a bugfix). I'm happy to work the problem some more, though.

Thanks

Michael

Created attachment 8008 [details]
slurmd log snippet at debug2
It looks like X11 is simply incompatible with proctrack/linuxproc. linuxproc doesn't want to kill processes that don't belong to the job's user, and since the slurmstepd (extern step) that handles X11 forwarding belongs to root, it isn't getting killed.

https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/proctrack/linuxproc/kill_tree.c#L278-L288

That is where this message gets printed:

```
[26.extern] debug2: 30654 (slurmstepd) is not a user command. Skipped sending signal 9
```

Eventually it times out (after UnkillableStepTimeout seconds), the extern step is SIGKILL'ed, and the node is drained. I can document this and also add error messages indicating that X11 won't work with proctrack/linuxproc.

Good enough for me. I can (now) see how it could be problematic with linuxproc. I've already made the change here, and with cgroup process tracking it's working well. Thanks for the info.

Michael

We decided that we'll probably fatal() on startup if the Contain flag is used without proctrack/cgroup, since Contain only works with cgroups. I'll let you know when we've committed a patch.

On the master branch, we've added a fatal() call if you use PrologFlags=Contain without ProctrackType=proctrack/cgroup (commit f293a76350). We've documented this new behavior in commit 065d6554a1, which is included in 18.08. I'm closing this ticket. Thanks for reporting it.
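The startup check enforces the combination this thread converges on. A minimal slurm.conf sketch of the supported configuration (values illustrative, not taken from the reporter's attached config):

```
# X11 forwarding runs inside the extern step, so the Contain prolog
# flag must be paired with cgroup process tracking.
ProctrackType=proctrack/cgroup
PrologFlags=Contain,X11
```

With anything other than proctrack/cgroup here, 18.08.4 and later refuse to start rather than drain nodes later.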
Created attachment 7996 [details]
slurm.conf

I've just built Slurm 18.08.1 with X11 support enabled and am now having trouble with nodes ending up drained with the reason "Kill task failed". This only seems to happen when the job is run with `--x11`, and happens whether or not an X client is started. The X session seems to be created just fine (I see the client), but when I shut down the client or when the job completes we see:

- the job remains in state "CG" for an unusual amount of time:

```
gizmod6[~]: squeue
JOBID JOBID USER ACCOUNT PARTITION QOS NAME ST TIME NODES CPUS MIN_ NODELIST(REASON) PRIORITY
13 13 uid (null) debug (null) hostname CG 0:01 1 1 1 gizmod2 4294901749
```

- on the daemon node there is a lingering slurmstepd process:

```
uid 20521 1 0 15:41 ? 00:00:00 slurmstepd: [13.extern]
```

The daemon's logs have this information:

```
slurmd-gizmod2: debug: Checking credential with 320 bytes of sig data
slurmd-gizmod2: _run_prolog: run job script took usec=7
slurmd-gizmod2: _run_prolog: prolog with lock for job 13 ran for 0 seconds
slurmd-gizmod2: launch task 13.0 request from UID:34152 GID:34152 HOST:140.107.217.124 PORT:22723
slurmd-gizmod2: debug: Checking credential with 320 bytes of sig data
slurmd-gizmod2: debug: Leaving stepd_get_x11_display
slurmd-gizmod2: debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd-gizmod2: debug: binding tasks:1 to nodes:0 sockets:1:0 cores:1:0 threads:1
slurmd-gizmod2: lllp_distribution jobid [13] implicit auto binding: sockets,one_thread, dist 8192
slurmd-gizmod2: _task_layout_lllp_cyclic
slurmd-gizmod2: _lllp_generate_cpu_bind jobid [13]: mask_cpu,one_thread, 0x001
slurmd-gizmod2: debug: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x001)
slurmd-gizmod2: debug: Waiting for job 13's prolog to complete
slurmd-gizmod2: debug: Finished wait for job 13's prolog to complete
slurmd-gizmod2: debug: task_p_slurmd_reserve_resources: 13
slurmd-gizmod2: debug: _rpc_terminate_job, uid = 6281
slurmd-gizmod2: debug: task_p_slurmd_release_resources: affinity jobid 13
slurmd-gizmod2: debug: credential for job 13 revoked
.... this is when the job (hostname) finished ....
slurmd-gizmod2: debug: _rpc_terminate_job, uid = 6281
slurmd-gizmod2: debug: task_p_slurmd_release_resources: affinity jobid 13
.... the job is now in CG until ....
slurmd-gizmod2: debug: Waiting for job 13's prolog to complete
slurmd-gizmod2: debug: Finished wait for job 13's prolog to complete
slurmd-gizmod2: debug: completed epilog for jobid 13
slurmd-gizmod2: debug: Job 13: sent epilog complete msg: rc = 0
```

I've seen a number of bugs indicating that X11UseLocalhost=yes is required, however I've been unable to make X11 work without setting it to "no". The display shows up fine, but perhaps it's breaking something else. I'll attach the slurm.conf I'm using on this cluster.

Thanks for your help

Michael
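For context, a minimal sketch of the kind of configuration that triggers the behavior reported in this ticket, assuming defaults otherwise (values illustrative, not the attached slurm.conf):

```
# The combination implicated here: X11 forwarding with the linuxproc
# process tracker.  The root-owned extern-step slurmstepd is never
# signalled, so once the unkillable-step timeout expires the node is
# drained with "Kill task failed".
ProctrackType=proctrack/linuxproc
PrologFlags=X11
UnkillableStepTimeout=60   # seconds before a stuck step drains the node
```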