Ticket 11175 - Allocating srun job should not be scheduled if client is disconnected
Summary: Allocating srun job should not be scheduled if client is disconnected
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.5
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-22 17:07 MDT by Felix Abecassis
Modified: 2021-08-25 15:46 MDT
CC List: 2 users

See Also:
Site: NVIDIA (PSLA)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Felix Abecassis 2021-03-22 17:07:23 MDT
Tested with 20.02.6, 20.11.5 and a recent master (db2446ac54cf3e83f00d6de8eb27c8af32ccdf4e).

Let's assume we have a cluster with 100 nodes. A user submits an interactive srun job (from the login node, but without --pty) requesting 90 nodes. Since the cluster is busy, this job will only be able to start in a few days.

In the meantime, the user's srun process on the login node is terminated, for example because of a network issue, a reboot of the login node, kill -9 $(pgrep -U ${USER} srun), etc.
The job will remain pending even though there is no longer any process around to receive its output, and using srun -o job.log doesn't change the situation ("scontrol show job" lists no StdOut for such a job).
From the scheduler's point of view this job is still eligible for execution: the nodes will be allocated, the prolog will run, and then the job will promptly fail, e.g. after 11 seconds:
 $ scontrol show job XYZ     
   JobId=XYZ JobName=bash
   UserId=fabecassis(11838) GroupId=dip(30) MCS_label=N/A
   Priority=2028216 Nice=0 Account=admin QOS=all
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=1:0
   RunTime=00:00:11 TimeLimit=01:00:00 TimeMin=N/A

This is obviously better than running on all nodes for the full duration and sending the output to /dev/null. But it is still a waste of compute resources, particularly for large jobs, since the scheduler needs to accumulate idle nodes while preparing for such a large job (see https://cug.org/proceedings/cug2019_proceedings/includes/files/pres115s1.pdf). When a large number of jobs are queued on the cluster, it is also easy to forget that you submitted a large job interactively one week ago.
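For reference, here is a rough one-liner to list pending jobs that were submitted without a batch script (BatchFlag=0 in "scontrol show job"), which is where such orphaned sruns show up; it is only illustrative and does not tell you whether the srun client is still alive:
$ for j in $(squeue -t PD -h -o %i); do scontrol show job $j | grep -q 'BatchFlag=0' && echo $j; done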

Is there a way to prevent this behavior from happening? From slurmctld, I see the following log when the job gets cancelled after the prolog has executed:
slurmctld: Killing interactive JobId=XYZ: Communication connection failure
Could this check be performed regularly instead of only after the job has started?

To reproduce the problem, you can do something like this:
$ srun -N1 -t 60 -o test.log --begin=now+2minutes bash -c "echo START ; srun bash -c 'echo foo > ~/bar' ; sleep 1200s" & sleep 15s ; kill -9 $!
None of the job commands will be executed, but the job will still start.


It gets even nastier when combining this bug with a node_features plugin and a constraint requiring a reboot. Simultaneously, the node starts rebooting and the job is cancelled. I suppose Slurm lost track of the reboot request since the node is immediately drained with the following:
$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Epilog error         slurm     2021-03-22T13:28:39 node-XYZ
In one case, an unrelated batch job even "started" on this particular node after the cancellation of the interactive srun, and accumulated 5 minutes of execution time before failing with JobState=NODE_FAIL.
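For completeness, a drain like this has to be cleared manually with the standard scontrol command (node name as in the sinfo output above):
$ scontrol update NodeName=node-XYZ State=RESUME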
Comment 2 Luke Yeager 2021-03-23 11:58:12 MDT
(In reply to Felix Abecassis from comment #0)
> It gets even nastier when combining this bug with a node_features plugin and
> a constraint requiring a reboot. Simultaneously, the node starts rebooting
> and the job is cancelled. I suppose Slurm lost track of the reboot request
> since the node is immediately drained with the following:
The combination of this bug with our node_features plugin[s] is indeed nasty, but it's not precisely because Slurm loses track of the reboot request.

The issue that I've observed is this:

1. Slurm recognizes that a reboot is required in order to fulfill the constraint of the job. It calls node_features_p_node_set() for our plugin before rebooting the node.
2. Slurm recognizes that the srun process has been killed and that the job should be cancelled. It kicks off the epilog (why?). Bizarrely, the prolog never runs, just the epilog.
3. There is a race condition between these two tasks.

Sometimes, the plugin finishes node_features_p_node_set() and then initiates the reboot before the epilog starts. In this case, the epilog fails with all sorts of crazy errors because the node is in the process of shutting down.

Sometimes, the epilog runs while the node_features_p_node_set() function is still running. In one case, the plugin uninstalls and reinstalls the nvidia driver. Depending on which task happens first, sometimes this causes the epilog to fail. Sometimes it causes the driver reinstall to fail (which is the worst of all, because the node reboots without a driver installed).

Here are the relevant snippets of the slurmctld and slurmd logs for one of these failure modes:

> Mar 22 16:42:07 login-node slurmctld[11735]: JobId=98975 nhosts:1 ncpus:1 node_req:64000 nodes=compute-node-747
> Mar 22 16:42:07 login-node slurmctld[11735]: reboot_job_nodes: reboot nodes compute-node-747 features nvidia_driver=XXX
> Mar 22 16:42:07 login-node slurmctld[11735]: sched/backfill: _start_job: Started JobId=98975 in batch on compute-node-747
> Mar 22 16:42:07 compute-node-747 slurmd[3840]: Node reboot request with features nvidia_driver=XXX being processed
> Mar 22 16:42:07 compute-node-747 systemd[1]: Stopping LSB: Startup/shutdown script for GDRcopy kernel-mode driver...
> ...
> Mar 22 16:42:18 compute-node-747 slurm[8592]: Running /etc/slurm/epilog.d/50-exclusive-gpu ...
> Mar 22 16:42:18 compute-node-747 slurm[8623]: Draining node -- Failed to reset GPU power levels
> Mar 22 16:42:18 login-node slurmctld[11735]: update_node: node compute-node-747 reason set to: Failed to reset GPU power levels
> Mar 22 16:42:18 login-node slurmctld[11735]: update_node: node compute-node-747 state set to DRAINED

So, there are two bug reports here:

1. Killing the srun should delete the job from the queue to avoid scheduling issues. This can happen at any site and is not related to node_features plugins. It's bad from a cluster occupancy point of view, but at least it cleans up nicely.

2. There is a race condition between rebooting a node for a node_features plugin and running the epilog for a cancelled job. This has horrendous implications for our site because it can leave the node in a broken state. But it only applies for sites which use node_features plugins, as far as I can see.
Comment 3 Tim McMullan 2021-03-23 12:47:13 MDT
Looking through both of your comments, I agree with Luke's statement that there are really 2 issues here.

Would you mind splitting the node_features plugin + epilog related issue into a separate ticket just to keep the histories clean?  I'll look into what is happening with srun in this ticket.

I was able to reproduce the behavior with srun; I'm currently looking into exactly what is happening here and why.

Thanks!
--Tim
Comment 4 Luke Yeager 2021-03-23 13:09:13 MDT
(In reply to Tim McMullan from comment #3)
> Would you mind splitting the node_features plugin + epilog related issue
> into a separate ticket just to keep the histories clean?  I'll look into
> what is happening with srun in this ticket.
Sure, no problem. Bug#11182.
Comment 6 Felix Abecassis 2021-03-25 18:21:34 MDT
Answering here on a comment raised in https://bugs.schedmd.com/show_bug.cgi?id=11182

> So just avoid sending SIGKILL to an srun and you should be good to go.

So, SIGKILL was just an example; the same situation can arise in other ways. And even when it is indeed a SIGKILL, we can't prevent our users from sending SIGKILL to their own processes. So it would be great if the job could be cancelled earlier than it is today.
Comment 8 Tim McMullan 2021-04-01 06:50:25 MDT
I've been able to reproduce this and I've been looking around in the code surrounding it.

There is something related to this idea that is controlled by "InactiveLimit" in slurm.conf; however, it is only meant for running jobs, and something totally new would likely need to be written to do this for scheduled jobs. I don't think setting InactiveLimit would help in your case either, since slurmctld discovers that the srun died (due to failed I/O) basically immediately once the job starts.
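For reference, the current value can be checked with the standard config query (output below is illustrative; 0, the default, means the limit is disabled):
$ scontrol show config | grep -i InactiveLimit
InactiveLimit           = 0 sec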

I think that, at this time, checking on a scheduled but not yet allocated srun would likely be considered an enhancement, but I am going to discuss this more internally before I can say that for sure.

Thanks!
--Tim
Comment 9 Felix Abecassis 2021-04-01 09:50:39 MDT
Thanks Tim!

Would it be easier to just patch the Stdout of the job?
e.g. setting Stdout/Stderr to the default of "slurm-%j.out", instead of cancelling the job?
Comment 10 Tim McMullan 2021-04-07 05:50:26 MDT
That might be possible to do, but it really sounds to me like sbatch might just be the right solution. Using srun directly certainly works, but a job that could take days to be scheduled and whose output goes to a file feels like it should just be a batch job. sbatch has a "--wrap" option which can make life a little easier when running a single command as a batch script, too.
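For example, something like this would submit the same 90-node request as a batch job whose output survives the loss of the login-node session (illustrative; "my_app" stands in for whatever command the srun would have launched):
$ sbatch -N90 -t 60 -o job.log --wrap="srun my_app"
--wrap simply generates a minimal batch script around the given command line.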

Is there something that makes sbatch not work for this?

Thanks!
--Tim
Comment 11 Felix Abecassis 2021-04-07 10:04:02 MDT
> That might be possible to do, but it really sounds to me like sbatch might just be the right solution. Using srun directly certainly works, but a job that could take days to be scheduled and whose output goes to a file feels like it should just be a batch job.

I agree, and we do recommend that our users use sbatch for most of their workloads. But there are some use cases where srun is useful, and we can't prevent users from using srun for a large job (can we?).

What I'm suggesting is effectively an automatic "promotion" from an interactive (non-pty) srun to a batch job, only for those jobs that we know are going to fail immediately because the receiving process is gone. This would avoid wasting resources, since the job would then run to completion instead of failing right after the prolog.

I figured that patching the job when it starts might be simpler than figuring out in advance when the job can be safely cancelled, as initially requested.
Comment 12 Felix Abecassis 2021-04-07 10:06:19 MDT
Note: I would still prefer to have the job cancelled early instead of being promoted to a batch job when it is being started.
Comment 13 Tim McMullan 2021-04-09 08:13:51 MDT
I can see the utility in something like this, but right now it's operating as intended, so I think it would be an enhancement to change the behavior.

Preventing the user from submitting a large srun outside of an allocation is an interesting idea, and it may be possible using a cli filter plugin.  With cli_filter/lua you can detect that the command was srun with 'args.type == "srun"', then check whether you are in an allocation by checking if "SLURM_JOB_ID" exists in the user's environment.  If it doesn't, you could reject an srun that is requesting 90 nodes before it ever gets submitted... but still allow an sbatch to get scheduled normally.
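To illustrate the allocation check (the job id below is made up): SLURM_JOB_ID is only set inside an allocation, which is the signal such a filter or wrapper could key on:
$ echo ${SLURM_JOB_ID:-not set}     # on the login node, outside any allocation
not set
$ salloc -N1
salloc: Granted job allocation 12345
$ echo ${SLURM_JOB_ID:-not set}     # inside the allocation
12345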

Would something like that work?

Thanks!
--Tim
Comment 14 Felix Abecassis 2021-04-09 09:47:47 MDT
> I can see the utility in something like this, but right now it's operating as intended, so I think it would be an enhancement to change the behavior.

Could you clarify which part is working as intended?
For now, let's disregard the suggestion to inject a Stdout path for these jobs: do we agree that the behavior reported initially in this thread is a bug?


> [...]
> Would something like that work?

It could work fine as a stopgap measure, but we already have a good amount of logic in our job_submit Lua script; hence we would prefer to have a fix in Slurm instead.
Comment 16 Tim McMullan 2021-05-14 11:52:37 MDT
I'm sorry about the delay on this, but I did some chatting internally about what we want to do here.

We are going to consider this an enhancement for now, particularly since any change we might make here would have to wait for a new stable release.

I am taking a look at what exactly would be required to alter the behavior here though!

Thanks,
--Tim
Comment 17 Felix Abecassis 2021-05-14 12:45:47 MDT
Thanks for the update, Tim!
Comment 18 Felix Abecassis 2021-08-25 15:46:55 MDT
Following our meeting today, we would like to prevent slurmd from running the prolog in this situation and, if possible, also prevent slurmd from rebooting the node (for a node feature change).

We understand that we don't want slurmctld to continually monitor the state of the client processes.