(Originally reported at bug#11175)
(Details about our node_features plugin are at bug#9567)

The issue that I've observed is this:

1. Slurm recognizes that a reboot is required in order to fulfill the constraint of the job. It calls node_features_p_node_set() for our plugin before rebooting the node.
2. Slurm recognizes that the srun process has been killed and that the job should be cancelled. It kicks off the epilog (why?). Bizarrely, the prolog never runs, just the epilog.
3. There is a race condition between these two tasks. Sometimes, the plugin finishes node_features_p_node_set() and then initiates the reboot before the epilog starts. In this case, the epilog fails with all sorts of errors because the node is in the process of shutting down. Sometimes, the epilog runs while node_features_p_node_set() is still running. In our case, the plugin uninstalls and reinstalls the nvidia driver. Depending on which task happens first, sometimes this causes the epilog to fail, and sometimes it causes the driver reinstall to fail (which is the worst outcome of all, because the node reboots without a driver installed).

Here are the relevant snippets of the slurmctld and slurmd logs for one of these failure modes:

> Mar 22 16:42:07 login-node slurmctld[11735]: JobId=98975 nhosts:1 ncpus:1 node_req:64000 nodes=compute-node-747
> Mar 22 16:42:07 login-node slurmctld[11735]: reboot_job_nodes: reboot nodes compute-node-747 features nvidia_driver=XXX
> Mar 22 16:42:07 login-node slurmctld[11735]: sched/backfill: _start_job: Started JobId=98975 in batch on compute-node-747
> Mar 22 16:42:07 compute-node-747 slurmd[3840]: Node reboot request with features nvidia_driver=XXX being processed
> Mar 22 16:42:07 compute-node-747 systemd[1]: Stopping LSB: Startup/shutdown script for GDRcopy kernel-mode driver...
> ...
> Mar 22 16:42:18 compute-node-747 slurm[8592]: Running /etc/slurm/epilog.d/50-exclusive-gpu ...
> Mar 22 16:42:18 compute-node-747 slurm[8623]: Draining node -- Failed to reset GPU power levels
> Mar 22 16:42:18 login-node slurmctld[11735]: update_node: node compute-node-747 reason set to: Failed to reset GPU power levels
> Mar 22 16:42:18 login-node slurmctld[11735]: update_node: node compute-node-747 state set to DRAINED$
Luke,

Is v2 from bug 9567 current? If not, could you upload the most recent version of your node features plugin to bug 9567?

Also, from bug 9567 comment 0:
> d) We can't share our specific helper binary for NPS. I can either share my
> "fake" helper binaries that I used for development, or do a simple helper
> binary for a generic feature requiring a reboot (e.g. changing a Linux
> kernel boot parameter).

Could you share both of these to bug 9567? That should help me reproduce the race condition.

Thanks!
-Michael
(In reply to Michael Hinton from comment #1)
> Luke,
>
> Is v2 from bug 9567 current? If not, could you upload the most recent
> version of your node features plugin to bug 9567?
Yes, it is.

> Could you share both of these to bug 9567? That should help me reproduce the
> race condition.
You should be able to just take one of the example helpers (e.g. the "amd_rome" one - none of them actually make any real changes at the moment, they just edit a file and pretend like that did something useful) and add a 'sleep 60' somewhere to simulate some slow work.

https://gitlab.com/nvidia/dlsyseng/slurm/-/commit/e89d0897d94f633361a8850dbfe54deeb7080bd2
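For illustration, a minimal sketch of such a stand-in helper with the suggested sleep. Everything here is a placeholder, not the real amd_rome example: the state-file path is made up, and the query/set calling convention (no arguments to report the active feature, "feature=value" to change it) is assumed from the wrapper invocation visible in the slurmd logs later in this ticket.

#!/bin/bash
# Hypothetical stand-in helper: it only records the requested feature in a
# state file and sleeps to widen the race window, as suggested above.
STATE=/tmp/fake_feature_state        # placeholder path

if [ $# -eq 0 ]; then
    # Assumed query mode: report the currently "active" feature, if any.
    cat "$STATE" 2>/dev/null
else
    # Set mode: invoked as "helper feature=value" (this matches the
    # "nvidia_driver_wrapper.sh nvidia_driver=XXX" call seen in the logs).
    echo "$1" > "$STATE"
    sleep 60                         # simulate slow work (e.g. a driver reinstall)
fi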
Side note: In node_features_p_job_valid(), there is an unused `char *pos;` variable that makes the build fail when build warnings are treated as build errors.
Hey Luke,

Can you provide more robust logs? E.g., I don't see evidence of an srun process being killed or a job being cancelled. Are there jobs already running on the node when the job gets scheduled? In my testing, a new job doesn't get killed or cancelled (that I can see) if it initiates a node reboot via a node feature, and the prolog only runs after the node reboots.

Could you provide your prolog and epilog scripts, or at least describe what they are doing and how long they take? Can you provide your current slurm.conf? Can you provide the command used to start the job? Can you also say how long the reboot process takes, as well as how long node_features_p_node_set() takes?

(In reply to Luke Yeager from comment #0)
> > Mar 22 16:42:07 login-node slurmctld[11735]: JobId=98975 nhosts:1 ncpus:1 node_req:64000 nodes=compute-node-747
> > Mar 22 16:42:07 login-node slurmctld[11735]: reboot_job_nodes: reboot nodes compute-node-747 features nvidia_driver=XXX
> > Mar 22 16:42:07 login-node slurmctld[11735]: sched/backfill: _start_job: Started JobId=98975 in batch on compute-node-747
> > Mar 22 16:42:07 compute-node-747 slurmd[3840]: Node reboot request with features nvidia_driver=XXX being processed
It looks like a job was started by the backfill scheduler - is this the same job that initiated the reboot?

I've been playing around with NodeFeaturesHelpers and I think I got it working as expected, but I still haven't been able to reproduce the issue by adding in delays. So I need more information.

Thanks!
-Michael
(In reply to Michael Hinton from comment #4)
> Can you provide more robust logs?
Sure!

Command to reproduce (make sure the --constraint forces a reboot):
> srun -C foo=bar --begin=now+1minutes hostname & sleep 1 ; kill -9 $!

slurmctld log:
1. Issue reboot
2. Realize the srun process is dead, issue cancel
3. Receive drain

> Mar 24 16:56:28 login-node slurmctld[45435]: JobId=104187 nhosts:1 ncpus:1 node_req:64000 nodes=compute-node-735
> Mar 24 16:56:28 login-node slurmctld[45435]: reboot_job_nodes: reboot nodes compute-node-735 features nvidia_driver=450.80
> Mar 24 16:56:28 login-node slurmctld[45435]: sched/backfill: _start_job: Started JobId=104187 in batch on compute-node-735
> Mar 24 16:56:39 login-node slurmctld[45435]: Killing interactive JobId=104187: Communication connection failure
> Mar 24 16:56:39 login-node slurmctld[45435]: _job_complete: JobId=104187 WEXITSTATUS 1
> Mar 24 16:56:39 login-node slurmctld[45435]: _job_complete: JobId=104187 done
> Mar 24 16:56:41 login-node slurmctld[45435]: update_node: node compute-node-735 reason set to: nvidia module not loaded
> Mar 24 16:56:41 login-node slurmctld[45435]: update_node: node compute-node-735 state set to DRAINED$
> Mar 24 16:56:42 login-node slurmctld[45435]: error: job_epilog_complete: JobId=104187 epilog error on compute-node-735, draining the node
> Mar 24 16:56:42 login-node slurmctld[45435]: error: _slurm_rpc_epilog_complete: epilog error JobId=104187 Node=compute-node-735 Err=Job epilog failed

slurmd log:
1. Receive reboot request, use node_features/helpers plugin to change the driver version (first stop services, then uninstall, then reinstall new version)
2. Receive cancel request, start running epilog
3. Epilog fails, send DRAIN message
4. Plugin finishes (it fails but doesn't matter) and then reboots the node

> Mar 24 16:56:28 compute-node-735 slurmd[3933616]: Node reboot request with features nvidia_driver=XXX being processed
> Mar 24 16:56:28 compute-node-735 slurmd[3933616]: node_features/helpers: run_command: executing "/etc/slurm/node_features_helpers/nvidia_driver_wrapper.sh nvidia_driver=XXX"
> Mar 24 16:56:28 compute-node-735 nvidia-persistenced[3055]: Received signal 15
> Mar 24 16:56:30 compute-node-735 systemd[1]: Stopping LSB: Activates/Deactivates nv_peer_mem to start at boot time....
> Mar 24 16:56:31 compute-node-735 systemd[1]: Stopping LSB: Startup/shutdown script for GDRcopy kernel-mode driver...
> Mar 24 16:56:36 compute-node-735 kernel: nvidia-uvm: Unloaded the UVM driver.
> Mar 24 16:56:39 compute-node-735 slurm-epilog[4049225]: START user=lyeager job=104187
> Mar 24 16:56:39 compute-node-735 slurm[4049268]: Running /etc/slurm/epilog.d/40-lastuserjob-processes ...
> Mar 24 16:56:39 compute-node-735 slurm[4049277]: Running /etc/slurm/epilog.d/50-exclusive-gpu ...
> Mar 24 16:56:39 compute-node-735 slurm[4049285]: Running /etc/slurm/epilog.d/60-exclusive-cpu ...
> Mar 24 16:56:39 compute-node-735 slurm[4049300]: Running /etc/slurm/epilog.d/80-exclusive-tmpfiles ...
> Mar 24 16:56:39 compute-node-735 slurm[4049321]: Running /etc/slurm/epilog.d/81-exclusive-scratchfiles ...
> Mar 24 16:56:39 compute-node-735 slurm[4049332]: Running /etc/slurm/epilog.d/90-all-metrics ...
> Mar 24 16:56:40 compute-node-735 slurm[4049577]: Running /etc/slurm/epilog.d/95-all-healthcheck ...
> Mar 24 16:56:40 compute-node-735 slurm[4049582]: Draining node -- nvidia module not loaded
> Mar 24 16:56:41 compute-node-735 slurm[4049584]: [ERROR] /etc/slurm/epilog.d/95-all-healthcheck failed. Check the log at /var/log/slurm/prolog-epilog for more details.
> Mar 24 16:56:41 compute-node-735 slurm[4049589]: One or more scripts failed.
> Mar 24 16:56:41 compute-node-735 slurmd[3933616]: error: /etc/slurm/epilog.sh: exited with status 0x0100
> Mar 24 16:56:41 compute-node-735 slurmd[3933616]: error: [job 104187] epilog failed status=1:0
> Mar 24 16:56:42 compute-node-735 slurmd[3933616]: error: node_features/helpers: run_command: command "/etc/slurm/node_features_helpers/nvidia_driver_wrapper.sh nvidia_driver=XXX" returned with exit status: 3840
> Mar 24 16:56:42 compute-node-735 slurmd[3933616]: error: node_features/helpers: feature_set_value: failed to set new value for feature: nvidia_driver
> Mar 24 16:56:42 compute-node-735 systemd[1]: Stopping SLURM compute node daemon...
> Mar 24 16:56:42 compute-node-735 systemd[1]: Stopping MUNGE authentication service...
> Mar 24 16:56:49 compute-node-735 systemd[1]: Reached target Shutdown.
> Mar 24 16:56:49 compute-node-735 systemd[1]: Reached target Final Step.
> Mar 24 16:56:49 compute-node-735 systemd[1]: systemd-reboot.service: Succeeded.
> Mar 24 16:56:49 compute-node-735 systemd[1]: Finished Reboot.
> Mar 24 16:56:49 compute-node-735 systemd[1]: Reached target Reboot.
> Mar 24 16:56:49 compute-node-735 systemd[1]: Shutting down.
> Mar 24 16:56:49 compute-node-735 systemd-shutdown[1]: Syncing filesystems and block devices.
> Mar 24 16:56:49 compute-node-735 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
> Mar 24 16:56:49 compute-node-735 systemd-journald[1145]: Journal stopped
> -- Reboot --
> Mar 24 17:00:28 compute-node-735 kernel: Linux version XXX
I'm able to reproduce the epilog-only behavior with just the following command:

srun --begin=now+1minutes hostname & sleep 1 ; kill -9 $!

Note the absence of constraints (I also commented out NodeFeaturesPlugins in slurm.conf beforehand).

However, these workflows all work:

Ctrl-C in foreground:
$ srun --begin=now+1minutes hostname
srun: job 14454 queued and waiting for resources
^Csrun: Job allocation 14454 has been revoked
srun: Force Terminated job 14454

Send SIGTERM instead of SIGKILL:
$ srun --begin=now+1minutes hostname & sleep 1 ; kill $!
[1] 2510800
srun: job 14455 queued and waiting for resources
srun: Job allocation 14455 has been revoked
srun: Force Terminated job 14455

Use scancel:
$ srun --begin=now+1minutes hostname &
[1] 2511714
$ srun: job 14456 queued and waiting for resources
$ squeue
    JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
    14456     debug hostname  hintron PD  0:00      1 (BeginTime)
$ scancel 14456
srun: Job has been cancelled
srun: Force Terminated job 14456
srun: error: Unable to allocate resources: Job/step already completing or completed
[1]+  Exit 1    srun --begin=now+1minutes hostname

The real issue here is that you are sending SIGKILL to srun, so it can't clean up properly. SIGKILL cannot be intercepted and handled. So it's no surprise that you get weird results, where the job still gets scheduled by backfill and runs, and only the epilog runs, not the prolog. The controller realized that srun was no longer around, breaking the communication connection, hence the "Killing interactive JobId=14460: Communication connection failure" error.

So just avoid sending SIGKILL to an srun and you should be good to go. This has nothing to do with NodeFeatures. But if you think there is something more to this that you want us to pursue, feel free to open a separate bug on this.

------------------------------

Regarding NodeFeatures, it's clear that your epilog is failing a health check because you unloaded the NVIDIA driver via NodeFeaturesHelpers *beforehand*. But I guess the issue is that the epilog shouldn't be running at all, right? In that case, avoiding SIGKILL as mentioned above should fix the issue.

But if you want to be bulletproof, then I would have your NodeFeaturesPlugin create some kind of file that your Prologs and Epilogs could access to indicate that a reboot is imminent and that nvidia modules were uninstalled, so skip driver health checks (or just skip the prologs/epilogs completely). Then, on reboot, have a startup script remove that file. In essence, you would be creating a lock. This would cover the case when a user accidentally SIGKILL'ed something.

Let me know if that works for you.

Thanks!
-Michael
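To make the lock-file idea above concrete, here is a rough sketch with hypothetical paths; it assumes the flag can be created by whatever the node-features plugin ultimately runs (e.g. the helper script) right before it starts tearing the driver down.

# In the node-features helper, before uninstalling the driver:
touch /run/slurm-reboot-pending      # placeholder path

# At the top of each prolog.d/epilog.d script:
if [ -e /run/slurm-reboot-pending ]; then
    logger "reboot pending; skipping health checks"
    exit 0
fi

# In a boot-time script, once the node is back up:
rm -f /run/slurm-reboot-pending

Putting the flag somewhere like /run (a tmpfs on most systemd-based distributions) has the nice property that it disappears on reboot even if the explicit cleanup step never runs.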
Also, hi Doug! Just realized you were CC'd. Hope things are going well :)
> Note the absence of constraints (I also commented out NodeFeaturesPlugins in slurm.conf beforehand).
If you remove the constraint, then it's https://bugs.schedmd.com/show_bug.cgi?id=11175. We created this new bug per Tim's request. Can we focus on the interaction with the node feature plugin here?

> So just avoid sending SIGKILL to an srun and you should be good to go.
Please take a look at the original bug: sending SIGKILL is just a minimal repro to show the problem. This issue also happened when our login node was rebooted.

> But if you want to be bulletproof, then I would have your NodeFeaturesPlugin create some kind of file that your Prologs and Epilogs could access to indicate that a reboot is imminent and that nvidia modules were uninstalled, so skip driver health checks (or just skip the prologs/epilogs completely).
Shouldn't this locking be handled by Slurm? Running an epilog during a reboot can cause failures beyond the NVIDIA device driver. For instance, network filesystems like NFS or Lustre might be getting unmounted concurrently with the epilog running.
Hi Michael! Things are good, hope things are good with you too. Thanks for your help with this.
(In reply to Felix Abecassis from comment #8)
> Shouldn't this locking be handled by Slurm? Running an epilog during a
> reboot can cause failures beyond the NVIDIA device driver. For instance,
> network filesystems like NFS or Lustre might be getting unmounted
> concurrently with the epilog running.
I'll need to research this some more to see what makes sense. But my initial reaction is that prologs and epilogs are all custom code - there is no guarantee what people will use them for (even if a majority of people use them for node health checks). If we automatically disallowed running them while a node was marked for reboot, it's possible it could break existing prologs or epilogs for people.

You still have a valid point, though - I'll talk internally to see what others think about the issue. I bet we could create an "opt-in" parameter in slurm.conf that does this reboot locking automatically.

But at the very least, I have provided a workaround that should solve the issue for now, so just let me know if that works for you.

Thanks,
-Michael

(In reply to Douglas Wightman from comment #9)
> Hi Michael! Things are good, hope things are good with you too. Thanks for
> your help with this.
No problem!
It sounds like you are doing all your node health checks in the prolog and epilog. Have you considered moving those checks into a HealthCheckProgram? If I remember correctly, I think the HealthCheckProgram isn't run when the node is marked for reboot, which might also solve the problem.
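For reference, a HealthCheckProgram is wired up with a few slurm.conf lines like the following. The script path is just an example, and whether checks are actually skipped for nodes pending a reboot should be verified against your Slurm version.

HealthCheckProgram=/usr/sbin/nhc     # example: LBNL Node Health Check, or any site script
HealthCheckInterval=300              # seconds between runs on each node
HealthCheckNodeState=ANY             # which node states get checked; see slurm.conf(5)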
> It sounds like you are doing all your node health checks in the prolog and epilog. Have you considered moving those checks into a HealthCheckProgram? If I remember correctly, I think the HealthCheckProgram isn't run when the node is marked for reboot, which might also solve the problem.
We use the epilog for basic checks and node cleanup. We also have more extensive NHC scripts. But here it's not the NHC that is failing. In one instance we had an epilog checking for leftover user tasks that failed because it couldn't "pgrep -u $SLURM_JOB_USER" since NIS was stopped. In another instance, an epilog cleaning up /raid/scratch/* failed because LVM was stopped.

> If we automatically disallowed running them while a node was marked for reboot, it's possible it could break existing prologs or epilogs for people.
I understand the need for compatibility, but regardless of what the epilog is doing, I believe the node might effectively reboot while an epilog is still running. So even if the epilog is very careful about which filesystems are mounted and which kernel modules are loaded, any non-trivial epilog (i.e. one taking a few seconds to execute) risks getting interrupted by the reboot. I don't think that will cause a node to drain, but it can certainly cause data corruption.

I would also argue that the epilog should not be too lenient in its checks, otherwise you are not going to be able to detect broken nodes anymore (e.g. /raid/scratch was unmounted, but outside of a reboot request).
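As an illustration of the first failure mode described above, a leftover-process check along these lines (hypothetical, not the actual epilog.d/40-lastuserjob-processes script) fails as soon as the user name can no longer be resolved:

#!/bin/bash
# Hypothetical sketch of a leftover-process check run from the epilog.
# pgrep exits 0 if matching processes exist, 1 if none, and with a higher
# status on errors such as an unresolvable user name (e.g. NIS is down).
leftover=$(pgrep -u "$SLURM_JOB_USER")
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "leftover processes for $SLURM_JOB_USER: $leftover" >&2
    exit 1      # leftover user tasks -> fail the epilog
elif [ "$rc" -gt 1 ]; then
    echo "could not check processes for $SLURM_JOB_USER" >&2
    exit 1      # user lookup failed -> also fails the epilog (the NIS case above)
fi
exit 0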
I meant to say that it's not *only* the NHC that is failing. All bets are off when the reboot is initiated, so pretty much all our epilogs can possibly fail.
Felix,

The feedback I got is that this sounds like a reasonable request, but that this is not a change that will make it into 20.11. I'm going to go ahead and convert this ticket into an enhancement, since there is a workaround available. To refresh, here is the workaround:

(In reply to Michael Hinton from comment #6)
> Have your
> NodeFeaturesPlugin create some kind of file that your Prologs and Epilogs
> could access to indicate that a reboot is imminent and that nvidia modules
> were uninstalled, so skip driver health checks (or just skip the
> prologs/epilogs completely). Then, on reboot, have a startup script remove
> that file. In essence, you would be creating a lock.

If you want to guarantee that this will make it into a future release of Slurm, then we can discuss sponsorship. You are also welcome to submit a code contribution for consideration.

Thanks!
-Michael
Hi Luke - after reviewing this bug we think there may be more to the issue that warrants further investigation. I am going to have Skyler from our support team work with you on this. Please give him some time to review Michael's notes. He will then let you know what he finds.
Hello Luke, Felix, and Douglas,

I will be solely focusing on the node_features and reboot situation.

When a constrained job is scheduled to a node that has the inactive constraint, a reboot RPC is heard on the node; this causes the node_features plugin to be called, then the reboot program is executed. Moreover, if the node_feature plugin code is synchronous, then there should not be an issue with a reboot interrupting the node_feature. Conversely, if the node_feature plugin contains code that does not terminate by the end of the node_feature plugin dispatch, then it may be interrupted by the reboot that follows. This seems to be working as intended.

Side note: if the node_feature plugin does nontrivial work, you will probably see a warning like `Warning: Note very large processing time from node_features_g_node_state`.

Killing the job with SIGKILL, as discussed in comment #6, does produce the epilog bug. This behavior may be better addressed in bug 11175.

Best,
Skyler
> Moreover, if the node_feature plugin code is synchronous, then there should not be an issue with a reboot interrupting the node_feature. Conversely, if the node_feature plugin contains code that does not terminate by the end of the node_feature plugin dispatch, then it may be interrupted by the reboot that follows. This seems to be working as intended.
Our node_feature code is fully synchronous, and we don't have any indication that this particular behavior is currently an issue: I can confirm the reboot always happens after the node_features plugin returns. But this is not what is being discussed here.
Scratch Comment #18. I was following a faulty train of thought that totally missed the mark. I regret the comment. I am able to reproduce the race condition with node features and I am still investigating the source of the issue. I wish I had more to report in light of my previous comment.
No worries, Skyler! I appreciate the transparency.
Quick update: After reviewing the ticket and internal discussion, we will be addressing the issue and targeting a general release in 21.08. I am currently working on the patch. Thank you for your patience.
Hello Luke et al., The patch has been merged into 21.08pre1. Please look forward to the stable release of 21.08. Best, Skyler