Hi there,

We had a separate issue with backfilling on Cori where it constantly complained about too many RPCs and was yielding locks. I restarted slurmctld at one point and then it crashed; the last thing it logged was:

[2019-07-31T14:19:44.234] Killing JobId=23367113 StepId=Extern on failed node nid00956

Restarting it again had it crash again, once more after:

[2019-07-31T14:22:46.246] Killing JobId=23367113 StepId=Extern on failed node nid00956

I'll attach a core file now (the second overwrote the first).

Thanks,
Chris
Hi Chris,

Would you send in a backtrace for us to review?

thread apply all bt full
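For reference, that backtrace can be generated non-interactively with gdb's batch mode. This is only a sketch: the binary and core file paths below are illustrative and will differ on your installation.

```shell
# Illustrative paths only - substitute your actual slurmctld binary and
# core file locations. -batch runs the -ex commands and exits, so the
# output can be redirected straight into a file to attach to the ticket.
gdb -batch \
    -ex 'thread apply all bt full' \
    /usr/sbin/slurmctld /var/spool/slurmctld/core > slurmctld-bt.txt
```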
Thanks, we're looking at the backtrace. Can you also upload a slurmctld log file?
Yeah, I don't care about those. I care about the period from when the job started running to the last mention of it in the logs.
(In reply to Nate Rini from comment #32)
> (In reply to Chris Samuel (NERSC) from comment #30)
> > Re resizing - the usual reason are people who select a heap of nodes, check
> > for available huge pages on the node and discard more than their required
> > threshold, shrinking their job down to just the ones with enough.
>
> Do they use `--exclude=` in their srun calls or do they call scontrol to
> resize the jobs explicitly?
>
> (In reply to Chris Samuel (NERSC) from comment #31)
> > But "Hi Nate" too...
>
> :D

To extend this question, do you have a specific example of (1) the full job submission script/command line, and (2) the exact command used to update the job? In particular, the job submission for job 23367113 would be helpful.
Hey Chris,

We're still looking at the problem and have made some progress. I don't have a complete answer yet, though, so I'll hold off until I do.

However, a more complete slurmctld log file (comment 21 and comment 22) and the answers to the questions in comment 33 would be helpful. If you don't know those answers, then just the log file would be fine.

Actually, I partially retract my statement in comment 22 about only caring about when the job started running: I would like to see the job submission and then when the job started running. I don't care about all the lines from the scheduler saying that the job is still pending, but I do care about any other types of log statements. A grep of the log for that job id will probably be good enough for now.
I've been working on a set of patches to fix some issues I found with job resizing and --no-kill. Basically, if you remove the batch host from the job, Slurm never kills the job. Slurm tries to kill the batch step, but it doesn't get killed all the way in slurmctld, so slurmctld thinks the batch step is still running even though it isn't. My patches will prevent this from happening.

However, I haven't been able to reproduce the abort that you hit - I don't know why the batch step's step_node_bitmap was NULL - so I'd like to continue working on that. I'm concerned that if my patches don't prevent that from happening, they'll just introduce more places where slurmctld could segfault (because I'm using step_node_bitmap). If you could get me the logs and other info that I requested earlier, that would help. It's possible that my patches prevent this situation from ever happening, because they'll reject attempts to remove the batch host from jobs, but I can't say for sure without knowing how step_node_bitmap became NULL for the batch step in the first place.

- Marshall
Update - I've made good progress on patches, but they're not quite done.
I also think I'll have working patches ready for review today that will at least prevent some bad behaviors with job resizing. They should prevent the segfault where it happened, but I still don't know how the batch step's step_node_bitmap was NULL in the first place.
*** Ticket 7519 has been marked as a duplicate of this ticket. ***
We have pushed a commit that prevents users from removing the batch host from their jobs when resizing them. This commit will be in 19.05.3. If users try to remove the batch host when resizing a job, they will receive the error message "Invalid node name specified for job ####". A more descriptive error message will be in the slurmctld log file:

error("%s: Batch host %s for %pJ is not in the requested node list %s. You cannot remove the batch host from a job when resizing.",
      __func__, job_ptr->batch_host, job_ptr, job_specs->req_nodes);

I will see if I can make the error message sent to the user more descriptive.

Since I still haven't verified exactly how the batch step's step_node_bitmap became NULL, this commit isn't necessarily a direct fix for the segfault that happened. However, it should (hopefully) close off the path that led to the segfault.

There are more issues that we're looking at, so I'm not closing this ticket yet. However, in light of this fix, do you still consider this a sev-2 issue, or can we drop it down to a sev-3?

commit 18279136e3f8e4b0c7ba8a4a09558041a8e5812e (HEAD -> slurm-19.05, origin/slurm-19.05, bug7499)
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Wed Sep 25 09:36:18 2019 -0600
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Sep 26 09:29:54 2019 -0600

    Do not remove batch host when resizing/shrinking a batch job

    Bug 7499
Hi Marshall,

(In reply to Marshall Garey from comment #63)
> We have pushed a commit that prevents users from removing the batch host
> from their jobs when resizing their jobs. This commit will be in 19.05.3.

Thanks so much for that! I'll look at it next week.

Do we have an idea when 19.05.3 might land?

I'll drop the priority on this as we have the protective patch in place.

All the best,
Chris
(In reply to Chris Samuel (NERSC) from comment #66)
> Do we have an idea when 19.05.3 might land?

19.05.3 is scheduled to be released next Thursday. However, there are a few high priority bugs that we need to complete for 19.05.3, so if they don't get done we will delay the release.
Hi Chris,

I'm still trying to figure out how step_node_bitmap and step_layout could be NULL for a running job's batch step. One thing that would really help is a grep of the slurmctld log file for job id 23367113 (comment 18). Feel free to upload every instance of it in the log, even the lines that are just the main and backfill schedulers looking at the job before it ran - I just want to make this easier and faster for you, and those are easy enough for me to filter out. I can mark the attachment as private after you upload it.
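For what it's worth, an extract like that is just a plain grep on the job id. The sketch below runs against a tiny stand-in log file so it is self-contained; on a real system you would point the grep at wherever SlurmctldLogFile lives (the sample lines and filenames here are made up for illustration).

```shell
# Build a tiny stand-in log so this sketch is self-contained; on a real
# controller you would grep the actual SlurmctldLogFile instead.
cat > slurmctld.log.sample <<'EOF'
[2019-07-31T14:10:01.000] sched: Allocate JobId=23367113 NodeList=nid00956
[2019-07-31T14:10:02.000] sched: Allocate JobId=99999999 NodeList=nid00001
[2019-07-31T14:19:44.234] Killing JobId=23367113 StepId=Extern on failed node nid00956
EOF

# Pull every line mentioning the job of interest, timestamps included,
# into a file that can be attached to the ticket:
grep 'JobId=23367113' slurmctld.log.sample > job23367113.log
cat job23367113.log
```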
I didn't mean to send an email update to everyone when adding 8011 to "see also". I added it as another data point, since we saw this error in the log:

> [2019-10-28T20:35:25.632] error: _pick_step_nodes: JobId=29764246 StepId=Batch has no step_node_bitmap
Hi Marshall,

Sorry for missing the previous request. Once we've got our systems back from the latest PSPS here, I'll go hunting for that.

All the best,
Chris
Chris,

I uploaded the slurmctld log file that you emailed me to this bug as a private attachment. I was able to reproduce this bug because of the log file you sent - thanks! It's rather straightforward:

* sbatch -N2 -wd9,d10 --wrap="sleep 10; srun sleep 789;"
* Restart slurmctld *before* the srun happens
* The batch step's step_node_bitmap will be NULL
* The step will start
* scontrol update jobid=### nodelist=<don't include the batch host>
* slurmctld will crash

So, my patch that landed in 19.05.3 will prevent slurmctld from trying to update the job to exclude the batch host, which will at least prevent that crash from happening. However, the batch step's step_node_bitmap is still NULL, so that's still a bug - there may be another crash waiting to happen if that situation comes up again. Now that I have an easy reproducer, I should be able to find a fix for it.

- Marshall
Hi Marshall, That's great to hear, thank you! All the best, Chris
Closing as resolved/fixed.

This has been fixed by the following commits, all in 20.02. They can't go into 19.05 because they require a change to the protocol for StateSaveLocation.

commit 71531fa500025f9e33d1a269e130ff3df50c6203 (HEAD -> master, origin/master, origin/HEAD, bug7499)
Author:     Brian Christiansen <brian@schedmd.com>
AuthorDate: Tue Jan 28 14:06:42 2020 -0700
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Wed Jan 29 10:52:06 2020 -0700

    Pack switch_job if not NULL regardless of step type

    Bug 7499

commit df2838bf1578986574f05f80a5effbfa323fdb49
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Jan 28 14:06:42 2020 -0700
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Wed Jan 29 10:52:06 2020 -0700

    Pack step_layout regardless of type

    The batch step has a step_layout. pack_slurm_step_layout() will only
    pack step_layout if it's not NULL.

    Bug 7499

commit 1d296cc257ff0f6900973ae2c4ed4c81b398c72e
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Jan 28 14:06:42 2020 -0700
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Wed Jan 29 10:52:06 2020 -0700

    Delete step record regardless of type if step_node_bitmap is NULL

    The batch step had a NULL step_node_bitmap and wasn't deleted.

    Bug 7499

And here's the commit that landed in 19.05.3 (a while ago now):

commit 18279136e3f8e4b0c7ba8a4a09558041a8e5812e (HEAD -> slurm-19.05, origin/slurm-19.05, bug7499)
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Wed Sep 25 09:36:18 2019 -0600
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Sep 26 09:29:54 2019 -0600

    Do not remove batch host when resizing/shrinking a batch job

    Bug 7499
Wonderful, thanks Marshall!
*** Ticket 8135 has been marked as a duplicate of this ticket. ***
Opening this ticket to public.