Hi there,

We had a separate issue with backfilling on Cori where it constantly complained about too many RPCs and was yielding locks. I restarted slurmctld at one point and then it crashed; the last thing it logged was:

[2019-07-31T14:19:44.234] Killing JobId=23367113 StepId=Extern on failed node nid00956

Restarting it again had it crash again, once more after:

[2019-07-31T14:22:46.246] Killing JobId=23367113 StepId=Extern on failed node nid00956

I'll attach a core file now (the second overwrote the first).

Thanks,
Chris
Hi Chris,

Would you send in a backtrace for us to review?

thread apply all bt full
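For reference, that backtrace can be generated non-interactively with gdb's batch mode. This is only a sketch: the binary and core file paths below are illustrative and will differ on your installation.

```shell
# Illustrative paths only - substitute your actual slurmctld binary and
# core file locations. -batch runs the -ex commands and exits, so the
# output can be redirected straight into a file to attach to the ticket.
gdb -batch \
    -ex 'thread apply all bt full' \
    /usr/sbin/slurmctld /var/spool/slurmctld/core > slurmctld-bt.txt
```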
Thanks, we're looking at the backtrace. Can you also upload a slurmctld log file?
Yeah, I don't care about those. I care about the period from when the job started running to the last mention of it in the logs.
(In reply to Nate Rini from comment #32)
> (In reply to Chris Samuel (NERSC) from comment #30)
> > Re resizing - the usual reason are people who select a heap of nodes, check
> > for available huge pages on the node and discard more than their required
> > threshold, shrinking their job down to just the ones with enough.
>
> Do they use `--exclude=` in their srun calls or do they call scontrol to
> resize the jobs explicitly?
>
> (In reply to Chris Samuel (NERSC) from comment #31)
> > But "Hi Nate" too...
>
> :D

To extend this question, do you have a specific example of (1) the full job submission script/command line, and (2) the exact command used to update the job? In particular, the job submission for job 23367113 would be helpful.
Hey Chris,

We're still looking at the problem and have made some progress. I don't have a complete answer yet, though, so I'll hold off until I do.

However, a more complete slurmctld log file (comment 21 and comment 22) and the answers to the questions in comment 33 would be helpful. If you don't know those answers, then just the log file would be fine.

Actually, I partially retract my statement in comment 22 about only caring about when the job started running: I would like to see the job submission and then when the job started running. I don't care about all the lines from the scheduler saying that the job is still pending, but I do care about any other types of log statements. A grep of the log for that job id will probably be good enough for now.
I've been working on a set of patches to fix some issues I found with job resizing and --no-kill. Basically, if you remove the batch host from the job, Slurm never kills the job. Slurm tries to kill the batch step, but it doesn't get killed all the way in slurmctld, so slurmctld thinks the batch step is still running even though it isn't. My patches will prevent this from happening.

However, I haven't been able to reproduce the abort that you hit - I don't know why the batch step's step_node_bitmap was NULL - so I'd like to continue working on that. I'm concerned that if my patches don't prevent that from happening, they'll just introduce more places where slurmctld could segfault (because I'm using step_node_bitmap). If you could get me the logs and other info that I requested earlier, that would help. It's possible that my patches prevent this situation from ever happening, because they'll reject attempts to remove the batch host from jobs, but I can't say for sure without knowing how step_node_bitmap became NULL for the batch step in the first place.

- Marshall
Update - I've made good progress on patches, but they're not quite done.
I also think I'll have working patches ready for review today that will at least prevent some bad behaviors with job resizing. They should prevent the segfault where it happened, but I still don't know how the batch step's step_node_bitmap was NULL in the first place.
*** Ticket 7519 has been marked as a duplicate of this ticket. ***
We have pushed a commit that prevents users from removing the batch host from their jobs when resizing them. This commit will be in 19.05.3. If users try to remove the batch host when resizing a job, they will receive the error message "Invalid node name specified for job ####". A more descriptive error message will be in the slurmctld log file:

error("%s: Batch host %s for %pJ is not in the requested node list %s. You cannot remove the batch host from a job when resizing.",
      __func__, job_ptr->batch_host, job_ptr, job_specs->req_nodes);

I will see if I can make the error message sent to the user more descriptive.

Since I still haven't verified exactly how the batch step's step_node_bitmap became NULL, this commit isn't necessarily a direct fix for the segfault that happened. However, it should (hopefully) close off the path that led to the segfault.

There are more issues that we're looking at, so I'm not closing this ticket yet. However, in light of this fix, do you still consider this a sev-2 issue, or can we drop it down to a sev-3?

commit 18279136e3f8e4b0c7ba8a4a09558041a8e5812e (HEAD -> slurm-19.05, origin/slurm-19.05, bug7499)
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Wed Sep 25 09:36:18 2019 -0600
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Sep 26 09:29:54 2019 -0600

    Do not remove batch host when resizing/shrinking a batch job

    Bug 7499
Hi Marshall,

(In reply to Marshall Garey from comment #63)
> We have pushed a commit that prevents users from removing the batch host
> from their jobs when resizing their jobs. This commit will be in 19.05.3.

Thanks so much for that! I'll look at it next week.

Do we have an idea when 19.05.3 might land?

I'll drop the priority on this as we have the protective patch in place.

All the best,
Chris
(In reply to Chris Samuel (NERSC) from comment #66)
> Do we have an idea when 19.05.3 might land?

19.05.3 is scheduled to be released next Thursday. However, there are a few high priority bugs that we need to complete for 19.05.3, so if they don't get done we will delay the release.
Hi Chris,

I'm still trying to figure out how step_node_bitmap and step_layout could be NULL for a running job's batch step. One thing that would really help is a grep of the slurmctld log file for job id 23367113 (comment 18). Feel free to upload every instance of it in the log, even the lines that are just the main and backfill schedulers looking at the job before it ran - I just want to make this easier and faster for you, and those are easy enough for me to filter out. I can mark the attachment as private after you upload it.
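For what it's worth, an extract like that is just a plain grep on the job id. The sketch below runs against a tiny stand-in log file so it is self-contained; on a real system you would point the grep at wherever SlurmctldLogFile lives (the sample lines and filenames here are made up for illustration).

```shell
# Build a tiny stand-in log so this sketch is self-contained; on a real
# controller you would grep the actual SlurmctldLogFile instead.
cat > slurmctld.log.sample <<'EOF'
[2019-07-31T14:10:01.000] sched: Allocate JobId=23367113 NodeList=nid00956
[2019-07-31T14:10:02.000] sched: Allocate JobId=99999999 NodeList=nid00001
[2019-07-31T14:19:44.234] Killing JobId=23367113 StepId=Extern on failed node nid00956
EOF

# Pull every line mentioning the job of interest, timestamps included,
# into a file that can be attached to the ticket:
grep 'JobId=23367113' slurmctld.log.sample > job23367113.log
cat job23367113.log
```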
I didn't mean to send an email update to everyone when adding 8011 to "see also". I added it as another data point, since we saw this error in the log:

> [2019-10-28T20:35:25.632] error: _pick_step_nodes: JobId=29764246 StepId=Batch has no step_node_bitmap
Hi Marshall,

Sorry for missing the previous request. Once we've got our systems back from the latest PSPS here, I'll go hunting for that.

All the best,
Chris
Chris,

I uploaded the slurmctld log file that you emailed me to this bug as a private attachment. I was able to reproduce this bug because of the log file you sent - thanks! It's rather straightforward:

* sbatch -N2 -wd9,d10 --wrap="sleep 10; srun sleep 789;"
* Restart slurmctld *before* the srun happens
* The batch step's step_node_bitmap will be NULL
* The step will start
* scontrol update jobid=### nodelist=<don't include the batch host>
* slurmctld will crash

So, my patch that landed in 19.05.3 will prevent slurmctld from trying to update the job to exclude the batch host, which will at least prevent that crash from happening. However, the batch step's step_node_bitmap is still NULL, so that's still a bug - there may be another crash waiting to happen if that situation comes up again. Now that I have an easy reproducer, I should be able to find a fix for it.

- Marshall
Hi Marshall, That's great to hear, thank you! All the best, Chris
Closing as resolved/fixed.

This has been fixed by the following commits, all in 20.02. They can't go into 19.05 because they require a change to the protocol for StateSaveLocation.

commit 71531fa500025f9e33d1a269e130ff3df50c6203 (HEAD -> master, origin/master, origin/HEAD, bug7499)
Author:     Brian Christiansen <brian@schedmd.com>
AuthorDate: Tue Jan 28 14:06:42 2020 -0700
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Wed Jan 29 10:52:06 2020 -0700

    Pack switch_job if not NULL regardless of step type

    Bug 7499

commit df2838bf1578986574f05f80a5effbfa323fdb49
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Jan 28 14:06:42 2020 -0700
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Wed Jan 29 10:52:06 2020 -0700

    Pack step_layout regardless of type

    The batch step has a step_layout. pack_slurm_step_layout() will only
    pack step_layout if it's not NULL.

    Bug 7499

commit 1d296cc257ff0f6900973ae2c4ed4c81b398c72e
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Jan 28 14:06:42 2020 -0700
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Wed Jan 29 10:52:06 2020 -0700

    Delete step record regardless of type if step_node_bitmap is NULL

    The batch step had a NULL step_node_bitmap and wasn't deleted.

    Bug 7499

And here's the commit that landed in 19.05.3 (a while ago now):

commit 18279136e3f8e4b0c7ba8a4a09558041a8e5812e (HEAD -> slurm-19.05, origin/slurm-19.05, bug7499)
Author:     Marshall Garey <marshall@schedmd.com>
AuthorDate: Wed Sep 25 09:36:18 2019 -0600
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Sep 26 09:29:54 2019 -0600

    Do not remove batch host when resizing/shrinking a batch job

    Bug 7499
Wonderful, thanks Marshall!
*** Ticket 8135 has been marked as a duplicate of this ticket. ***
Opening this ticket to public.