| Summary: | [CAST-34339] Slurmctld ghost AllocTRES | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian F Gilmer <brian.gilmer> |
| Component: | GPU | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.9 | ||
| Hardware: | Cray Shasta | ||
| OS: | Linux | ||
| Site: | CRAY | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | CSC COMPUTER SCIENCES LTD | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Brian F Gilmer
2023-10-16 10:17:12 MDT
Brian, I believe that this issue has been fixed in 23.02.2 with bug 16121. -Scott

*** This ticket has been marked as a duplicate of ticket 16121 ***

Can the fix be applied to 22.05?

Brian, We only push bug fixes to the current major version of Slurm. Would applying commit 5c3a7f6aaf work on 22.05? Probably. But it would be an unsupported install. Upgrading to 23.02 is our official recommendation. -Scott

From the customer Slurm expert: I just read the commit 5c3a7f6aaf. Looks like we have a workaround available: turn off requeueing jobs (set JobRequeue=0 and use a submit plugin to remove any requeue options from submission - not sure if this is enough to stop "scontrol update job=xxx requeue=1" though). Can we confirm that this would work as a workaround?

Brian, I think that workaround would work as you surmised, but scontrol could bypass it. I can't think of any way to completely forbid requeue right now. But if that is your best option, I would rather you just apply 5c3a7f6aaf to 22.05. 5c3a7f6aaf is a very simple commit that doesn't rely on any recent changes. I am quite confident it should work. -Scott

I took a look and we did rename gres_ctld_job_clear() in 23.02. So the diff should look like this for you.
diff --git a/src/slurmctld/node_scheduler.c b/src/slurmctld/node_scheduler.c
index 60abfb9137..c268aad893 100644
--- a/src/slurmctld/node_scheduler.c
+++ b/src/slurmctld/node_scheduler.c
@@ -2228,7 +2228,7 @@ static void _end_null_job(job_record_t *job_ptr)
 	job_ptr->exit_code = 0;
 	gres_ctld_job_clear(job_ptr->gres_list_req);
 	gres_ctld_job_clear(job_ptr->gres_list_req_accum);
-	gres_ctld_job_clear(job_ptr->gres_list_alloc);
+	FREE_NULL_LIST(job_ptr->gres_list_alloc);
 	job_ptr->job_state = JOB_RUNNING;
 	job_ptr->bit_flags |= JOB_WAS_RUNNING;
 	FREE_NULL_BITMAP(job_ptr->node_bitmap);
@@ -2652,7 +2652,7 @@ extern int select_nodes(job_record_t *job_ptr, bool test_only,
 	job_ptr->exit_code = 0;
 	gres_ctld_job_clear(job_ptr->gres_list_req);
 	gres_ctld_job_clear(job_ptr->gres_list_req_accum);
-	gres_ctld_job_clear(job_ptr->gres_list_alloc);
+	FREE_NULL_LIST(job_ptr->gres_list_alloc);
 	if (!job_ptr->step_list)
 		job_ptr->step_list = list_create(free_step_record);
Thanks, Changing the version of Slurm at this point isn't an option. This is from the user, in response to the question of why they can't test the controller restart:

"There is no opportunity to test it: we need the system 100% available from Friday evening, hence the urgency of this case and we have no reproducer for the issue. (Without the urgency, this would be a minor issue and no rush: if we lose a few dozen nodes a week and need to restart slurmctld during office hours, that's fine - it's just the critical acceptance weekend which is more complicated.) Also, as I mentioned in my previous comment, we'll go with the workaround untested, would just like to have a confirmation from SchedMD that we've understood the issue correctly and the workaround will indeed protect us. If not, it'll be a potential slurmctld restart in the middle of the night, which no one is comfortable with but we will do it - and ring the hotline, waking everyone up if the ctld does not come up as expected (I'm pretty confident it will, but it hasn't always done so)."

Brian, I guess, for now, stopping jobs from requeuing is probably their best option. Let me know if they have other questions on this topic. -Scott

Thanks, The customer got through the testing so this bug can be closed now.

Closing

*** This ticket has been marked as a duplicate of ticket 16121 ***
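
Editorial sketch of the workaround discussed in this ticket: alongside setting JobRequeue=0 in slurm.conf, a submit-time filter forces the requeue option off on every incoming job. The code below is only an illustration under stated assumptions. The struct `demo_job_desc_t`, the function `demo_job_submit()`, and the constant `DEMO_SUCCESS` are hypothetical stand-ins, not Slurm's plugin API; a real job_submit plugin works on Slurm's own job description structure and would need to be built against the Slurm headers. As noted in the thread, a filter like this may not cover "scontrol update job=xxx requeue=1".

```c
#include <stdint.h>

/* Hypothetical return code for this sketch (not a Slurm constant). */
#define DEMO_SUCCESS 0

/*
 * Hypothetical, simplified stand-in for the job description that a
 * submit plugin receives. Only the field relevant to the workaround
 * is modeled here.
 */
typedef struct {
	uint32_t job_id;
	uint16_t requeue;	/* nonzero: requeue requested; 0: no requeue */
} demo_job_desc_t;

/*
 * Submit-time filter: whatever the submission asked for, force the
 * requeue option off so the job cannot be requeued via its own options.
 */
static int demo_job_submit(demo_job_desc_t *job_desc)
{
	if (job_desc->requeue != 0)
		job_desc->requeue = 0;

	return DEMO_SUCCESS;
}
```

Calling `demo_job_submit()` on a description with `requeue` set to 1 leaves it at 0 and returns `DEMO_SUCCESS`. Overwriting the field unconditionally, rather than rejecting the job, keeps submissions working for users whose scripts still pass a requeue option.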