| Summary | Multipartition Jobs Labelled with Wrong Partition | | |
|---|---|---|---|
| Product | Slurm | Reporter | Paul Edmon <pedmon> |
| Component | Scheduling | Assignee | Carlos Tripiana Montes <tripiana> |
| Status | RESOLVED FIXED | Severity | 2 - High Impact |
| Version | 24.11.1 | Version Fixed | 24.11.2, 25.05.0rc1 |
| Hardware | Linux | OS | Linux |
| Site | Harvard University | | |

Attachments:
- Current slurm.conf
- Current topology.conf
Description
Paul Edmon
2025-02-13 06:43:43 MST

Created attachment 40779 [details]: Current slurm.conf
Created attachment 40780 [details]: Current topology.conf
Comment 3
Carlos Tripiana Montes
2025-02-13

Sorry to hear you are impacted by this issue. It looks like a duplicate of Ticket 22010, but I need you to confirm that you issued something like "scontrol reconfigure", or fully restarted the slurmctld daemon, between the job's start and the moment you queried the job info with squeue or scontrol show job.

It is also worth checking sacct for the job's partition, which should still be the one assigned when the job started to run, rather than the one it appears to get assigned after the controller's restart.

Regards,
Carlos.

Paul Edmon
2025-02-13

Yes, sacct is reporting the right thing:

    [root@holy8a24507 general]# sacct -j 3474320
    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    3474320      Z34.sbatc+   sapphire hernquist+        112    RUNNING      0:0
    3474320.bat+      batch            hernquist+         36    RUNNING      0:0
    3474320.ext+     extern            hernquist+        112    RUNNING      0:0
    3474320.0    Arepo_smu+            hernquist+        112    RUNNING      0:0

As for the restart/reconfigure question, yes, we have done that multiple times between the job's start and looking at it now. I can't see Ticket 22010, but given what you are saying here this is likely the same issue.

-Paul Edmon-

Comment 5
Carlos Tripiana Montes
2025-02-13

Thanks for confirming the details. I have proposed a fix for this, and it is under review right now. So far the review is progressing without any plot twists; if everything goes smoothly, I will hand you the commit(s) for this fix in reasonable time.

I know this is a true sev2 issue, so if you are willing to patch your Slurm manually, I can share an early-access patch with you. It may not be the final fix, and we cannot promise it won't have any side effects, but so far it seems to fix the issue perfectly.

Regards,
Carlos.

Paul Edmon
2025-02-13

Yeah, I would like to patch sooner rather than later if one is available, as this is impacting preemption and hence scheduling efficiency. I already patched for the --test-only requeue issue yesterday (see: https://support.schedmd.com/show_bug.cgi?id=21975#c49). That said, if this fix will be in 24.11.2 and that release is imminent, I'd rather just grab the full block of patches. It will be a trade-off depending on timing. Things are stable, so the scheduler is not broken per se, and I'd rather keep it running even in this state than merge a patch that may make things worse. I trust your QA, but again, it's the unknown unknowns, and I tend to trust formal releases more than patches. Keep me posted.

-Paul Edmon-
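The check Carlos asks for above can be run directly from a shell. The following is a minimal sketch using standard Slurm client commands; the job ID 3474320 is the example from this ticket, and the second partition name and the systemd unit name are placeholders to substitute for your site.

```sh
# A job submitted with a multi-partition list, e.g.
#   sbatch -p sapphire,<other_partition> job.sh
# starts in exactly one of those partitions.

# 1. What the controller reports while the job is running.
squeue -j 3474320 -o "%i %P %T"
scontrol show job 3474320 | grep -o "Partition=[^ ]*"

# 2. Trigger the condition discussed in this ticket: reconfigure or restart the controller.
scontrol reconfigure            # or: systemctl restart slurmctld (unit name may differ per site)

# 3. What accounting recorded: the partition assigned when the job started to run.
sacct -j 3474320 --format=JobID,JobName,Partition,State

# On an affected 24.11.1 controller, repeating step 1 after step 2 can show a
# different partition from the submitted list, while sacct keeps the original one.
```

The divergence between squeue/scontrol and sacct is the signature of this bug.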
Comment 7
Carlos Tripiana Montes
2025-02-14

(Status changed: OPEN -> RESOLVED FIXED; Version Fixed set to 24.11.2, 25.05.0rc1.)

Hi Paul,

These commits are the official fix we pushed to the repo for 24.11:

5c21c47c - Refactor _get_part_list() to set part_ptr_list and part_ptr
72f9552b - Refactor code to one call
50bbd2b0 - Fix multi-partition, running job getting wrong partition on restart

I recommend you apply these patches in the meantime, until we release the next minor version for 24.11 (24.11.2).

Paul Edmon
2025-02-14

Quick question: do you have an ETA for when 24.11.2 will be released?

Carlos Tripiana Montes

AFAIK, the tentative date is around Feb 25th, so about 10 days from today. But it is tentative, not fixed yet. In any case, I do not expect it to be delayed by much.
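For sites that want the fix ahead of 24.11.2, one possible approach is to cherry-pick the three commits listed in comment 7 onto a 24.11 source tree and rebuild. This is only a sketch: it assumes a checkout of the SchedMD GitHub mirror on the slurm-24.11 branch and a plain autotools build, and the configure prefix is a placeholder for your site's layout.

```sh
# Sketch: apply the fix commits to a local 24.11 tree before 24.11.2 is released.
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-24.11                # branch assumed to match your installed 24.11.x

# Cherry-pick the commits named in comment 7.
git cherry-pick 5c21c47c 72f9552b 50bbd2b0

# Rebuild and reinstall with your usual site options, then restart the controller.
./configure --prefix=/usr/local/slurm   # placeholder prefix; use your site's configure flags
make -j"$(nproc)"
make install
systemctl restart slurmctld             # unit name may differ per site
```

Waiting for the official 24.11.2 release, as Paul chose to do, avoids carrying a local patch at the cost of running with the mislabelled partitions until then.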