Ticket 22076

Summary:	Multipartition Jobs Labelled with Wrong Partition
Product:	Slurm	Reporter:	Paul Edmon <pedmon>
Component:	Scheduling	Assignee:	Carlos Tripiana Montes <tripiana>
Status:	RESOLVED FIXED	QA Contact:
Severity:	2 - High Impact
Priority:	---
Version:	24.11.1
Hardware:	Linux
OS:	Linux
Site:	Harvard University	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	24.11.2, 25.05.0rc1
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	Current slurm.conf Current topology.conf

Description Paul Edmon 2025-02-13 06:43:43 MST

I have seen several jobs that were submitted to multiple partitions but when scheduled they run on nodes that are part of one partition but are labeled as running on a different partition.  Below is an example.

This job:

           3474320 itc_clust Z34.sbat slucchin  R 1-16:57:51      5 holy8a[26108,28109,28303,28312,28401]

[root@holy8a24507 general]# scontrol show job 3474320
JobId=3474320 JobName=Z34.sbatch.sh
   UserId=slucchini(64217) GroupId=hernquist_lab(33234) MCS_label=N/A
   Priority=4459058 Nice=0 Account=hernquist_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-17:01:45 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2025-02-11T14:05:15 EligibleTime=2025-02-11T14:05:15
   AccrueTime=2025-02-11T14:05:15
   StartTime=2025-02-11T15:36:35 EndTime=2025-02-14T15:36:35 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-02-11T15:36:35 Scheduler=Backfill
   Partition=itc_cluster AllocNode:Sid=holylogin06:2459539
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=holy8a[26108,28109,28303,28312,28401]
   BatchHost=holy8a26108
   NumNodes=5 NumCPUs=112 NumTasks=112 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=112,mem=873600M,node=1,billing=126
   AllocTRES=cpu=112,mem=873600M,node=5,billing=126
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34/Z34.sbatch.sh
   WorkDir=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34
   StdErr=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34/jobfile.err
   StdIn=/dev/null
   StdOut=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34/jobfile.out

Says it is running in itc_cluster. However those nodes are not part of itc_cluster which is defined this way:

[root@holy8a24507 general]# scontrol show partition itc_cluster
PartitionName=itc_cluster
   AllowGroups=itc_lab,slurm-admin,slurm_group_itc AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holy8a[24507-24512,24605-24612,26407-26412]
   PriorityJobFactor=1 PriorityTier=4 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2240 TotalNodes=20 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=2240,mem=20624340M,node=20,billing=2753
   TRESBillingWeights=CPU=0.6,Mem=0.07G

Rather those nodes are a part of the sapphire partition:

[root@holy8a24507 general]# scontrol show partition sapphire
PartitionName=sapphire
   AllowGroups=cluster_users,cluster_users_2,slurm-admin DenyAccounts=kempner_albergo_lab,kempner_alvarez_lab,kempner_ba_lab,kempner_barak_lab,kempner_bsabatini_lab,kempner_dam_lab,kempner_emalach_lab,kempner_gershman_lab,kempner_grads,kempner_hms,kempner_kdbrantley_lab,kempner_konkle_lab,kempner_krajan_lab,kempner_lab,kempner_murphy_lab,kempner_mzitnik_lab,kempner_pehlevan_lab,kempner_pslade_lab,kempner_sham_lab,kempner_sompolinsky_lab,kempner_undergrads,kempner_wcarvalho_lab AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holy8a[24105-24112,24201-24212,26105-26112,26201-26212,26301-26302,26401-26402,28101-28112,28201-28212,28301-28312,28401-28412,28501-28508,28601-28608,30101-30112,30201-30212,30301-30312,30401-30412,30501-30508,30601-30608,32501-32508,32601-32608]
   PriorityJobFactor=1 PriorityTier=3 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=21056 TotalNodes=188 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=21056,mem=193868796M,node=188,billing=25886
   TRESBillingWeights=CPU=0.6,Mem=0.07G

If you look at the original job script:

[root@holy8a24507 general]# sacct -B -j 3474320
Batch Script for 3474320
--------------------------------------------------------------------------------
#!/bin/sh
#SBATCH -p itc_cluster,sapphire,hernquist
#SBATCH -t 3-00:00:00
#SBATCH --ntasks=112
#SBATCH --mem-per-cpu=7800
#SBATCH -e jobfile.err
#SBATCH -o jobfile.out

OUTFOLDER="output"
OUTFILE="OUTPUT."$SLURM_JOB_ID

### CODE TO EXECUTE

srun -n $SLURM_NTASKS --mpi=pmix ./Arepo_smuggledev-dtmerge_noalphalim Z34.arepo_smuggle.params >> $OUTFOLDER/$OUTFILE


You can see it was submitted to three partitions. So it is fine that it is running on sapphire, but scheduler thinks its running on itc_cluster. This is bad because I have seen other examples where a job was submited to both a nonpreemptable and preemptable partition, the job ends up running in the preemptable partition but is labelled as running in the nonpreemptable partition and thus cannot be preempted because the scheduler incorrectly thinks it is running in a nonpreemptable partition. This causes jobs that should be able to preempt that job to fail to preempt and thus pend when they shouldn't.

To me this is a major issue as jobs should always be labeled by the partition they end up running in. This wasn't an issue in 24.05 or any previous version so a bug must have been introduced in 24.11

Comment 1 Paul Edmon 2025-02-13 06:48:44 MST

Created attachment 40779 [details]
Current slurm.conf

Comment 2 Paul Edmon 2025-02-13 06:48:57 MST

Created attachment 40780 [details]
Current topology.conf

Comment 3 Carlos Tripiana Montes 2025-02-13 07:25:23 MST

Sorry to hear you are impacted by such issue.

It seems like a duplicate of Ticket 22010, but I need you to confirm that you issued something like "scontrol reconfigure", or fully restarted slurmctld daemon, between the job's start and the moment you queried the job info with squeue or scontrol (show job).

Also worth to request you to check sacct for the job's partition, which must still be the one assigned when the job started to run, rather than the one it seems to get assigned after controller's restart.

Regards,
Carlos.

Comment 4 Paul Edmon 2025-02-13 07:32:27 MST

Yes, sacct is reporting the right thing:

[root@holy8a24507 general]# sacct -j 3474320
JobID           JobName  Partition    Account  AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3474320      Z34.sbatc+   sapphire hernquist+        112 RUNNING      0:0
3474320.bat+      batch            hernquist+         36 RUNNING      0:0
3474320.ext+     extern            hernquist+        112 RUNNING      0:0
3474320.0    Arepo_smu+            hernquist+        112 RUNNING      0:0

As for the restart/reconfigure question, yes we have done that multiple 
times between the job start and looking at it now.  I can't see Ticket 
22010 but given what you are saying here this is likely the same issue.

-Paul Edmon-

On 2/13/2025 9:25 AM, bugs@schedmd.com wrote:
>
> *Comment # 3 <https://support.schedmd.com/show_bug.cgi?id=22076#c3> on 
> ticket 22076 <https://support.schedmd.com/show_bug.cgi?id=22076> from 
> Carlos Tripiana Montes <mailto:tripiana@schedmd.com> *
> Sorry to hear you are impacted by such issue.
>
> It seems like a duplicate ofTicket 22010 <show_bug.cgi?id=22010>, but I need you to confirm that you
> issued something like "scontrol reconfigure", or fully restarted slurmctld
> daemon, between the job's start and the moment you queried the job info with
> squeue or scontrol (show job).
>
> Also worth to request you to check sacct for the job's partition, which must
> still be the one assigned when the job started to run, rather than the one it
> seems to get assigned after controller's restart.
>
> Regards,
> Carlos.
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 5 Carlos Tripiana Montes 2025-02-13 08:11:31 MST

Thanks for confirming the details.

I can say that I proposed a fix for that, and it's under review right now. I can, more or less, say the review process is evolving w/o any plot-twist by now. If everything goes smooth, I'll handle you the commit(s) for this fix in reasonable time.

I know, this is a true sev2 issue. So, if you are willing to manually patch your slurm, I can share an early access patch with you. It doesn't need to be the final fix for that, and we cannot promise it won't have any side effects, but so far so good, it seems to be fixing the issue perfectly.

Regards,
Carlos.

Comment 6 Paul Edmon 2025-02-13 08:16:17 MST

Yeah, I would like to patch sooner rather than later if one is available 
as it is impacting preemption and hence scheduling efficiency.  I've 
already patched for the --test-only requeue issue yesterday (see: 
https://support.schedmd.com/show_bug.cgi?id=21975#c49).

That said if this will be in 24.11.2 and if that is imminent soon 
(assuming this is in that), I'd rather just grab the full block of patches.

It will be a trade off, depending on timing. Things are stable so the 
scheduler is not broken perse, so I'd rather keep the scheduler running 
even in this state then merge a patch that may make things worse. I 
trust your QA, but again its the unknown unknowns and I tend to trust 
formal releases more than patches.

Keep me posted.

-Paul Edmon-

On 2/13/2025 10:11 AM, bugs@schedmd.com wrote:
>
> *Comment # 5 <https://support.schedmd.com/show_bug.cgi?id=22076#c5> on 
> ticket 22076 <https://support.schedmd.com/show_bug.cgi?id=22076> from 
> Carlos Tripiana Montes <mailto:tripiana@schedmd.com> *
> Thanks for confirming the details.
>
> I can say that I proposed a fix for that, and it's under review right now. I
> can, more or less, say the review process is evolving w/o any plot-twist by
> now. If everything goes smooth, I'll handle you the commit(s) for this fix in
> reasonable time.
>
> I know, this is a true sev2 issue. So, if you are willing to manually patch
> your slurm, I can share an early access patch with you. It doesn't need to be
> the final fix for that, and we cannot promise it won't have any side effects,
> but so far so good, it seems to be fixing the issue perfectly.
>
> Regards,
> Carlos.
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 7 Carlos Tripiana Montes 2025-02-14 01:55:22 MST

Hi Paul,

These commits, are the official fix we pushed to the repo for 24.11:

5c21c47c - Refactor _get_part_list() to set part_ptr_list and part_ptr
72f9552b - Refactor code to one call
50bbd2b0 - Fix multi-partition, running job getting wrong partition on restart

I recommend you to apply such patch in the meantime we release the next minor for 24.11 (24.11.2).

Comment 8 Paul Edmon 2025-02-14 08:13:58 MST

Quick question. Do you have an ETA for when 24.11.2 will be released?

On 2/14/25 3:55 AM, bugs@schedmd.com wrote:
> Carlos Tripiana Montes <mailto:tripiana@schedmd.com> changed ticket 
> 22076 <https://support.schedmd.com/show_bug.cgi?id=22076>
> What 	Removed 	Added
> Status 	OPEN 	RESOLVED
> Resolution 	--- 	FIXED
> Version Fixed 		24.11.2, 25.05.0rc1
>
> *Comment # 7 <https://support.schedmd.com/show_bug.cgi?id=22076#c7> on 
> ticket 22076 <https://support.schedmd.com/show_bug.cgi?id=22076> from 
> Carlos Tripiana Montes <mailto:tripiana@schedmd.com> *
> Hi Paul,
>
> These commits, are the official fix we pushed to the repo for 24.11:
>
> 5c21c47c - Refactor _get_part_list() to set part_ptr_list and part_ptr
> 72f9552b - Refactor code to one call
> 50bbd2b0 - Fix multi-partition, running job getting wrong partition on restart
>
> I recommend you to apply such patch in the meantime we release the next minor
> for 24.11 (24.11.2).
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 9 Carlos Tripiana Montes 2025-02-14 08:17:58 MST

AFAIK, tentative is for about Feb 25th, so in 10 days from today. But, it's tentative, not fixed yet. In any case, I expect not to be delayed for too much.