I have seen several jobs that were submitted to multiple partitions, but when scheduled they run on nodes that belong to one partition while being labeled as running in a different partition. Below is an example. This job:

3474320 itc_clust Z34.sbat slucchin  R 1-16:57:51      5 holy8a[26108,28109,28303,28312,28401]

[root@holy8a24507 general]# scontrol show job 3474320
JobId=3474320 JobName=Z34.sbatch.sh
   UserId=slucchini(64217) GroupId=hernquist_lab(33234) MCS_label=N/A
   Priority=4459058 Nice=0 Account=hernquist_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-17:01:45 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2025-02-11T14:05:15 EligibleTime=2025-02-11T14:05:15
   AccrueTime=2025-02-11T14:05:15
   StartTime=2025-02-11T15:36:35 EndTime=2025-02-14T15:36:35 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-02-11T15:36:35 Scheduler=Backfill
   Partition=itc_cluster AllocNode:Sid=holylogin06:2459539
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=holy8a[26108,28109,28303,28312,28401]
   BatchHost=holy8a26108
   NumNodes=5 NumCPUs=112 NumTasks=112 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=112,mem=873600M,node=1,billing=126
   AllocTRES=cpu=112,mem=873600M,node=5,billing=126
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34/Z34.sbatch.sh
   WorkDir=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34
   StdErr=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34/jobfile.err
   StdIn=/dev/null
   StdOut=/n/netscratch/hernquist_lab/Lab/slucchini/Arepo/Cosmo/zooms/Z34/jobfile.out

says it is running in itc_cluster.
However, those nodes are not part of itc_cluster, which is defined this way:

[root@holy8a24507 general]# scontrol show partition itc_cluster
PartitionName=itc_cluster
   AllowGroups=itc_lab,slurm-admin,slurm_group_itc AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holy8a[24507-24512,24605-24612,26407-26412]
   PriorityJobFactor=1 PriorityTier=4 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2240 TotalNodes=20 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=2240,mem=20624340M,node=20,billing=2753
   TRESBillingWeights=CPU=0.6,Mem=0.07G

Rather, those nodes are part of the sapphire partition:

[root@holy8a24507 general]# scontrol show partition sapphire
PartitionName=sapphire
   AllowGroups=cluster_users,cluster_users_2,slurm-admin
   DenyAccounts=kempner_albergo_lab,kempner_alvarez_lab,kempner_ba_lab,kempner_barak_lab,kempner_bsabatini_lab,kempner_dam_lab,kempner_emalach_lab,kempner_gershman_lab,kempner_grads,kempner_hms,kempner_kdbrantley_lab,kempner_konkle_lab,kempner_krajan_lab,kempner_lab,kempner_murphy_lab,kempner_mzitnik_lab,kempner_pehlevan_lab,kempner_pslade_lab,kempner_sham_lab,kempner_sompolinsky_lab,kempner_undergrads,kempner_wcarvalho_lab
   AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holy8a[24105-24112,24201-24212,26105-26112,26201-26212,26301-26302,26401-26402,28101-28112,28201-28212,28301-28312,28401-28412,28501-28508,28601-28608,30101-30112,30201-30212,30301-30312,30401-30412,30501-30508,30601-30608,32501-32508,32601-32608]
   PriorityJobFactor=1 PriorityTier=3 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=21056 TotalNodes=188 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=21056,mem=193868796M,node=188,billing=25886
   TRESBillingWeights=CPU=0.6,Mem=0.07G

If you look at the original job script:

[root@holy8a24507 general]# sacct -B -j 3474320
Batch Script for 3474320
--------------------------------------------------------------------------------
#!/bin/sh
#SBATCH -p itc_cluster,sapphire,hernquist
#SBATCH -t 3-00:00:00
#SBATCH --ntasks=112
#SBATCH --mem-per-cpu=7800
#SBATCH -e jobfile.err
#SBATCH -o jobfile.out

OUTFOLDER="output"
OUTFILE="OUTPUT."$SLURM_JOB_ID

### CODE TO EXECUTE
srun -n $SLURM_NTASKS --mpi=pmix ./Arepo_smuggledev-dtmerge_noalphalim Z34.arepo_smuggle.params >> $OUTFOLDER/$OUTFILE

you can see it was submitted to three partitions. So it is fine that it is running on sapphire, but the scheduler thinks it is running on itc_cluster. This is bad: I have seen other examples where a job was submitted to both a non-preemptable and a preemptable partition, ended up running in the preemptable partition, but was labeled as running in the non-preemptable one. That job then cannot be preempted, because the scheduler incorrectly thinks it is in a non-preemptable partition. This causes jobs that should be able to preempt it to fail to do so and thus pend when they shouldn't.
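For reference, a quick way to confirm this kind of mismatch is to compare a job's allocated nodes against the Nodes= list of the partition it is labeled with. The following is only a rough sketch (the job ID is taken from this example, and the parsing assumes standard squeue/scontrol output on a recent Slurm):

#!/bin/bash
# Does a running job's allocated node list fall inside the Nodes= range of the
# partition the scheduler currently labels it with?
JOBID=${1:-3474320}   # example job ID

# Partition label and allocated node list as currently reported.
read -r PART NODELIST < <(squeue -j "$JOBID" -h -o "%P %N")

# Pull the Nodes= host expression out of that partition's definition.
PART_NODES=$(scontrol show partition "$PART" | tr ' ' '\n' | grep -oP '^Nodes=\K\S+')

# Expand both host lists; any output lists nodes the job holds that are
# outside its labeled partition.
comm -23 <(scontrol show hostnames "$NODELIST" | sort) \
         <(scontrol show hostnames "$PART_NODES" | sort)

For the job above this would print all five allocated nodes, since none of them fall within itc_cluster's Nodes= range.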
To me this is a major issue, as jobs should always be labeled with the partition they actually end up running in. This wasn't an issue in 24.05 or any previous version, so a bug must have been introduced in 24.11.
Created attachment 40779: Current slurm.conf
Created attachment 40780: Current topology.conf
Sorry to hear you are impacted by this issue.

It looks like a duplicate of Ticket 22010, but I need you to confirm that you issued something like "scontrol reconfigure", or fully restarted the slurmctld daemon, between the job's start and the moment you queried the job info with squeue or scontrol show job.

It is also worth checking sacct for the job's partition, which should still be the one assigned when the job started to run, rather than the one the job appears to have picked up after the controller's restart.

Regards,
Carlos.
Yes, sacct is reporting the right thing:

[root@holy8a24507 general]# sacct -j 3474320
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3474320      Z34.sbatc+   sapphire hernquist+        112    RUNNING      0:0
3474320.bat+      batch            hernquist+         36    RUNNING      0:0
3474320.ext+     extern            hernquist+        112    RUNNING      0:0
3474320.0    Arepo_smu+            hernquist+        112    RUNNING      0:0

As for the restart/reconfigure question: yes, we have done that multiple times between the job's start and looking at it now. I can't see Ticket 22010, but given what you are saying here, this is likely the same issue.

-Paul Edmon-
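As a side note, a cluster-wide sweep for other mislabeled jobs can be done by comparing the partition squeue reports for each running job with the partition sacct recorded when the job started. This is just a sketch under those assumptions (standard squeue/sacct output; the 40-character field width is arbitrary):

#!/bin/bash
# Flag running jobs whose live (squeue) partition label disagrees with the
# partition sacct recorded at start time.
squeue -t RUNNING -h -o "%A %P" | while read -r jobid squeue_part; do
    sacct_part=$(sacct -j "$jobid" -X -n -o Partition%40 | tr -d ' ')
    if [ -n "$sacct_part" ] && [ "$sacct_part" != "$squeue_part" ]; then
        echo "Job $jobid: squeue says $squeue_part, sacct says $sacct_part"
    fi
done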
Thanks for confirming the details.

I can say that I have proposed a fix for this, and it is under review right now. The review is progressing without any plot twists so far. If everything goes smoothly, I'll hand you the commit(s) for this fix in a reasonable time.

I know this is a true sev2 issue. So, if you are willing to patch your Slurm manually, I can share an early-access patch with you. It may not be the final fix, and we cannot promise it won't have any side effects, but so far so good: it seems to be fixing the issue perfectly.

Regards,
Carlos.
Yeah, I would like to patch sooner rather than later if one is available, as this is impacting preemption and hence scheduling efficiency. I already patched for the --test-only requeue issue yesterday (see: https://support.schedmd.com/show_bug.cgi?id=21975#c49).

That said, if this will be in 24.11.2 and that release is imminent (assuming the fix lands in it), I'd rather just grab the full block of patches. It will be a trade-off depending on timing. Things are stable, so the scheduler is not broken per se; I'd rather keep the scheduler running even in this state than merge a patch that may make things worse. I trust your QA, but again, it's the unknown unknowns, and I tend to trust formal releases more than patches.

Keep me posted.

-Paul Edmon-
Hi Paul,

These commits are the official fix we pushed to the repo for 24.11:

5c21c47c - Refactor _get_part_list() to set part_ptr_list and part_ptr
72f9552b - Refactor code to one call
50bbd2b0 - Fix multi-partition, running job getting wrong partition on restart

I recommend you apply these patches in the meantime, until we release the next 24.11 minor version (24.11.2).
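For anyone wanting to pick these up ahead of 24.11.2, one approach is to cherry-pick the three commits onto the 24.11 source tree you build from. This is only a sketch, not an official procedure; the branch name below is an assumption and should be replaced with whatever matches your deployed 24.11 build:

git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-24.11                    # assumed branch; use the tag matching your deployed 24.11 release
git cherry-pick 5c21c47c 72f9552b 50bbd2b0  # the three fix commits listed above
# Rebuild your packages (rpmbuild, make install, etc.) and restart slurmctld.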
Quick question: do you have an ETA for when 24.11.2 will be released?

On 2/14/25 3:55 AM, bugs@schedmd.com wrote:
> Carlos Tripiana Montes changed ticket 22076:
>            What    Removed    Added
>          Status    OPEN       RESOLVED
>      Resolution    ---        FIXED
>   Version Fixed               24.11.2, 25.05.0rc1
AFAIK the tentative date is around Feb 25th, so about 10 days from today. But it is tentative, not fixed yet. In any case, I don't expect it to be delayed by much.