Ticket 22421 - Jobs Not Being Preempted in Lower Priority Partition
Summary: Jobs Not Being Preempted in Lower Priority Partition
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 24.11.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Will Shanks
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-03-24 07:49 MDT by Paul Edmon
Modified: 2025-04-01 12:04 MDT

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Current slurm.conf (69.83 KB, text/x-matlab)
2025-03-24 07:49 MDT, Paul Edmon
Details
Current topology.conf (4.78 KB, text/x-matlab)
2025-03-24 07:50 MDT, Paul Edmon
Details
slurmd log for holygpu8a18104 for 03-24-25 9-10am (43.23 KB, application/x-compressed)
2025-03-24 11:07 MDT, Paul Edmon
Details

Description Paul Edmon 2025-03-24 07:49:36 MDT
Occasionally we run into a situation where jobs from a higher priority partition will not preempt those in a lower priority partition, and thus the jobs get stuck spinning in the scheduler. Below is an example. I don't understand why the backfill loop will not schedule this job. The error I'm seeing is:

Mar 24 09:29:36 holy-slurm02 slurmctld[1004136]: slurmctld: preempted JobId=7292566 has been requeued to reclaim resources for JobId=7821033
Mar 24 09:29:36 holy-slurm02 slurmctld[1004136]: preempted JobId=7292566 has been requeued to reclaim resources for JobId=7821033
Mar 24 09:29:36 holy-slurm02 slurmctld[1004136]: sched/backfill: _start_job: Failed to start JobId=7821033 with holygpu8a[16103-16104,16201-16204,16302,16304,16401-16404,16502,16601,18101-18102,18201-18203,25204-25205,25305,25405-25406,27201,27301,27401,27503,29105-29106,29201-29203,29305-29306,29401-29402,31105-31106,31201-31203,31304] avail: Requested nodes are busy
Mar 24 09:31:23 holy-slurm02 slurmctld[1004136]: sched/backfill: _start_job: Failed to start JobId=7821033 with holygpu8a[16103-16104,16201-16204,16302,16304,16401-16404,16502,16601,18101-18102,18104,18201-18203,25205,25305,25405-25406,27201,27301,27401,27503,29105-29106,29201-29203,29305-29306,29401-29402,31105-31106,31201-31203,31304] avail: Requested nodes are busy

The job in question is:

[root@holy-slurm02 ~]# scontrol show job 7821033
JobId=7821033 JobName=architecture_config0_2036_model_id3
   UserId=usirin(61157) GroupId=idreos_lab(403333) MCS_label=N/A
   Priority=2762461 Nice=0 Account=idreos_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2025-03-23T13:36:22 EligibleTime=2025-03-23T13:36:22
   AccrueTime=2025-03-23T13:36:22
   StartTime=2025-03-24T09:30:13 EndTime=2025-03-24T21:30:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-24T09:30:12 Scheduler=Backfill:*
   Partition=seas_gpu AllocNode:Sid=holygpu7c26106:1191568
   ReqNodeList=(null) ExcNodeList=holygpu8a16301
   NodeList= SchedNodeList=holygpu8a18104
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=8,mem=32000M,node=1,billing=342,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryTRES=100M MinTmpDiskNode=0
   Features=a100|h100 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam/run_ic_job.sh
   WorkDir=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam
   StdErr=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam/./sbatch_outs/architecture_config0_2036_model_id3_7821033.err
   StdIn=/dev/null
   StdOut=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam/./sbatch_outs/architecture_config0_2036_model_id3_7821033.out
   MemPerTres=gpu:100
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=8

[root@holy-slurm02 ~]# sacct -B -j 7821033
Batch Script for 7821033
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -c 8                                        # Number of cores (-c)
#SBATCH --gres=gpu:1                                # GPU
#SBATCH -t 0-12:00                                  # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH --mem=32000                                 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o ./sbatch_outs/%x_%j.out                          # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e ./sbatch_outs/%x_%j.err                          # File to which STDERR will be written, %j inserts jobid
#SBATCH --constraint=a100|h100
#SBATCH -p seas_gpu
#SBATCH -x holygpu8a16301

# training setup
num_gpu=1
batch_size=64 # per-gpu

# ic fixed dimensions
# test variables
experiment_config_path=$1
use_jpeg_baseline=$2
model_id=$3
input_size=$4
port_num=$(python -c 'import socket; s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(("", 0)); print(s.getsockname()[1])')

echo "model_id in run_ic_job.sh: $model_id"
echo "port_num: $port_num"

python image_calculator_build_accuracy.py $input_size $num_gpu $batch_size $port_num $model_id $use_jpeg_baseline $experiment_config_path

This is the partition information for the partition the job is in:

[root@holy-slurm02 ~]# scontrol show partition seas_gpu
PartitionName=seas_gpu
   AllowGroups=seas,slurm-admin DenyAccounts=kempner_albergo_lab,kempner_alvarez_lab,kempner_ba_lab,kempner_barak_lab,kempner_bsabatini_lab,kempner_dam_lab,kempner_emalach_lab,kempner_gershman_lab,kempner_grads,kempner_hms,kempner_kdbrantley_lab,kempner_konkle_lab,kempner_krajan_lab,kempner_lab,kempner_murphy_lab,kempner_mzitnik_lab,kempner_pehlevan_lab,kempner_pslade_lab,kempner_sham_lab,kempner_sompolinsky_lab,kempner_undergrads,kempner_wcarvalho_lab AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holygpu8a[16101-16104,16201-16204,16301-16304,16401-16404,16501-16502,16601,18101-18104,18201-18203,25204-25205,25305-25306,25405-25406,27101,27201,27301,27401,27502-27503,29105-29106,29201-29203,29304-29306,29401-29402,31104-31106,31201-31203,31304-31306]
   PriorityJobFactor=1 PriorityTier=3 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4480 TotalNodes=57 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=4480,mem=56194919M,node=57,billing=83294,gres/gpu=228,gres/gpu:nvidia_a100-sxm4-80gb=124,gres/gpu:nvidia_h100_80gb_hbm3=104
   TRESBillingWeights=CPU=0.9,Mem=0.06G,Gres/gpu=333.2

The node the scheduler says it is trying to place the job on (SchedNodeList above) is:

[root@holy8a24507 ~]# /usr/bin/squeue -w holygpu8a18104
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7178100 gpu_reque 2020-10-   siacus  R       0:03      1 holygpu8a18104
           7177255 gpu_reque 2022-10-   siacus  R       0:59      1 holygpu8a18104
           7178285 gpu_reque 2020-12-   siacus  R       0:59      1 holygpu8a18104
           7178729 gpu_reque 2015-04-   siacus  R       0:59      1 holygpu8a18104

The partition that should be preempted is defined as:

[root@holy-slurm02 ~]# scontrol show partition gpu_requeue
PartitionName=gpu_requeue
   AllowGroups=cluster_users,cluster_users_2,slurm-admin DenyAccounts=kempner_albergo_lab,kempner_alvarez_lab,kempner_ba_lab,kempner_barak_lab,kempner_bsabatini_lab,kempner_dam_lab,kempner_emalach_lab,kempner_gershman_lab,kempner_grads,kempner_hms,kempner_kdbrantley_lab,kempner_konkle_lab,kempner_krajan_lab,kempner_lab,kempner_murphy_lab,kempner_mzitnik_lab,kempner_pehlevan_lab,kempner_pslade_lab,kempner_sham_lab,kempner_sompolinsky_lab,kempner_undergrads,kempner_wcarvalho_lab AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holygpu2c[0923,1121],holygpu7c[0920,1311,1315,1317,1323],holygpu8a[11101-11104,11201-11204,11301-11304,11401-11404,11501-11504,11601-11604,13101-13104,13201-13204,13301-13304,13401-13404,13501-13504,13601-13604,15101-15104,15201-15204,15301-15304,15401-15404,15501-15504,15601-15604,16101-16104,16201-16204,16301-16304,16401-16404,16501-16502,16601,17101-17104,17201-17204,17301-17304,17401-17404,17501-17504,17601-17604,18101-18104,18201-18204,18301-18304,18401-18404,18501-18502,18601-18602,19101-19106,19201-19206,19301-19306,19401-19406,19501-19506,19601-19606,22101-22106,22201-22206,22301-22306,22401-22406,22501-22506,22601-22606,24404-24405,25104-25106,25204-25206,25305-25306,25404-25406,26304,26504-26506,27101-27103,27201-27203,27301-27303,27401-27403,27501-27503,27601,29104-29106,29201-29203,29304-29306,29401-29402,29406,29504-29506,30505,31104-31106,31201-31203,31304-31306,31401-31402,31406,31504-31506]
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=22448 TotalNodes=278 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=22448,mem=334394881M,node=278,billing=172333,gres/gpu=1150,gres/gpu:nvidia_a100-sxm4-40gb=176,gres/gpu:nvidia_a100-sxm4-80gb=343,gres/gpu:nvidia_a100_1g.10gb=7,gres/gpu:nvidia_a100_1g.5gb=28,gres/gpu:nvidia_a40=16,gres/gpu:nvidia_h100_80gb_hbm3=568,gres/gpu:tesla_v100-pcie-32gb=8,gres/gpu:tesla_v100s-pcie-32gb=4
   TRESBillingWeights=CPU=0.5,Mem=0.125G,Gres/gpu=104.6

An example of one of the jobs that should be preempted is:

[root@holy8a24507 ~]# scontrol show job 7178100
JobId=7178100 JobName=2020-10-10.parquet
   UserId=siacus(62339) GroupId=siacus_lab(10662) MCS_label=N/A
   Priority=999999999 Nice=0 Account=siacus_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=16 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2025-03-24T07:44:51 EligibleTime=2025-03-24T07:46:52
   AccrueTime=2025-03-24T07:46:52
   StartTime=2025-03-24T09:39:30 EndTime=2025-03-26T09:39:30 Deadline=N/A
   PreemptEligibleTime=2025-03-24T09:39:30 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-24T09:39:30 Scheduler=Main
   Partition=gpu_requeue AllocNode:Sid=holylogin06:361304
   ReqNodeList=(null) ExcNodeList=holygpu7c0920,holygpu8a[19604,25104]
   NodeList=holygpu8a18104
   BatchHost=holygpu8a18104
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=25G,node=1,billing=108,gres/gpu=1
   AllocTRES=cpu=2,mem=25G,node=1,billing=108,gres/gpu=1,gres/gpu:nvidia_h100_80gb_hbm3=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryTRES=100M MinTmpDiskNode=0
   Features=h100|a100 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/netscratch/siacus_lab/Lab/scripts/process_USA-fasrc-cuda.sbatch
   WorkDir=/n/netscratch/siacus_lab/Lab/scripts
   StdErr=/n/netscratch/siacus_lab/Lab/log//tweets_USA/2020-10-10.parquet.log
   StdIn=/dev/null
   StdOut=/n/netscratch/siacus_lab/Lab/log//tweets_USA/2020-10-10.parquet.log
   MemPerTres=gpu:100
   TresPerNode=gres/gpu:1,gres/gpu:1
   TresPerTask=cpu=2
   MailUser=siacus@iq.harvard.edu MailType=FAIL


[root@holy8a24507 ~]# sacct -B -j 7178100
Batch Script for 7178100
--------------------------------------------------------------------------------
#!/bin/sh -l
# FILENAME: process_classify-batch.sbatch

#SBATCH --gres=gpu:1          # Request 1 GPU
#SBATCH --gpus-per-node=1     # Number of GPUs per node
#SBATCH --partition=gpu_requeue # Partition name for GPU jobs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue
#SBATCH --mail-user=siacus@iq.harvard.edu
#SBATCH --mail-type=FAIL       # Send email to above address at begin and end of job
#SBATCH --cpus-per-task=2                 # Request 2 CPU core
#SBATCH --constraint="h100|a100" # do not change this !!!!
#SBATCH --mem=25G
#SBATCH -x holygpu7c0920,holygpu8a25104,holygpu8a19604



# ANVIL: Manage processing environment, load compilers and applications.
# module purge
# # Load necessary modules
# module load modtree/gpu   # default gcc and cuda version too old
# module load cuda/11  # the version of cuda and gcc shold match on this cluster
# module load gcc/11
# module load anaconda
# module list
# conda activate jago

# FASRC:
module load nvhpc/23.7-fasrc01
module load cuda/12.2.0-fasrc01
module load gcc/12.2.0-fasrc01

# Print the hostname of the compute node on which this job is running.
hostname

cd /n/netscratch/siacus_lab/Lab/scripts

/n/home11/siacus/miniconda3/envs/cuda/bin/python classify-fasrc.py ${1}

While the individual priority of the gpu_requeue jobs is higher than that of the seas_gpu job, seas_gpu has a partition PriorityTier higher than gpu_requeue's and thus should preempt regardless of individual job priority. I don't see anything per se that would preclude preempting the gpu_requeue jobs. I tried dialing up the Backfill debugging info, but it didn't give me any more detail as to why those specific nodes could not be used other than that they were busy.
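
In slurm.conf terms, the relevant knobs from the two partition dumps above reduce to roughly the following sketch. It is illustrative only (the real values are in the attached slurm.conf), and the PreemptType and cluster-default PreemptMode lines are assumptions based on the partition-priority behavior described here:

# Sketch only -- actual values are in the attached slurm.conf.
PreemptType=preempt/partition_prio     # assumption: preemption driven by partition PriorityTier
PreemptMode=REQUEUE                    # assumption: cluster-wide default for handling victims

# From the scontrol output above: seas_gpu sits in a higher tier and is itself
# not preemptable, while gpu_requeue sits in a lower tier and is requeued.
PartitionName=seas_gpu    PriorityTier=3 PreemptMode=OFF
PartitionName=gpu_requeue PriorityTier=2 PreemptMode=REQUEUE

With preempt/partition_prio, the partition PriorityTier, not the per-job Priority values, is what should decide who preempts whom; that is the behavior being relied on here.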

Any insight into why this is happening would be appreciated. This could be a weird edge case in the scheduler that needs to be remediated.
Comment 1 Paul Edmon 2025-03-24 07:49:59 MDT
Created attachment 41238 [details]
Current slurm.conf
Comment 2 Paul Edmon 2025-03-24 07:50:14 MDT
Created attachment 41239 [details]
Current topology.conf
Comment 3 Will Shanks 2025-03-24 11:01:02 MDT
Hello Paul,

I'm still digging through all of the info you have already given, but would it be possible to get the slurmd log for the "busy" node the seas_gpu job won't run on too? Looks like it is holygpu8a18104.

-- Will
Comment 4 Paul Edmon 2025-03-24 11:07:29 MDT
Created attachment 41240 [details]
slurmd log for holygpu8a18104 for 03-24-25 9-10am
Comment 5 Paul Edmon 2025-03-24 11:08:10 MDT
Yup, I've attached it for the hour around that log snippet. Happy to pull more.

-Paul Edmon-

Comment 6 Will Shanks 2025-03-24 12:25:27 MDT
Thank you for the logs. On a first pass it looks like you have a lot of errors related to AcctGatherInterconnectType[1] in your slurm.conf, but I expect that is probably unrelated to the current issue.

>cat holygpu8a18104-03-24-25.log | grep slurm | grep error | cut -d' ' -f7- | sed 's/JOB [[:digit:]]\+ /JOB JOBID /' | sed 's/09:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]/TIME/' |sort | uniq -c | sort -n | tail -n5
>      1 stepd_cleanup: done with step (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
>     50 stepd_cleanup: done with step (rc[0xf]:Block device required, cleanup_rc[0x0]:No error)
>    111 debug levels are stderr='error', logfile='info', syslog='verbose'
>    111 error: *** JOB JOBID ON holygpu8a18104 CANCELLED AT 2025-03-24TTIME DUE TO PREEMPTION ***
>    222 error: TRES ic/sysfs not configured

I will let you know when I have a better idea of what is going on.

-- Will


[1]: https://slurm.schedmd.com/slurm.conf.html#OPT_acct_gather_interconnect/sysfs
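
If the goal is to keep collecting those interconnect counters rather than drop the plugin, the usual companion setting appears to be listing the matching TRES next to the plugin. The snippet below is an assumption and has not been checked against this site's slurm.conf; the AccountingStorageTRES value shown is only an example list:

# Assumed pairing for the sysfs interconnect plugin; append ic/sysfs to the
# existing AccountingStorageTRES list rather than replacing it.
AcctGatherInterconnectType=acct_gather_interconnect/sysfs
AccountingStorageTRES=gres/gpu,ic/sysfs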
Comment 7 Paul Edmon 2025-03-24 12:31:39 MDT
Thanks for the tip about sysfs. I totally missed that setting. I will get that fixed.

-Paul Edmon-

Comment 9 Will Shanks 2025-04-01 11:55:40 MDT
Hello,

I'm having trouble reproducing this. Would it be possible to set DebugFlags=Backfill[1] and SlurmctldDebug=debug3[2] for at least the duration of a few of these cycles and share the slurmctld.log? I'm hoping this will help me narrow down the search space for reproducing this locally.

-- Will

[1]:https://slurm.schedmd.com/slurm.conf.html#OPT_Backfill
[2]:https://slurm.schedmd.com/slurm.conf.html#OPT_debug3
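
A minimal sketch of toggling these at runtime, so slurmctld does not need a restart; revert once a bad cycle has been captured, substituting whatever debug level is normally configured:

# Raise logging while a stuck high-priority job is being observed...
scontrol setdebugflags +backfill
scontrol setdebug debug3

# ...then drop back afterwards (adjust to the level normally used).
scontrol setdebugflags -backfill
scontrol setdebug info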
Comment 10 Paul Edmon 2025-04-01 12:04:14 MDT
Sure, I can set it next time this happens. The issue is definitely sensitive to cluster state, as it's not happening right now. However, when I see a job in this state again I will hike the debugging and then dump the slurmctld log for you.

-Paul Edmon-
