Occasionally we run into a situation where jobs from a higher-priority partition will not preempt those in a lower-priority partition, and the higher-priority jobs then sit stuck while the scheduler spins on them. Below is an example. I don't understand why the backfill loop will not schedule this job. The error I'm seeing is:

Mar 24 09:29:36 holy-slurm02 slurmctld[1004136]: slurmctld: preempted JobId=7292566 has been requeued to reclaim resources for JobId=7821033
Mar 24 09:29:36 holy-slurm02 slurmctld[1004136]: preempted JobId=7292566 has been requeued to reclaim resources for JobId=7821033
Mar 24 09:29:36 holy-slurm02 slurmctld[1004136]: sched/backfill: _start_job: Failed to start JobId=7821033 with holygpu8a[16103-16104,16201-16204,16302,16304,16401-16404,16502,16601,18101-18102,18201-18203,25204-25205,25305,25405-25406,27201,27301,27401,27503,29105-29106,29201-29203,29305-29306,29401-29402,31105-31106,31201-31203,31304] avail: Requested nodes are busy
Mar 24 09:31:23 holy-slurm02 slurmctld[1004136]: sched/backfill: _start_job: Failed to start JobId=7821033 with holygpu8a[16103-16104,16201-16204,16302,16304,16401-16404,16502,16601,18101-18102,18104,18201-18203,25205,25305,25405-25406,27201,27301,27401,27503,29105-29106,29201-29203,29305-29306,29401-29402,31105-31106,31201-31203,31304] avail: Requested nodes are busy

The job in question is:

[root@holy-slurm02 ~]# scontrol show job 7821033
JobId=7821033 JobName=architecture_config0_2036_model_id3
   UserId=usirin(61157) GroupId=idreos_lab(403333) MCS_label=N/A
   Priority=2762461 Nice=0 Account=idreos_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2025-03-23T13:36:22 EligibleTime=2025-03-23T13:36:22
   AccrueTime=2025-03-23T13:36:22
   StartTime=2025-03-24T09:30:13 EndTime=2025-03-24T21:30:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-24T09:30:12 Scheduler=Backfill:*
   Partition=seas_gpu AllocNode:Sid=holygpu7c26106:1191568
   ReqNodeList=(null) ExcNodeList=holygpu8a16301
   NodeList= SchedNodeList=holygpu8a18104
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=8,mem=32000M,node=1,billing=342,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryTRES=100M MinTmpDiskNode=0
   Features=a100|h100 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam/run_ic_job.sh
   WorkDir=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam
   StdErr=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam/./sbatch_outs/architecture_config0_2036_model_id3_7821033.err
   StdIn=/dev/null
   StdOut=/n/home04/usirin/mygithub/imagecalculator_training/utku/simsiam/./sbatch_outs/architecture_config0_2036_model_id3_7821033.out
   MemPerTres=gpu:100
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=8

[root@holy-slurm02 ~]# sacct -B -j 7821033
Batch Script for 7821033
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -c 8                        # Number of cores (-c)
#SBATCH --gres=gpu:1                # GPU
#SBATCH -t 0-12:00                  # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH --mem=32000                 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o ./sbatch_outs/%x_%j.out  # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e ./sbatch_outs/%x_%j.err  # File to which STDERR will be written, %j inserts jobid
#SBATCH --constraint=a100|h100
#SBATCH -p seas_gpu
#SBATCH -x holygpu8a16301

# training setup
num_gpu=1
batch_size=64 # per-gpu

# ic fixed dimensions

# test variables
experiment_config_path=$1
use_jpeg_baseline=$2
model_id=$3
input_size=$4

port_num=$(python -c 'import socket; s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(("", 0)); print(s.getsockname()[1])')

echo "model_id in run_ic_job.sh: $model_id"
echo "port_num: $port_num"

python image_calculator_build_accuracy.py $input_size $num_gpu $batch_size $port_num $model_id $use_jpeg_baseline $experiment_config_path

This is the partition information for the partition it is in:

[root@holy-slurm02 ~]# scontrol show partition seas_gpu
PartitionName=seas_gpu
   AllowGroups=seas,slurm-admin DenyAccounts=kempner_albergo_lab,kempner_alvarez_lab,kempner_ba_lab,kempner_barak_lab,kempner_bsabatini_lab,kempner_dam_lab,kempner_emalach_lab,kempner_gershman_lab,kempner_grads,kempner_hms,kempner_kdbrantley_lab,kempner_konkle_lab,kempner_krajan_lab,kempner_lab,kempner_murphy_lab,kempner_mzitnik_lab,kempner_pehlevan_lab,kempner_pslade_lab,kempner_sham_lab,kempner_sompolinsky_lab,kempner_undergrads,kempner_wcarvalho_lab AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holygpu8a[16101-16104,16201-16204,16301-16304,16401-16404,16501-16502,16601,18101-18104,18201-18203,25204-25205,25305-25306,25405-25406,27101,27201,27301,27401,27502-27503,29105-29106,29201-29203,29304-29306,29401-29402,31104-31106,31201-31203,31304-31306]
   PriorityJobFactor=1 PriorityTier=3 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4480 TotalNodes=57 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=4480,mem=56194919M,node=57,billing=83294,gres/gpu=228,gres/gpu:nvidia_a100-sxm4-80gb=124,gres/gpu:nvidia_h100_80gb_hbm3=104
   TRESBillingWeights=CPU=0.9,Mem=0.06G,Gres/gpu=333.2

The node it notes it is trying to schedule it on is:

[root@holy8a24507 ~]# /usr/bin/squeue -w holygpu8a18104
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7178100 gpu_reque 2020-10-   siacus  R       0:03      1 holygpu8a18104
           7177255 gpu_reque 2022-10-   siacus  R       0:59      1 holygpu8a18104
           7178285 gpu_reque 2020-12-   siacus  R       0:59      1 holygpu8a18104
           7178729 gpu_reque 2015-04-   siacus  R       0:59      1 holygpu8a18104

The partition that should be preempted is defined as:

[root@holy-slurm02 ~]# scontrol show partition gpu_requeue
PartitionName=gpu_requeue
   AllowGroups=cluster_users,cluster_users_2,slurm-admin DenyAccounts=kempner_albergo_lab,kempner_alvarez_lab,kempner_ba_lab,kempner_barak_lab,kempner_bsabatini_lab,kempner_dam_lab,kempner_emalach_lab,kempner_gershman_lab,kempner_grads,kempner_hms,kempner_kdbrantley_lab,kempner_konkle_lab,kempner_krajan_lab,kempner_lab,kempner_murphy_lab,kempner_mzitnik_lab,kempner_pehlevan_lab,kempner_pslade_lab,kempner_sham_lab,kempner_sompolinsky_lab,kempner_undergrads,kempner_wcarvalho_lab AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=holygpu2c[0923,1121],holygpu7c[0920,1311,1315,1317,1323],holygpu8a[11101-11104,11201-11204,11301-11304,11401-11404,11501-11504,11601-11604,13101-13104,13201-13204,13301-13304,13401-13404,13501-13504,13601-13604,15101-15104,15201-15204,15301-15304,15401-15404,15501-15504,15601-15604,16101-16104,16201-16204,16301-16304,16401-16404,16501-16502,16601,17101-17104,17201-17204,17301-17304,17401-17404,17501-17504,17601-17604,18101-18104,18201-18204,18301-18304,18401-18404,18501-18502,18601-18602,19101-19106,19201-19206,19301-19306,19401-19406,19501-19506,19601-19606,22101-22106,22201-22206,22301-22306,22401-22406,22501-22506,22601-22606,24404-24405,25104-25106,25204-25206,25305-25306,25404-25406,26304,26504-26506,27101-27103,27201-27203,27301-27303,27401-27403,27501-27503,27601,29104-29106,29201-29203,29304-29306,29401-29402,29406,29504-29506,30505,31104-31106,31201-31203,31304-31306,31401-31402,31406,31504-31506]
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=22448 TotalNodes=278 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=22448,mem=334394881M,node=278,billing=172333,gres/gpu=1150,gres/gpu:nvidia_a100-sxm4-40gb=176,gres/gpu:nvidia_a100-sxm4-80gb=343,gres/gpu:nvidia_a100_1g.10gb=7,gres/gpu:nvidia_a100_1g.5gb=28,gres/gpu:nvidia_a40=16,gres/gpu:nvidia_h100_80gb_hbm3=568,gres/gpu:tesla_v100-pcie-32gb=8,gres/gpu:tesla_v100s-pcie-32gb=4
   TRESBillingWeights=CPU=0.5,Mem=0.125G,Gres/gpu=104.6

An example of one of the jobs that should be preempted is:

[root@holy8a24507 ~]# scontrol show job 7178100
JobId=7178100 JobName=2020-10-10.parquet
   UserId=siacus(62339) GroupId=siacus_lab(10662) MCS_label=N/A
   Priority=999999999 Nice=0 Account=siacus_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=16 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2025-03-24T07:44:51 EligibleTime=2025-03-24T07:46:52
   AccrueTime=2025-03-24T07:46:52
   StartTime=2025-03-24T09:39:30 EndTime=2025-03-26T09:39:30 Deadline=N/A
   PreemptEligibleTime=2025-03-24T09:39:30 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-24T09:39:30 Scheduler=Main
   Partition=gpu_requeue AllocNode:Sid=holylogin06:361304
   ReqNodeList=(null) ExcNodeList=holygpu7c0920,holygpu8a[19604,25104]
   NodeList=holygpu8a18104
   BatchHost=holygpu8a18104
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=25G,node=1,billing=108,gres/gpu=1
   AllocTRES=cpu=2,mem=25G,node=1,billing=108,gres/gpu=1,gres/gpu:nvidia_h100_80gb_hbm3=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryTRES=100M MinTmpDiskNode=0
   Features=h100|a100 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/netscratch/siacus_lab/Lab/scripts/process_USA-fasrc-cuda.sbatch
   WorkDir=/n/netscratch/siacus_lab/Lab/scripts
   StdErr=/n/netscratch/siacus_lab/Lab/log//tweets_USA/2020-10-10.parquet.log
   StdIn=/dev/null
   StdOut=/n/netscratch/siacus_lab/Lab/log//tweets_USA/2020-10-10.parquet.log
   MemPerTres=gpu:100
   TresPerNode=gres/gpu:1,gres/gpu:1
   TresPerTask=cpu=2
   MailUser=siacus@iq.harvard.edu MailType=FAIL

[root@holy8a24507 ~]# sacct -B -j 7178100
Batch Script for 7178100
--------------------------------------------------------------------------------
#!/bin/sh -l
# FILENAME: process_classify-batch.sbatch

#SBATCH --gres=gpu:1              # Request 1 GPU
#SBATCH --gpus-per-node=1         # Number of GPUs per node
#SBATCH --partition=gpu_requeue   # Partition name for GPU jobs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue
#SBATCH --mail-user=siacus@iq.harvard.edu
#SBATCH --mail-type=FAIL          # Send email to above address at begin and end of job
#SBATCH --cpus-per-task=2         # Request 2 CPU core
#SBATCH --constraint="h100|a100"  # do not change this !!!!
#SBATCH --mem=25G
#SBATCH -x holygpu7c0920,holygpu8a25104,holygpu8a19604

# ANVIL: Manage processing environment, load compilers and applications.
# module purge
#
# Load necessary modules
# module load modtree/gpu  # default gcc and cuda version too old
# module load cuda/11      # the version of cuda and gcc shold match on this cluster
# module load gcc/11
# module load anaconda
# module list
# conda activate jago

# FASRC:
module load nvhpc/23.7-fasrc01
module load cuda/12.2.0-fasrc01
module load gcc/12.2.0-fasrc01

# Print the hostname of the compute node on which this job is running.
hostname

cd /n/netscratch/siacus_lab/Lab/scripts
/n/home11/siacus/miniconda3/envs/cuda/bin/python classify-fasrc.py ${1}

While the individual priority of the gpu_requeue jobs is higher than that of the seas_gpu job, seas_gpu has a higher partition priority (PriorityTier) than gpu_requeue and thus should preempt regardless of individual job priority. I don't see anything per se that would preclude preempting the gpu_requeue job. I tried dialing up the Backfill debugging, but it didn't give me any more detail on why those specific nodes could not be used other than that they were busy. Any insight into why this is happening would be appreciated. This could be a weird edge case in the scheduler that needs to be remediated.
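For reference, here is roughly what I would expect the relevant preemption settings to look like for this to work. This is only an illustrative sketch, not a copy of our actual slurm.conf (which is attached below); it assumes partition-tier preemption via preempt/partition_prio and the node lists are abbreviated:

    # Illustrative slurm.conf excerpt (assumed, not our real config; node lists shortened)
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    # Lower tier: jobs here can be requeued when a higher tier needs the nodes
    PartitionName=gpu_requeue Nodes=holygpu8a[16101-16104] PriorityTier=2 PreemptMode=REQUEUE MaxTime=3-00:00:00
    # Higher tier: should reclaim nodes from gpu_requeue regardless of job priority
    PartitionName=seas_gpu    Nodes=holygpu8a[16101-16104] PriorityTier=3 PreemptMode=OFF     MaxTime=7-00:00:00

With that kind of setup, a pending seas_gpu job should trigger a requeue of gpu_requeue jobs on the nodes it needs, which is exactly what the "preempted JobId=... has been requeued" lines show. The puzzle is that the backfill start attempt still fails with "Requested nodes are busy" right after the preemption.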
Created attachment 41238 [details] Current slurm.conf
Created attachment 41239 [details] Current topology.conf
Hello Paul,

I'm still digging through all of the info you have already given, but would it be possible to get the slurmd log for the "busy" node the seas_gpu job won't run on too? Looks like it is holygpu8a18104.

-- Will
Created attachment 41240 [details] slurmd log for holygpu8a18104 for 03-24-25 9-10am
Yup, I've attached it for the hour around that log snippet. Happy to pull more.

-Paul Edmon-
Thank you for the logs. On a first pass it looks like you have a lot of errors related to AcctGatherInterconnectType[1] in your slurm.conf, but I expect that is probably unrelated to the current issue.

> cat holygpu8a18104-03-24-25.log | grep slurm | grep error | cut -d' ' -f7- | sed 's/JOB [[:digit:]]\+ /JOB JOBID /' | sed 's/09:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]/TIME/' | sort | uniq -c | sort -n | tail -n5
>       1 stepd_cleanup: done with step (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
>      50 stepd_cleanup: done with step (rc[0xf]:Block device required, cleanup_rc[0x0]:No error)
>     111 debug levels are stderr='error', logfile='info', syslog='verbose'
>     111 error: *** JOB JOBID ON holygpu8a18104 CANCELLED AT 2025-03-24TTIME DUE TO PREEMPTION ***
>     222 error: TRES ic/sysfs not configured

I will let you know when I have a better idea of what is going on.

-- Will

[1]: https://slurm.schedmd.com/slurm.conf.html#OPT_acct_gather_interconnect/sysfs
Thanks for the tip about sysfs. I totally missed that setting. I will get that fixed.

-Paul Edmon-
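P.S. For the record, my read of the docs is that with AcctGatherInterconnectType=acct_gather_interconnect/sysfs the matching TRES also has to be listed in AccountingStorageTRES, so the fix should be something along these lines (untested on our side as of this comment, and the TRES list shown is just a placeholder for whatever we already have configured):

    # Assumed fix for "TRES ic/sysfs not configured" (sketch, not yet applied)
    AcctGatherInterconnectType=acct_gather_interconnect/sysfs
    AccountingStorageTRES=gres/gpu,ic/sysfs   # keep the existing TRES entries; ic/sysfs is the addition

followed by a restart of slurmctld so the new TRES gets picked up, if I remember the docs correctly.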
Hello,

I'm having trouble reproducing this. Would it be possible to set DebugFlags=Backfill[1] and SlurmctldDebug=debug3[2] for at least the duration of a few of these cycles and share the slurmctld.log? I'm hoping this will help me narrow down the search space for reproducing this locally.

-- Will

[1]: https://slurm.schedmd.com/slurm.conf.html#OPT_Backfill
[2]: https://slurm.schedmd.com/slurm.conf.html#OPT_debug3
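P.S. If it is more convenient, I believe both settings can also be raised on the fly without editing slurm.conf or restarting slurmctld, e.g.:

    scontrol setdebugflags +Backfill
    scontrol setdebug debug3
    # ... capture a few backfill cycles in slurmctld.log, then revert:
    scontrol setdebugflags -Backfill
    scontrol setdebug info    # or whatever your normal SlurmctldDebug level is

These scontrol changes only affect the running daemon and do not persist across a restart.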
Sure, I can set it next time this happens. The issue is definitely sensitive to cluster state, as it's not happening right now. However, when I see a job in this state again I will hike the debugging and then dump the slurmctld log for you.

-Paul Edmon-