Created attachment 16123 [details]
slurm log files

We have a job that is getting starved - 517033. It is constantly getting rescheduled. Please find the attached Slurm logs.
The slurmsched.log file suggests that it isn't running because of an accounting limit:

sched: [2020-10-01T09:02:27.580] JobId=517033 delayed for accounting policy

You should be able to see which accounting limit is hit with the command "squeue" or "scontrol show job 517033".

I can't find out anything else from what you've provided. The slurmctld log level is too low, so it isn't showing which accounting policy check failed. The scontrol-show-job.rtf is garbled - looking at it directly with a text editor shows garbage, and using "unrtf --text scontrol-show-job.rtf" also shows garbage.

Can you do the following?

- Increase the slurmctld debug level to "debug" by running:
    scontrol setdebug debug
- Set the "backfill" debug flag by running:
    scontrol setdebugflags +backfill
- Wait for 10 minutes.
- Reset the slurmctld debug level back to what you want. I recommend setting it to "verbose" or "debug". The output level of "info" is often not verbose enough to get useful information, but you also don't want too much logging; "verbose" and "debug" tend to be a good middle ground.
    scontrol setdebug verbose
  (or info, or just leave it at debug)
- Remove the "backfill" debug flag:
    scontrol setdebugflags -backfill
- Run:
    scontrol -d show job 517033
    squeue -a
- Upload the output of these commands, the slurmctld log file, and the slurmctld sched log file in plain text (not RTF, since I'm having trouble with it).
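For convenience, the steps above could be wrapped in a small script. This is only a sketch - the run() wrapper is mine, not a Slurm tool, and it just records and echoes each command so you can dry-run it first; change it to execute "$@" when you actually want to run the sequence.

```shell
#!/bin/sh
# Sketch of the debug-capture procedure above (run() is my own wrapper).
# It only records and echoes each command so this can be dry-run safely;
# replace the body with: "$@"  to execute for real.
cmds=""
run() {
    cmds="${cmds}$* ; "
    echo "would run: $*"
}

run scontrol setdebug debug           # raise slurmctld log level
run scontrol setdebugflags +backfill  # add backfill scheduler detail
run sleep 600                         # wait ~10 minutes for sched cycles
run scontrol setdebug verbose         # back to an everyday log level
run scontrol setdebugflags -backfill  # drop the backfill flag
run scontrol -d show job 517033       # capture the job's full state
run squeue -a                         # and the whole queue
```

Run it once as-is to see the exact command sequence, then swap in real execution during a quiet period so the extra logging window stays short.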
For reference, here's what I see when trying to look at the scontrol-show-job.rtf file:

$ unrtf scontol-show-job.rtf

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<!-- Translation from RTF performed by UnRTF, version 0.21.9 -->
</head>
<body>[several hundred bytes of unreadable binary garbage]</body>
</html>
Created attachment 16143 [details]
Slurm logs - recent
(In reply to Marshall Garey from comment #1)
> The slurmsched.log file suggests that it isn't running because of an
> accounting limit:
>
> sched: [2020-10-01T09:02:27.580] JobId=517033 delayed for accounting policy
> [...]
> - Upload the output of these commands and the slurmctld log file and
>   slurmctld sched log file in plain text (not rtf, since I'm having trouble
>   with it).

Hi,

Please find the attached new set of log files in plain text format.
From scontrol show job:

JobId=517033 JobName=lstestSCRIPT
   UserId=asndcy(1001) GroupId=analyst(10000) MCS_label=N/A
   Priority=9141 Nice=0 Account=users QOS=bigmem
   JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
   ...
   Partition=dmc-sr950,knl,class,benchmark,dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,gpu_volta,dmc-skylake
   AllocNode:Sid=dmcvlogin3:15595
   ...
   TRES=cpu=1,mem=500000M,node=1,billing=1

The "Reason=QOSMaxMemoryPerJob" tells me that QOS "bigmem" has a MaxMemPerJob limit that is smaller than 500000M. Is that correct? That explains why the job isn't getting scheduled. Note that this job isn't the only job that is pending for this reason (see job 525608).

You can check QOS bigmem with:

sacctmgr show qos bigmem

and you can format this to show specific fields with the "format" option, like this:

sacctmgr show qos bigmem format=maxtresperjob
(In reply to Marshall Garey from comment #5)
> The "Reason=QOSMaxMemoryPerJob" tells me that QOS "bigmem" has a limit on
> MaxMemPerJob that is smaller than 500000M. Is that correct?
> [...]
> sacctmgr show qos bigmem format=maxtresperjob

Here is the output of:

sacctmgr show qos bigmem format=maxtresperjob%100

                                              MaxTRES
-----------------------------------------------------
                                      cpu=32,mem=500G
Does this job use --hint=nomultithread? Or the environment variable SLURM_HINT=nomultithread?
(In reply to Marshall Garey from comment #7)
> Does this job use --hint=nomultithread? Or the environment variable
> SLURM_HINT=nomultithread?

No, the job is not using the --hint flag, and the SLURM_HINT environment variable was not used.
Does this job use --mem-per-cpu? I can reproduce it by using --mem-per-cpu:

$ sacctmgr show qos maxmem format=name,maxtresperjob
      Name       MaxTRES
---------- -------------
    maxmem     mem=2000M

$ srun --qos=maxmem --mem-per-cpu=2000 -c1 whereami
srun: job 35485 queued and waiting for resources

$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 35483     debug whereami marshall PD  0:00     1 (QOSMaxMemoryPerJob)

The reason this happens is that, even though I'm requesting 1 CPU, I have 2 threads per core, so the job ends up with both threads on the core (2 CPUs) and Slurm multiplies 2 CPUs by --mem-per-cpu.
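To make that accounting concrete, here is a sketch of the arithmetic (the variable names are mine for illustration, not Slurm internals):

```shell
#!/bin/sh
# Illustration of the accounting check described above; variable names
# are hypothetical, not Slurm internals.
mem_per_cpu=2000                         # --mem-per-cpu=2000 (MB)
threads_per_core=2                       # hardware threads per core
cpus_charged=$((1 * threads_per_core))   # 1 core requested -> both threads allocated
job_mem=$((cpus_charged * mem_per_cpu))  # memory Slurm accounts to the job
qos_max_mem=2000                         # QOS maxmem: mem=2000M

echo "job is charged ${job_mem}M against a ${qos_max_mem}M limit"
if [ "$job_mem" -gt "$qos_max_mem" ]; then
    echo "QOSMaxMemoryPerJob: job stays pending"
fi
```

So the 2000M-per-CPU request becomes a 4000M job against a 2000M QOS limit, which is exactly the pending reason shown in squeue.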
Are you still able to reproduce this issue? If you can easily reproduce it, then can you do the following for me? It's basically the same as before, but this time we're catching the job submission as well as getting more detailed logging, which should give more clues as to why the job isn't running.

1. scontrol setdebug debug3
2. Submit the job. Save the exact job submission command and the job submission script (if you're using sbatch).
3. Make the job the highest-priority job in the queue. You can do this as a Slurm administrator, something like this:

       scontrol update jobid=35496 priority=1000000

   You can use the "sprio" command to see what the priorities of the jobs in the queue are, so you know what number to use for priority.
4. Wait for 10 minutes.
5. scontrol setdebug info
6. Upload the slurmctld log file and the job submission command (and job script if using sbatch).

Thanks
I forgot to mention that I'd also like to follow up on my previous comment - comment 9. Sorry for the extra email.

(In reply to Marshall Garey from comment #9)
> Does this job use --mem-per-cpu? I can reproduce it by using --mem-per-cpu:
> [...]
> The reason this happens is that, even though I'm requesting 1 CPU, I have 2
> threads per core, so the job ends up with both threads on the core (2 CPUs)
> and Slurm multiplies 2 CPUs by --mem-per-cpu.
Created attachment 16457 [details]
Slurmctld-log-latest
(In reply to Marshall Garey from comment #10)
> Are you still able to reproduce this issue? If you can easily reproduce it,
> then can you do the following for me?
> [...]
> 6. Upload the slurmctld log file and the job submission command (and job
>    script if using sbatch).

Please find the attached slurmctld.log file. Here are the job submit flags:

sbatch --qos=bigmem -J lstestSCRIPT --begin=2020-11-02T14:04:23 --requeue \
    --mail-user=dyoung@asc.edu -o lstestSCRIPT.o$SLURM_JOB_ID \
    --mail-type=FAIL,END,TIME_LIMIT -t 360:00:00 -N 1-1 -n 1 \
    --mem-per-cpu=500000mb --constraint=dmc

We have a large memory node (dmc53) with ThreadsPerCore=2 enabled in slurm.conf. During our upcoming maintenance window, I can change this to ThreadsPerCore=1, if that helps.
Here is the output from "scontrol show job 534837":

JobId=534837 JobName=lstestSCRIPT
   UserId=asndcy(1001) GroupId=analyst(10000) MCS_label=N/A
   Priority=1000000 Nice=0 Account=users QOS=bigmem
   JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-02T14:04:32 EligibleTime=2020-11-02T14:04:32
   AccrueTime=2020-11-02T14:04:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-02T14:40:50
   Partition=dmc-sr950,knl,class,benchmark,dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,gpu_volta,dmc-skylake
   AllocNode:Sid=dmcvlogin2:5049
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500000M MinTmpDiskNode=0
   Features=dmc DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/beegfs/home/asndcy
   StdErr=/mnt/beegfs/home/asndcy/lstestSCRIPT.o534837
   StdIn=/dev/null
   StdOut=/mnt/beegfs/home/asndcy/lstestSCRIPT.o534837
   Power=
   MailUser=dyoung@asc.edu MailType=END,FAIL,TIME_LIMIT
Thanks for that information. I can confirm that this issue is happening because of the interaction between ThreadsPerCore=2 and --mem-per-cpu. Changing the node to ThreadsPerCore=1 does make the job run. I know that's not a fix for the bug, but it is a workaround.

I might have suggested using --hint=nomultithread as a way to tell Slurm to only use one hardware thread in the core, but Slurm would still charge for twice as much memory at the moment, so the job would still be pending.

This is closely related to bug 9153. So, if the workaround of changing to ThreadsPerCore=1 is good enough for you, I'll mark this bug as a duplicate of 9153.
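For the numbers in this ticket, the arithmetic works out like this (a sketch with my own variable names; I'm assuming the 500G limit converts as 1G = 1024M, the usual Slurm convention):

```shell
#!/bin/sh
# Sketch of the effective memory request for job 534837.
# Variable names are mine, and the 1G = 1024M conversion is assumed.
mem_per_cpu=500000                                  # --mem-per-cpu=500000mb
threads_per_core=2                                  # ThreadsPerCore=2 on the node
effective_mem=$((mem_per_cpu * threads_per_core))   # both threads get charged
qos_limit=$((500 * 1024))                           # bigmem MaxTRES mem=500G, in MB

echo "effective request: ${effective_mem}M, QOS limit: ${qos_limit}M"
# With ThreadsPerCore=2 the job is accounted 1000000M > 512000M, so it
# hits QOSMaxMemoryPerJob; with ThreadsPerCore=1 the request stays at
# 500000M, which is under the limit, so the job can run.
```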
I'm closing this as a duplicate of bug 9724. I recently found that bug 9153 is also a duplicate of bug 9724, and we have potential fixes being reviewed in 9724.

*** This ticket has been marked as a duplicate of ticket 9724 ***