Created attachment 16123 [details]
slurm log files

We have a job that is getting starved - 517033. It is constantly getting rescheduled. Please find the attached Slurm logs.
The slurmsched.log file suggests that it isn't running because of an accounting limit:

sched: [2020-10-01T09:02:27.580] JobId=517033 delayed for accounting policy

You should be able to see which accounting limit is hit with the command "squeue" or "scontrol show job 517033".

I can't find out anything else from what you've provided. The slurmctld log level is too low, so it isn't showing which accounting policy check failed. The scontrol-show-job.rtf is garbled - looking at it directly with a text editor shows garbage, and using "unrtf --text scontrol-show-job.rtf" also shows garbage.

Can you do the following?

- Increase the slurmctld debug level to "debug" by running:
    scontrol setdebug debug
- Set the "backfill" debug flag by running:
    scontrol setdebugflags +backfill
- Wait for 10 minutes.
- Reset the slurmctld debug level back to what you want. I recommend setting it to "verbose" or "debug". The output level of "info" is often not verbose enough to get useful information, but you also don't want too much logging; "verbose" and "debug" tend to be a good middle ground.
    scontrol setdebug verbose
  (or info, or just leave it at debug)
- Remove the "backfill" debug flag:
    scontrol setdebugflags -backfill
- Run:
    scontrol -d show job 517033
    squeue -a
- Upload the output of these commands, the slurmctld log file, and the slurmctld sched log file in plain text (not RTF, since I'm having trouble with it).
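For convenience, the steps above could be wrapped in a small script. This is only a sketch - the run() wrapper is mine, not a Slurm tool, and it just records and echoes each command so you can dry-run it first; change it to execute "$@" when you actually want to run the sequence.

```shell
#!/bin/sh
# Sketch of the debug-capture procedure above (run() is my own wrapper).
# It only records and echoes each command so this can be dry-run safely;
# replace the body with: "$@"  to execute for real.
cmds=""
run() {
    cmds="${cmds}$* ; "
    echo "would run: $*"
}

run scontrol setdebug debug           # raise slurmctld log level
run scontrol setdebugflags +backfill  # add backfill scheduler detail
run sleep 600                         # wait ~10 minutes for sched cycles
run scontrol setdebug verbose         # back to an everyday log level
run scontrol setdebugflags -backfill  # drop the backfill flag
run scontrol -d show job 517033       # capture the job's full state
run squeue -a                         # and the whole queue
```

Run it once as-is to see the exact command sequence, then swap in real execution during a quiet period so the extra logging window stays short.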
For reference, here's what I see when trying to look at the scontrol-show-job.rtf file:

$ unrtf scontol-show-job.rtf

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<!-- Translation from RTF performed by UnRTF, version 0.21.9 -->
</head>
<body>[several hundred bytes of unreadable binary garbage]</body>
</html>
Created attachment 16143 [details]
Slurm logs - recent
(In reply to Marshall Garey from comment #1)
> The slurmsched.log file suggests that it isn't running because of an
> accounting limit:
>
> sched: [2020-10-01T09:02:27.580] JobId=517033 delayed for accounting policy
> [...]
> - Upload the output of these commands and the slurmctld log file and
>   slurmctld sched log file in plain text (not rtf, since I'm having trouble
>   with it).

Hi,

Please find the attached new set of log files in plain text format.
From scontrol show job:

JobId=517033 JobName=lstestSCRIPT
   UserId=asndcy(1001) GroupId=analyst(10000) MCS_label=N/A
   Priority=9141 Nice=0 Account=users QOS=bigmem
   JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
   ...
   Partition=dmc-sr950,knl,class,benchmark,dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,gpu_volta,dmc-skylake
   AllocNode:Sid=dmcvlogin3:15595
   ...
   TRES=cpu=1,mem=500000M,node=1,billing=1

The "Reason=QOSMaxMemoryPerJob" tells me that QOS "bigmem" has a MaxMemPerJob limit that is smaller than 500000M. Is that correct? That explains why the job isn't getting scheduled. Note that this job isn't the only job that is pending for this reason (see job 525608).

You can check QOS bigmem with:

sacctmgr show qos bigmem

and you can format this to show specific fields with the "format" option, like this:

sacctmgr show qos bigmem format=maxtresperjob
(In reply to Marshall Garey from comment #5)
> The "Reason=QOSMaxMemoryPerJob" tells me that QOS "bigmem" has a limit on
> MaxMemPerJob that is smaller than 500000M. Is that correct?
> [...]
> sacctmgr show qos bigmem format=maxtresperjob

Here is the output of:

sacctmgr show qos bigmem format=maxtresperjob%100

                                              MaxTRES
-----------------------------------------------------
                                      cpu=32,mem=500G
Does this job use --hint=nomultithread? Or the environment variable SLURM_HINT=nomultithread?
(In reply to Marshall Garey from comment #7)
> Does this job use --hint=nomultithread? Or the environment variable
> SLURM_HINT=nomultithread?

No, the job is not using the --hint flag, and the SLURM_HINT environment variable was not used.
Does this job use --mem-per-cpu? I can reproduce it by using --mem-per-cpu:

$ sacctmgr show qos maxmem format=name,maxtresperjob
      Name       MaxTRES
---------- -------------
    maxmem     mem=2000M

$ srun --qos=maxmem --mem-per-cpu=2000 -c1 whereami
srun: job 35485 queued and waiting for resources

$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 35483     debug whereami marshall PD  0:00     1 (QOSMaxMemoryPerJob)

The reason this happens is that, even though I'm requesting 1 CPU, I have 2 threads per core, so the job ends up with both threads on the core (2 CPUs) and Slurm multiplies 2 CPUs by --mem-per-cpu.
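To make that accounting concrete, here is a sketch of the arithmetic (the variable names are mine for illustration, not Slurm internals):

```shell
#!/bin/sh
# Illustration of the accounting check described above; variable names
# are hypothetical, not Slurm internals.
mem_per_cpu=2000                         # --mem-per-cpu=2000 (MB)
threads_per_core=2                       # hardware threads per core
cpus_charged=$((1 * threads_per_core))   # 1 core requested -> both threads allocated
job_mem=$((cpus_charged * mem_per_cpu))  # memory Slurm accounts to the job
qos_max_mem=2000                         # QOS maxmem: mem=2000M

echo "job is charged ${job_mem}M against a ${qos_max_mem}M limit"
if [ "$job_mem" -gt "$qos_max_mem" ]; then
    echo "QOSMaxMemoryPerJob: job stays pending"
fi
```

So the 2000M-per-CPU request becomes a 4000M job against a 2000M QOS limit, which is exactly the pending reason shown in squeue.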
Are you still able to reproduce this issue? If you can easily reproduce it, then can you do the following for me? It's basically the same as before, but this time we're catching the job submission as well as getting more detailed logging, which should give more clues as to why the job isn't running.

1. scontrol setdebug debug3
2. Submit the job. Save the exact job submission command and the job submission script (if you're using sbatch).
3. Make the job the highest-priority job in the queue. You can do this as a Slurm administrator, something like this:

       scontrol update jobid=35496 priority=1000000

   You can use the "sprio" command to see what the priorities of the jobs in the queue are, so you know what number to use for priority.
4. Wait for 10 minutes.
5. scontrol setdebug info
6. Upload the slurmctld log file and the job submission command (and job script if using sbatch).

Thanks
I forgot to mention that I'd also like to follow up on my previous comment - comment 9. Sorry for the extra email.

(In reply to Marshall Garey from comment #9)
> Does this job use --mem-per-cpu? I can reproduce it by using --mem-per-cpu:
> [...]
> The reason this happens is that, even though I'm requesting 1 CPU, I have 2
> threads per core, so the job ends up with both threads on the core (2 CPUs)
> and Slurm multiplies 2 CPUs by --mem-per-cpu.
Created attachment 16457 [details]
Slurmctld-log-latest
(In reply to Marshall Garey from comment #10)
> Are you still able to reproduce this issue? If you can easily reproduce it,
> then can you do the following for me?
> [...]
> 6. Upload the slurmctld log file and the job submission command (and job
>    script if using sbatch).

Please find the attached slurmctld.log file. Here are the job submit flags:

sbatch --qos=bigmem -J lstestSCRIPT --begin=2020-11-02T14:04:23 --requeue \
    --mail-user=dyoung@asc.edu -o lstestSCRIPT.o$SLURM_JOB_ID \
    --mail-type=FAIL,END,TIME_LIMIT -t 360:00:00 -N 1-1 -n 1 \
    --mem-per-cpu=500000mb --constraint=dmc

We have a large memory node (dmc53) with ThreadsPerCore=2 enabled in slurm.conf. During our upcoming maintenance window, I can change this to ThreadsPerCore=1, if that helps.
Here is the output from "scontrol show job 534837":

JobId=534837 JobName=lstestSCRIPT
   UserId=asndcy(1001) GroupId=analyst(10000) MCS_label=N/A
   Priority=1000000 Nice=0 Account=users QOS=bigmem
   JobState=PENDING Reason=QOSMaxMemoryPerJob Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-02T14:04:32 EligibleTime=2020-11-02T14:04:32
   AccrueTime=2020-11-02T14:04:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-02T14:40:50
   Partition=dmc-sr950,knl,class,benchmark,dmc-ivy-bridge,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,gpu_volta,dmc-skylake
   AllocNode:Sid=dmcvlogin2:5049
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500000M MinTmpDiskNode=0
   Features=dmc DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/beegfs/home/asndcy
   StdErr=/mnt/beegfs/home/asndcy/lstestSCRIPT.o534837
   StdIn=/dev/null
   StdOut=/mnt/beegfs/home/asndcy/lstestSCRIPT.o534837
   Power=
   MailUser=dyoung@asc.edu MailType=END,FAIL,TIME_LIMIT
Thanks for that information. I can confirm that this issue is happening because of the interaction between ThreadsPerCore=2 and --mem-per-cpu. Changing the node to ThreadsPerCore=1 does make the job run. I know that's not a fix for the bug, but it is a workaround.

I might have suggested using --hint=nomultithread as a way to tell Slurm to only use one hardware thread in the core, but Slurm would still charge for twice as much memory at the moment, so the job would still be pending.

This is closely related to bug 9153. So, if the workaround of changing to ThreadsPerCore=1 is good enough for you, I'll mark this bug as a duplicate of 9153.
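For the numbers in this ticket, the arithmetic works out like this (a sketch with my own variable names; I'm assuming the 500G limit converts as 1G = 1024M, the usual Slurm convention):

```shell
#!/bin/sh
# Sketch of the effective memory request for job 534837.
# Variable names are mine, and the 1G = 1024M conversion is assumed.
mem_per_cpu=500000                                  # --mem-per-cpu=500000mb
threads_per_core=2                                  # ThreadsPerCore=2 on the node
effective_mem=$((mem_per_cpu * threads_per_core))   # both threads get charged
qos_limit=$((500 * 1024))                           # bigmem MaxTRES mem=500G, in MB

echo "effective request: ${effective_mem}M, QOS limit: ${qos_limit}M"
# With ThreadsPerCore=2 the job is accounted 1000000M > 512000M, so it
# hits QOSMaxMemoryPerJob; with ThreadsPerCore=1 the request stays at
# 500000M, which is under the limit, so the job can run.
```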
I'm closing this as a duplicate of bug 9724. I recently found that bug 9153 is also a duplicate of bug 9724, and we have potential fixes being reviewed in 9724.

*** This ticket has been marked as a duplicate of ticket 9724 ***