Ticket 626

Summary: Native SLURM: Suspend/Resume - When a job is queued and waiting for resources, it blocks all other jobs from launching.
Product: Slurm
Reporter: tchoi
Component: Scheduling
Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN
Severity: 2 - High Impact
Priority: ---
CC: da, david.gloe, rgross
Version: 14.03.x
Hardware: Linux
OS: Linux
Site: CRAY
Attachments: slurm.conf, Fix for sharing nodes

Description tchoi 2014-03-05 05:35:53 MST
Created attachment 679 [details]
slurm.conf

Native SLURM: Suspend/Resume - When a job is queued and waiting for resources, it blocks all other jobs from launching.

 snake-p3(nid00018): /tchoi => srun --version
slurm 14.03.0

# Launch first application on nid00024:
 snake-p3(nid00018): /tchoi => srun -n 1 -w nid00024 sleep 1000 &
[1]     16660

squeue -l:
 JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
   117     workq    sleep    tchoi  RUNNING       0:12   1:00:00      1 nid00024


# Launch second application on nid00024 (the same node as first application) 
# without suspending first job:

 snake-p3(nid00018): /tchoi => srun -n 1 -w nid00024 sleep 10000 &
[2]     16677

snake-p3(nid00018): /tchoi => srun: job 118 queued and waiting for resources

 JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
   117     workq    sleep    tchoi  RUNNING       0:21   1:00:00      1 nid00024
   118     workq    sleep    tchoi  PENDING       0:00   1:00:00      1 (Resources)


# Try to launch third application on the other node (nid00025).
# All other jobs stay pending even though they don't request nid00024 (the node used by the first application).

 snake-p3(nid00018): /tchoi => srun -n 1 -w nid00025 sleep 10000 &

squeue -l:
 JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
   117     workq    sleep    tchoi  RUNNING       3:28   1:00:00      1 nid00024
   118     workq    sleep    tchoi  PENDING       0:00   1:00:00      1 (Resources)
   119     workq  corefin    pavek  PENDING       0:00   1:00:00      1 (Priority)
   120     workq    sleep    tchoi  PENDING       0:00   1:00:00      1 (Priority)
Comment 1 David Bigagli 2014-03-05 05:49:14 MST
Hi,
    jobs 119 and 120 are pending with reason (Priority). This is
expected because job 118, which was submitted ahead of them, is
pending while waiting for resources. The backfill scheduler should
dispatch the two pending jobs as soon as it runs. Are they running yet?

Comment 2 David Gloe 2014-03-05 06:10:24 MST
This seems like very inefficient scheduling to me. So if one job is waiting on one node, no other jobs can run on any other node in the system?
Comment 3 Moe Jette 2014-03-05 06:16:52 MST
(In reply to David Gloe from comment #2)
> This seems like very inefficient scheduling to me. So if one job is waiting
> on one node no other jobs can run on any other nodes on the system?

It's FIFO except when the backfill scheduling kicks in (every 30 seconds by default).
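For reference, a minimal slurm.conf sketch of the backfill setup being described (parameter names per the slurm.conf man page; `bf_interval` is an assumption on my part as the knob controlling the backfill cycle, and 30 seconds is its documented default):

```
# Hypothetical slurm.conf fragment: enable the backfill scheduler,
# which re-examines pending jobs every bf_interval seconds.
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30
```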

Is it common that users submit jobs to run on a specific node?
That's the root cause of this delay.
Comment 4 David Gloe 2014-03-05 06:33:11 MST
(In reply to Moe Jette from comment #3)
> (In reply to David Gloe from comment #2)
> > This seems like very inefficient scheduling to me. So if one job is waiting
> > on one node no other jobs can run on any other nodes on the system?
> 
> It's FIFO except when the backfill scheduling kicks in (every 30 seconds by
> default).
> 
> Is it common that users submit jobs to run on a specific node?
> That's the root cause of this delay.

These jobs are staying pending for much longer than 30 seconds. Tom reported the issue at 1:40 and they were still pending when I looked at it ~2:10. I have another one now that's been pending for 15 minutes.

Perhaps we have a bad backfill configuration?
Comment 5 tchoi 2014-03-05 06:45:27 MST
We tried to run the first two jobs on the same node.
For example, the first job is running on nid00024.
Then we try to launch a second job on the same node, nid00024, without suspending the first job. The second job is now pending until the first job is suspended.
Then we try to launch third and fourth jobs on different nodes (i.e. nid00025, nid00026) while the second job is pending. All the other jobs (third and fourth) stay pending until the second job runs or is cancelled. This is a bug.
Comment 6 David Gloe 2014-03-06 05:48:44 MST
I tried this today on another internal Slurm system and the backfill scheduler worked as designed, placing the job ~30s after it was submitted.

On that system we have
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill

Perhaps the SchedulerTimeSlice was set incorrectly on snake-p3, or not set at all?
Unfortunately snake-p3 is down now so I can't check.
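Once snake-p3 is back up, one way to dump the live scheduler settings would be something like the following (standard Slurm CLI; exact output field names may vary by version):

```
# Query the running slurmctld for its scheduler configuration:
scontrol show config | grep -i 'Scheduler'
```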
Comment 7 Moe Jette 2014-03-06 06:24:39 MST
Created attachment 682 [details]
Fix for sharing nodes

The root problem here is the same one reported by Jim Norby and should be fixed by the attached patch.
Comment 8 Danny Auble 2014-03-20 06:45:00 MDT
David, can you please verify this works so we can close the bug?
Comment 9 David Bigagli 2014-03-31 07:58:49 MDT
Closing, please reopen if necessary.

David