| Summary: | Unexpected timeslicing (scontrol suspend/resume will cause timeslicing to start) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | Configuration | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | Normal | CC: | brian, da |
| Version: | 15.08.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.2 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm.conf, slurmctld log | | |
Description
Sergey Meirovich
2015-10-15 07:55:58 MDT
SchedMD: Sergey, are you manually suspending jobs? If so, I can reproduce this, and I will look into how it is happening. I would suggest that you not manually suspend jobs, though, and instead use the partitions to do the preempting for you. I am fairly sure you can use backfill here to accomplish your goals. I believe the document you are looking at is incorrect; I will verify and let you know. Could you send your slurmctld.log? I also have a few suggestions/changes for your slurm.conf. Please open a new ticket and we can discuss them there.

Sergey Meirovich: Created attachment 2308 [details]
slurmctld log
Sergey Meirovich: We had not manually suspended any job during the times we observed this. We want to use exactly partition preemption. I am not sure I am following you. Do you mean that scontrol suspend/scontrol resume somehow enables "local backfilling" for unrelated jobs (ones that were not part of the suspend/resume)?

SchedMD: In my testing, scontrol suspend/resume would cause timeslicing to start. It shouldn't happen if you only let Slurm handle the suspending. Are you saying you did not run scontrol suspend/resume? I only mentioned backfill because the document you referenced said you needed to use the builtin plugin when doing gang scheduling. Backfill isn't being turned on locally, nor is that possible.

Sergey Meirovich: We have already submitted a ticket about the configuration suggestions: http://bugs.schedmd.com/show_bug.cgi?id=2016

- Some suggestions are applicable to us right now (e.g. the timeout settings).
- Some are not, as we are using Bright Cluster Manager, and some artifacts and tautologies in the configs are due to it.
- Features: we are using them to allow users to restrict the selection of nodes. This is a prototype; the real cluster will have different CPUs, different memory sizes on board, etc. It is not an elegant solution, and partitions would probably work better here, but our business unit mandated that style.
- Cgroups: probably in the future. We are running RHEL6 now and saw similar issues (http://bugs.schedmd.com/show_bug.cgi?id=1410); cgroups support in RHEL6 is too rudimentary for me. We are going to upgrade to RHEL7 in the middle term, so cgroups are postponed for now.
- "You have "PreemptMode=suspend,gang" configured, but in order to suspend jobs, you'll need sufficient memory on the nodes for multiple jobs": that is fine. Our users are lazy, don't like to annotate each and every job with --mem, and are not ready to accept the consequences of not specifying --mem. So in the short term we are ready to observe the occasional oom_kill.
- We might buy an additional node for the Bright manager; that would be the moment when the BackupController suggestion could be useful.

Sergey Meirovich: We did indeed use scontrol suspend a few times. But why should it enable timeslicing for unrelated jobs? Some of those jobs even started a long time after the manual suspend/resume event. That smells like a bug to me...

Sergey Meirovich: If I remember correctly, I restarted slurmctld and all slurmds a few times after I had used scontrol suspend/resume. That should be visible in the slurmctld logs...

SchedMD: I would update that bug with these concerns so they can be dealt with. I see it is marked resolved, and that does not appear to be the case.

SchedMD: Agreed, it appears to be unexpected; we are looking into it. The problem should go away if you refrain from the practice, though :).

SchedMD: It doesn't matter; any manual intervention will set you on this course for the affected jobs, with no reversal.

Sergey Meirovich: Hmm,
Indeed I had used suspend some time before I observed the issue.
[2015-10-15T09:12:40.137] Processing RPC: REQUEST_SUSPEND(suspend) from uid=0
[2015-10-15T09:12:40.137] _slurm_rpc_suspend(suspend) for 1103 usec=7150
But jobid=1103 was even running on other hosts (not the host where we observed the time-slicing):
[root@DCA-DELL-HPC-HEAD ~]# squeue -j 1103
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1103 normal wrap e153547 R 6-06:07:55 2 DCA-DELL-HPC[03-04]
[root@DCA-DELL-HPC-HEAD ~]#
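For anyone auditing their own controller logs for such events, manual suspend/resume requests appear as RPC lines like the two quoted above. A minimal sketch, here run against a copy of the two lines from this ticket (on a real system you would point grep at whatever path SlurmctldLogFile in slurm.conf names):

```shell
# Write the two slurmctld.log lines quoted above to a temp file;
# on a real system, grep your SlurmctldLogFile path instead.
log=$(mktemp)
cat > "$log" <<'EOF'
[2015-10-15T09:12:40.137] Processing RPC: REQUEST_SUSPEND(suspend) from uid=0
[2015-10-15T09:12:40.137] _slurm_rpc_suspend(suspend) for 1103 usec=7150
EOF

# Each manual "scontrol suspend" shows up as a REQUEST_SUSPEND RPC.
grep -c 'REQUEST_SUSPEND' "$log"   # prints 1
rm -f "$log"
```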
Again, it sounds like a bug to me when a manual suspend starts time-slicing globally behind the scenes.
Shall I resubmit this as another bug? I would appreciate it if you looked at this, as you stated in comment #1.

SchedMD: Just reopen that other bug.

Sergey Meirovich: Sorry, I meant: shall I open another bug for "... scontrol suspend/resume would cause timeslicing to start"?

SchedMD: No, from my viewpoint this is the bug, and I am looking into it as noted in comment 1. Your configuration allows one job per resource (core) from each partition. By suspending a job, you released its cores to be allocated to other jobs. When you resumed that job, you were left with multiple jobs allocated the same cores, which the gang scheduling logic time-sliced, as you observed. This demonstrates one of the dangers of manually suspending/resuming jobs. Rather than suspending/resuming the job, how about sending the job SIGSTOP/SIGCONT instead (using the scancel command)? That does the same thing as suspend/resume, except that it is available to any user and does not release the cores to be allocated to other jobs, which would eliminate the time slicing. Would that satisfy your requirements? The other option would be to drain the resources from the newly started jobs so that they would be available for the resumed job without conflicts.
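The SIGSTOP/SIGCONT approach suggested above can be demonstrated outside of Slurm with any ordinary process. The sketch below (assuming a Linux shell with procps `ps`) uses a plain `sleep` as a stand-in for a job step; with Slurm, the equivalent commands would be `scancel --signal=STOP <jobid>` and `scancel --signal=CONT <jobid>`, which signal the job's processes without releasing its cores:

```shell
# Stand-in for a job step: a long-running background process.
sleep 300 &
pid=$!

# SIGSTOP freezes the process, just as "scancel --signal=STOP <jobid>"
# would freeze a job's processes. Unlike "scontrol suspend", the job's
# cores are never released, so no other job can land on them and no
# time slicing can begin.
kill -STOP "$pid"
sleep 0.2
ps -o stat= -p "$pid"   # state starts with "T" (stopped)

# SIGCONT resumes it, like "scancel --signal=CONT <jobid>".
kill -CONT "$pid"
sleep 0.2
ps -o stat= -p "$pid"   # state starts with "S" (sleeping) again

kill "$pid" 2>/dev/null
```

The trade-off: since the cores stay allocated, they sit idle while the job is stopped, which is exactly what avoids the oversubscription described in this ticket.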
SchedMD: I've added a documentation warning about the dangers of manual job suspend/resume: https://github.com/SchedMD/slurm/commit/2929a2895457f4f60eab29e32ab5d5b0e035d6df

I've also modified the code so that if a job is manually resumed and its resources have already been allocated to other jobs, that job will be started and gang-scheduled with the other jobs; at least the bookkeeping for the jobs will then be correct: https://github.com/SchedMD/slurm/commit/d2d9206047967bd2bd7806d5f39ea7218216972e

These changes will be in version 15.08.2 when released, probably in a week or so. If you want to prevent the gang scheduling, you need to prevent other jobs from being allocated resources from the job that you stop. Using SIGSTOP/SIGCONT to stop the job is probably your best option.

Sergey Meirovich: Hi, has the behavior described in this ticket eventually been addressed by https://github.com/SchedMD/slurm/commit/344d2eab7ee6243b63cb71104af20b04495f7f81.patch ?

SchedMD: (In reply to Sergey Meirovich from comment #18) What that patch does is immediately notify the gang scheduling module that someone manually suspended or resumed a job, which can then trigger an immediate context switch. For example, say job 10 has been allocated CPUs 0-15 on node 1. Once that job has been suspended, its CPUs become available for another job to use. After job 10 is suspended, let's say that job 11 from the same partition gets allocated CPUs 0-7 and job 12 gets allocated CPUs 8-15. Let's then say a sysadmin resumes job 10. If gang scheduling is NOT configured, the kernel will allocate CPU time to the jobs' various (overlapping) processes, likely causing the jobs' run times to increase dramatically and quite possibly causing them to fail by reaching their time limits.
If gang scheduling is enabled, job 10 will time-slice with jobs 11 and 12 until either job 10 ends or jobs 11 and 12 both end. A job's time limit is based only upon the time it spends in the run state (not the time it is suspended by gang scheduling). If the applications have any real parallelism, gang scheduling will get them done faster. In any case, manually suspending and resuming jobs will result in oversubscribed CPUs, and gang scheduling is likely the best way to proceed.
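For context, the gang scheduling behavior discussed throughout this ticket is governed by a handful of slurm.conf settings. The fragment below is an illustrative sketch, not taken from the attached slurm.conf; the node and partition names are placeholders, and `Shared` was renamed `OverSubscribe` in later Slurm releases:

```
# Suspend preempted jobs; gang-schedule any jobs sharing resources
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Seconds each job runs before the gang scheduler rotates (default 30)
SchedulerTimeSlice=30

# Oversubscription must be permitted for time slicing within a partition;
# FORCE:2 lets up to two jobs per resource time-slice.
PartitionName=normal Nodes=node[01-04] Shared=FORCE:2 Default=YES
```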