Created attachment 2307 [details]
slurm.conf

Hi,

We don't want time-slicing to happen, so we set SchedulerType=sched/builtin (per http://slurm.schedmd.com/gang_scheduling.html: "...Without timeslicing and without the backfill scheduler enabled, job 14 has to wait for job 13 to finish."). However, time-slicing is still happening for us. See below:

==========================================================================================
[root@DCA-DELL-HPC-HEAD ~]# date; squeue -j 1420,1119,1118,1117
Thu Oct 15 14:35:36 CDT 2015
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1420    normal     wrap  e153547  R      44:17      1 DCA-DELL-HPC08
  1119    normal s02c_del  e116763  S 5-21:10:27      1 DCA-DELL-HPC08
  1118    normal s02b_del  e116763  S 5-21:10:32      1 DCA-DELL-HPC08
  1117    normal s02a_del  e116763  S 5-21:10:42      1 DCA-DELL-HPC08
[root@DCA-DELL-HPC-HEAD ~]# date; squeue -j 1420,1119,1118,1117
Thu Oct 15 14:35:45 CDT 2015
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1420    normal     wrap  e153547  R      44:26      1 DCA-DELL-HPC08
  1119    normal s02c_del  e116763  S 5-21:10:27      1 DCA-DELL-HPC08
  1118    normal s02b_del  e116763  S 5-21:10:32      1 DCA-DELL-HPC08
  1117    normal s02a_del  e116763  S 5-21:10:42      1 DCA-DELL-HPC08
[root@DCA-DELL-HPC-HEAD ~]# date; squeue -j 1420,1119,1118,1117
Thu Oct 15 14:35:53 CDT 2015
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1119    normal s02c_del  e116763  R 5-21:10:31      1 DCA-DELL-HPC08
  1118    normal s02b_del  e116763  R 5-21:10:36      1 DCA-DELL-HPC08
  1117    normal s02a_del  e116763  R 5-21:10:46      1 DCA-DELL-HPC08
  1420    normal     wrap  e153547  S      44:30      1 DCA-DELL-HPC08
[root@DCA-DELL-HPC-HEAD ~]# scontrol show jobid=1420
JobId=1420 JobName=wrap
   UserId=e153547(25928) GroupId=users(2080)
   Priority=3432757 Nice=0 Account=e153547_definite QOS=qchem
   JobState=SUSPENDED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:45:30 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-10-15T13:06:55 EligibleTime=2015-10-15T13:06:55
   StartTime=2015-10-15T13:06:57 EndTime=Unknown
   PreemptTime=None SuspendTime=2015-10-15T14:37:49 SecsPreSuspend=2730
   Partition=normal AllocNode:Sid=DCA-DELL-HPC-HEAD:14440
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=DCA-DELL-HPC08
   BatchHost=DCA-DELL-HPC08
   NumNodes=1 NumCPUs=36 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=36,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/cae/e153547/TaN/TaN/101/N-term
   StdErr=/cae/e153547/TaN/TaN/101/N-term/slurm-1420.out
   StdIn=/dev/null
   StdOut=/cae/e153547/TaN/TaN/101/N-term/slurm-1420.out
   Power= SICP=0
[root@DCA-DELL-HPC-HEAD ~]# scontrol show jobid=1117
JobId=1117 JobName=s02a_dell
   UserId=e116763(4996) GroupId=users(2080)
   Priority=3433358 Nice=0 Account=e116763_definite QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=5-21:11:58 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-10-09T16:37:24 EligibleTime=2015-10-09T16:37:24
   StartTime=2015-10-09T16:37:24 EndTime=Unknown
   PreemptTime=None SuspendTime=2015-10-15T14:37:49 SecsPreSuspend=508302
   Partition=normal AllocNode:Sid=DCA-DELL-HPC-HEAD:28139
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=DCA-DELL-HPC08
   BatchHost=DCA-DELL-HPC08
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./s02a_dell_aug14.exe
   WorkDir=/cae/e116763/cases/shamash/s02a_dell
   Power= SICP=0
[root@DCA-DELL-HPC-HEAD ~]# scontrol show jobid=1118
JobId=1118 JobName=s02b_dell
   UserId=e116763(4996) GroupId=users(2080)
   Priority=3433358 Nice=0 Account=e116763_definite QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=5-21:11:50 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-10-09T16:37:34 EligibleTime=2015-10-09T16:37:34
   StartTime=2015-10-09T16:37:34 EndTime=Unknown
   PreemptTime=None SuspendTime=2015-10-15T14:37:49 SecsPreSuspend=508292
   Partition=normal AllocNode:Sid=DCA-DELL-HPC-HEAD:28139
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=DCA-DELL-HPC08
   BatchHost=DCA-DELL-HPC08
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./s02b_dell_aug14.exe
   WorkDir=/cae/e116763/cases/shamash/s02b_dell
   Power= SICP=0
[root@DCA-DELL-HPC-HEAD ~]# scontrol show jobid=1119
JobId=1119 JobName=s02c_dell
   UserId=e116763(4996) GroupId=users(2080)
   Priority=3433358 Nice=0 Account=e116763_definite QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=5-21:11:47 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-10-09T16:37:39 EligibleTime=2015-10-09T16:37:39
   StartTime=2015-10-09T16:37:39 EndTime=Unknown
   PreemptTime=None SuspendTime=2015-10-15T14:37:49 SecsPreSuspend=508287
   Partition=normal AllocNode:Sid=DCA-DELL-HPC-HEAD:28139
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=DCA-DELL-HPC08
   BatchHost=DCA-DELL-HPC08
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./s02c_dell_aug14.exe
   WorkDir=/cae/e116763/cases/shamash/s02c_dell
   Power= SICP=0
[root@DCA-DELL-HPC-HEAD ~]#
==========================================================================================

slurm.conf is attached.

Shall we go with "Shared=FORCE:1" to achieve our goal (per-partition preemption and no time-slicing)?
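For context, the "Shared=FORCE:1" idea being asked about would look roughly like the slurm.conf sketch below. This is only an illustration of the question, not a verified fix; the partition names, node list, and priorities are hypothetical, not taken from the attached config:

```
# Hypothetical sketch only -- not a confirmed configuration.
# Higher-Priority partitions preempt (suspend) lower ones; FORCE:1
# caps each core at one job per partition, avoiding time-slicing
# within a partition.
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PartitionName=low  Nodes=ALL Priority=1  Shared=FORCE:1 Default=YES
PartitionName=high Nodes=ALL Priority=10 Shared=FORCE:1
```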
Sergey,

Are you manually suspending jobs? If so, I can reproduce this, and I'll look into how it is happening. I would suggest, though, that you not manually suspend jobs and instead let the partitions do the preempting for you. I am fairly sure you can use backfill here to accomplish your goals; I believe the document you are looking at is incorrect, and I will verify and let you know.

Could you send your slurmctld.log? I also have a few suggestions/changes for your slurm.conf. Please open a new ticket and we can discuss them there.
Created attachment 2308 [details]
slurmctld log
We had not manually suspended any job during the times we observed this. We want to use exactly partition preemption.

I'm not sure I'm following you. Do you mean that scontrol suspend/scontrol resume just somehow enables "local backfilling" for unrelated jobs (ones that were not part of the suspend/resume)?
In my testing, scontrol suspend/resume would cause timeslicing to start. It shouldn't happen if you only let Slurm handle the suspending. Are you saying you did not run scontrol suspend/resume?

I only mentioned backfill since the document you referenced said you needed to use the builtin plugin when doing gang scheduling. Backfill isn't being turned on locally, nor is that possible.
We have already submitted a ticket about the configuration suggestions: http://bugs.schedmd.com/show_bug.cgi?id=2016

Well:
- Some suggestions are applicable to us right now (e.g. timeout settings).
- Some are not, as we are using Bright Cluster Manager and some artifacts and tautologies in our configs are due to it.
- Features - we use them to allow users to restrict the selection of nodes. This is a prototype; the real cluster will have different CPUs, different memory sizes on board, etc. Not an elegant solution, and partitions would probably work better here, but our business unit mandated that style.
- cgroups - probably in the future. We are running RHEL6 now and saw similar issues: http://bugs.schedmd.com/show_bug.cgi?id=1410. Cgroups support in RHEL6 is too rudimentary for me. We are going to upgrade to RHEL7 in the medium term, so cgroups are postponed for now.
- "You have "PreemptMode=suspend,gang" configured, but in order to suspend jobs, you'll need sufficient memory on the nodes for multiple jobs." - that is fine. Our users are lazy and don't like to annotate each and every job with --mem, and are not ready to accept the consequences of not specifying --mem. So in the short term we are prepared to see oom_kill sometimes.
- We might buy an additional node for the Bright manager; that would be the moment when the BackupController suggestion could be useful.
We did indeed use scontrol suspend a few times.

But why should it enable timeslicing for unrelated jobs? Some of them even started long after the manual suspend/resume event. That smells like a bug to me...
But if I remember correctly, I restarted slurmctld and all slurmds a few times after I had used scontrol suspend/resume. That should be visible in the slurmctld logs...
I would update that bug with these concerns so they can be dealt with. I see it is marked resolved, which doesn't appear to be the case.
Agreed, it appears to be unexpected; we are looking into it. The problem should go away if you refrain from the practice, though :).
It doesn't matter; any manual intervention will set you on this course for the affected jobs, with no reversal.
Hmm, indeed I had used suspend some time before I observed the issue:

[2015-10-15T09:12:40.137] Processing RPC: REQUEST_SUSPEND(suspend) from uid=0
[2015-10-15T09:12:40.137] _slurm_rpc_suspend(suspend) for 1103 usec=7150

But job 1103 even ran on other hosts (not the host where we observed time-slicing):

[root@DCA-DELL-HPC-HEAD ~]# squeue -j 1103
 JOBID PARTITION NAME    USER ST       TIME NODES NODELIST(REASON)
  1103    normal wrap e153547  R 6-06:07:55     2 DCA-DELL-HPC[03-04]
[root@DCA-DELL-HPC-HEAD ~]#

Again, it sounds like a bug to me when a manual suspend starts time-slicing globally behind the scenes.
Shall I resubmit this as another bug? I would appreciate it if you looked at this, as you stated in comment #1.
Just reopen that other bug.
Sorry, I meant: shall I open another bug for "...scontrol suspend/resume would cause timeslicing to start"?
No, from my viewpoint this is the bug. I am looking into it, as noted in comment 1.
Your configuration allows one job per resource (core) from each partition. By suspending a job, you released its cores to be allocated to other jobs. When you resumed that job, that left you with multiple jobs allocated the same cores, which the gang scheduling logic time-sliced as you observed. This demonstrates one of the dangers of manually suspending/resuming jobs.

Rather than suspending/resuming the job, how about sending the job SIGSTOP/SIGCONT instead (using the scancel command)? That does the same thing as suspend/resume, except that it is available for any user to do and does not release the cores to be allocated to other jobs, which would eliminate the time slicing. Would that satisfy your requirements?

The other option would be to drain the resources from the newly started jobs so that they would be available for the resumed job without conflicts.
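To illustrate the difference at the process level: a small Linux-only sketch of what delivering SIGSTOP/SIGCONT does to a process. In Slurm terms these are the signals `scancel --signal=STOP <jobid>` and `scancel --signal=CONT <jobid>` would send to a job's processes; the `sleep` child below is just a stand-in for a job step, not anything from this report:

```python
import os
import signal
import subprocess
import time

def proc_state(pid):
    # Read the process state letter from /proc (Linux-only):
    # 'T' = stopped, 'S' = sleeping, 'R' = running.
    with open("/proc/%d/stat" % pid) as f:
        return f.read().rsplit(")", 1)[1].split()[0]

# Stand-in for a job step's process.
proc = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)

# SIGSTOP freezes the process in place; from the scheduler's point of
# view nothing is released -- the job still holds its cores.
os.kill(proc.pid, signal.SIGSTOP)
time.sleep(0.2)
state_stopped = proc_state(proc.pid)

# SIGCONT lets it continue where it left off.
os.kill(proc.pid, signal.SIGCONT)
time.sleep(0.2)
state_running = proc_state(proc.pid)

proc.terminate()
proc.wait()
print(state_stopped, state_running)  # expect: T S
```

The key contrast with `scontrol suspend` is that no resource bookkeeping changes: the stopped job's cores never become allocatable to other jobs, so no overlap (and no gang scheduling) can arise.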
I've added documentation warning of the dangers of manual job suspend/resume:
https://github.com/SchedMD/slurm/commit/2929a2895457f4f60eab29e32ab5d5b0e035d6df

I've also modified the code so that if a job is manually resumed and its resources have already been allocated to other jobs, that job will be started and gang scheduled with the other jobs; at least the bookkeeping for the jobs will then be correct.
https://github.com/SchedMD/slurm/commit/d2d9206047967bd2bd7806d5f39ea7218216972e

These changes will be in version 15.08.2 when released, probably in a week or so. If you want to prevent the gang scheduling, you need to prevent other jobs from being allocated the resources of the job that you stop. Using SIGSTOP/SIGCONT to stop the job is probably your best option.
Hi,

Has the behavior described in this ticket eventually been addressed by
https://github.com/SchedMD/slurm/commit/344d2eab7ee6243b63cb71104af20b04495f7f81.patch ?
(In reply to Sergey Meirovich from comment #18)
> Has the behavior described in this ticket eventually been addressed by
> https://github.com/SchedMD/slurm/commit/344d2eab7ee6243b63cb71104af20b04495f7f81.patch ?

What that patch does is immediately notify the gang scheduling module that someone manually suspended or resumed a job, which can then trigger an immediate context switch.

For example, say job 10 has been allocated CPUs 0-15 on node 1. Once that job has been suspended, its CPUs become available for another job to use. After job 10 is suspended, let's say that job 11 from the same partition gets allocated CPUs 0-7 and job 12 gets allocated CPUs 8-15. Let's then say a sysadmin resumes job 10.

If gang scheduling is NOT configured, the kernel will allocate CPU time to the various processes of the jobs (which overlap), likely causing the jobs' run times to increase dramatically and quite possibly causing them to fail by reaching their time limits. If gang scheduling is enabled, job 10 will time-slice with jobs 11 and 12 until either job 10 ends or jobs 11 and 12 both end. A job's time limit is based only upon the time it is in the run state (not the time it is suspended by gang scheduling). If the applications have any real parallelism, gang scheduling them will get them done faster.

In any case, manually suspending and resuming jobs will result in oversubscribed CPUs, and gang scheduling is likely the best way to proceed.
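The bookkeeping described above can be sketched with a toy model. The class and method names are illustrative only (they are not Slurm internals); the job numbers and CPU ranges mirror the example in this comment:

```python
# Toy model of the core-allocation bookkeeping described above.
# Nothing here is Slurm's actual implementation.

class NodeAlloc:
    """Tracks which jobs hold which CPUs on one node."""

    def __init__(self, ncpus):
        self.cpus = {c: set() for c in range(ncpus)}  # cpu -> job ids

    def allocate(self, jobid, cpu_range):
        for c in cpu_range:
            self.cpus[c].add(jobid)

    def release(self, jobid):
        # What `scontrol suspend` effectively does: the job's CPUs
        # become available for other jobs to be allocated.
        for holders in self.cpus.values():
            holders.discard(jobid)

    def oversubscribed(self):
        # CPUs held by more than one job are candidates for gang
        # scheduling (time slicing).
        return sorted(c for c, h in self.cpus.items() if len(h) > 1)

node1 = NodeAlloc(16)
node1.allocate(10, range(0, 16))  # job 10 gets CPUs 0-15
node1.release(10)                 # admin suspends job 10
node1.allocate(11, range(0, 8))   # job 11 is allocated CPUs 0-7
node1.allocate(12, range(8, 16))  # job 12 is allocated CPUs 8-15
node1.allocate(10, range(0, 16))  # admin resumes job 10: overlap!

overlap = node1.oversubscribed()
print(overlap)  # every CPU is now shared by two jobs: [0, 1, ..., 15]
```

Once every CPU is held by two jobs, the gang scheduler has no conflict-free schedule, so time slicing between job 10 and jobs 11/12 is the expected outcome.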