| Summary: | Unexpected timeslicing (scontrol suspend/resume will cause timeslicing to start) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | Configuration | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | Normal | CC: | brian, da |
| Version: | 15.08.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.2 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm.conf, slurmctld log | | |
Description
Sergey Meirovich
2015-10-15 07:55:58 MDT
SchedMD: Sergey, are you manually suspending jobs? If so, I can reproduce this, and I will look into how it is happening. I would suggest that you not manually suspend jobs, though, and instead use the partitions to do the preempting for you. I am fairly sure you can use backfill here to accomplish your goals. I believe the document you are looking at is incorrect; I will verify and let you know. Could you send your slurmctld.log? I also have a few suggestions/changes for your slurm.conf. Please open a new ticket and we can discuss them there.

Sergey Meirovich: Created attachment 2308 [details]
slurmctld log
Sergey Meirovich: We had not manually suspended any job during the times we observed this. We want to use exactly partition preemption. I am not sure I am following you. Do you mean that scontrol suspend/scontrol resume somehow enables "local backfilling" for unrelated jobs (ones that were not part of the suspend/resume)?

SchedMD: In my testing, scontrol suspend/resume would cause timeslicing to start. It shouldn't happen if you only let Slurm handle the suspending. Are you saying you did not run scontrol suspend/resume? I only mentioned backfill because the document you referenced said you needed to use the builtin plugin when doing gang scheduling. Backfill isn't being turned on locally, nor is that possible.

Sergey Meirovich: We have already submitted a ticket about the configuration suggestions: http://bugs.schedmd.com/show_bug.cgi?id=2016

- Some suggestions are applicable to us right now (e.g. the timeout settings).
- Some are not, as we are using Bright Cluster Manager, and some artifacts and tautologies in the configs are due to it.
- Features: we are using them to allow users to restrict the selection of nodes. This is a prototype; the real cluster will have different CPUs, different memory sizes on board, etc. It is not an elegant solution, and partitions would probably work better here, but our business unit mandated that style.
- Cgroups: probably in the future. We are running RHEL6 now and saw similar issues (http://bugs.schedmd.com/show_bug.cgi?id=1410); cgroups support in RHEL6 is too rudimentary for me. We are going to upgrade to RHEL7 in the middle term, so cgroups are postponed for now.
- "You have "PreemptMode=suspend,gang" configured, but in order to suspend jobs, you'll need sufficient memory on the nodes for multiple jobs": that is fine. Our users are lazy, don't like to annotate each and every job with --mem, and are not ready to accept the consequences of not specifying --mem. So in the short term we are ready to observe the occasional oom_kill.
- We might buy an additional node for the Bright manager; that would be the moment when the BackupController suggestion could be useful.

Sergey Meirovich: We did indeed use scontrol suspend a few times. But why should it enable timeslicing for unrelated jobs? Some of those jobs even started a long time after the manual suspend/resume event. That smells like a bug to me...

Sergey Meirovich: If I remember correctly, I restarted slurmctld and all slurmds a few times after I had used scontrol suspend/resume. That should be visible in the slurmctld logs...

SchedMD: I would update that bug with these concerns so they can be dealt with. I see it is marked resolved, and that does not appear to be the case.

SchedMD: Agreed, it appears to be unexpected; we are looking into it. The problem should go away if you refrain from the practice, though :).

SchedMD: It doesn't matter; any manual intervention will set you on this course for the affected jobs, with no reversal.

Sergey Meirovich: Hmm,
Indeed I had used suspend some time before I observed the issue.
[2015-10-15T09:12:40.137] Processing RPC: REQUEST_SUSPEND(suspend) from uid=0
[2015-10-15T09:12:40.137] _slurm_rpc_suspend(suspend) for 1103 usec=7150
But jobid=1103 was even running on other hosts (not the host where we observed the time-slicing):
[root@DCA-DELL-HPC-HEAD ~]# squeue -j 1103
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1103 normal wrap e153547 R 6-06:07:55 2 DCA-DELL-HPC[03-04]
[root@DCA-DELL-HPC-HEAD ~]#
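For anyone auditing their own controller logs for such events, manual suspend/resume requests appear as RPC lines like the two quoted above. A minimal sketch, here run against a copy of the two lines from this ticket (on a real system you would point grep at whatever path SlurmctldLogFile in slurm.conf names):

```shell
# Write the two slurmctld.log lines quoted above to a temp file;
# on a real system, grep your SlurmctldLogFile path instead.
log=$(mktemp)
cat > "$log" <<'EOF'
[2015-10-15T09:12:40.137] Processing RPC: REQUEST_SUSPEND(suspend) from uid=0
[2015-10-15T09:12:40.137] _slurm_rpc_suspend(suspend) for 1103 usec=7150
EOF

# Each manual "scontrol suspend" shows up as a REQUEST_SUSPEND RPC.
grep -c 'REQUEST_SUSPEND' "$log"   # prints 1
rm -f "$log"
```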
Again, it sounds like a bug to me when a manual suspend starts time-slicing globally behind the scenes.
Shall I resubmit this as another bug? I would appreciate it if you looked at this, as you stated in comment #1.

SchedMD: Just reopen that other bug.

Sergey Meirovich: Sorry, I meant: shall I open another bug for "... scontrol suspend/resume would cause timeslicing to start"?

SchedMD: No, from my viewpoint this is the bug, and I am looking into it as noted in comment 1. Your configuration allows one job per resource (core) from each partition. By suspending a job, you released its cores to be allocated to other jobs. When you resumed that job, you were left with multiple jobs allocated the same cores, which the gang scheduling logic time-sliced, as you observed. This demonstrates one of the dangers of manually suspending/resuming jobs. Rather than suspending/resuming the job, how about sending the job SIGSTOP/SIGCONT instead (using the scancel command)? That does the same thing as suspend/resume, except that it is available to any user and does not release the cores to be allocated to other jobs, which would eliminate the time slicing. Would that satisfy your requirements? The other option would be to drain the resources from the newly started jobs so that they would be available for the resumed job without conflicts.
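The SIGSTOP/SIGCONT approach suggested above can be demonstrated outside of Slurm with any ordinary process. The sketch below (assuming a Linux shell with procps `ps`) uses a plain `sleep` as a stand-in for a job step; with Slurm, the equivalent commands would be `scancel --signal=STOP <jobid>` and `scancel --signal=CONT <jobid>`, which signal the job's processes without releasing its cores:

```shell
# Stand-in for a job step: a long-running background process.
sleep 300 &
pid=$!

# SIGSTOP freezes the process, just as "scancel --signal=STOP <jobid>"
# would freeze a job's processes. Unlike "scontrol suspend", the job's
# cores are never released, so no other job can land on them and no
# time slicing can begin.
kill -STOP "$pid"
sleep 0.2
ps -o stat= -p "$pid"   # state starts with "T" (stopped)

# SIGCONT resumes it, like "scancel --signal=CONT <jobid>".
kill -CONT "$pid"
sleep 0.2
ps -o stat= -p "$pid"   # state starts with "S" (sleeping) again

kill "$pid" 2>/dev/null
```

The trade-off: since the cores stay allocated, they sit idle while the job is stopped, which is exactly what avoids the oversubscription described in this ticket.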
SchedMD: I've added a documentation warning about the dangers of manual job suspend/resume: https://github.com/SchedMD/slurm/commit/2929a2895457f4f60eab29e32ab5d5b0e035d6df

I've also modified the code so that if a job is manually resumed and its resources have already been allocated to other jobs, that job will be started and gang-scheduled with the other jobs; at least the bookkeeping for the jobs will then be correct: https://github.com/SchedMD/slurm/commit/d2d9206047967bd2bd7806d5f39ea7218216972e

These changes will be in version 15.08.2 when released, probably in a week or so. If you want to prevent the gang scheduling, you need to prevent other jobs from being allocated resources from the job that you stop. Using SIGSTOP/SIGCONT to stop the job is probably your best option.

Sergey Meirovich: Hi, has the behavior described in this ticket eventually been addressed by https://github.com/SchedMD/slurm/commit/344d2eab7ee6243b63cb71104af20b04495f7f81.patch ?

SchedMD: (In reply to Sergey Meirovich from comment #18) What that patch does is immediately notify the gang scheduling module that someone manually suspended or resumed a job, which can then trigger an immediate context switch. For example, say job 10 has been allocated CPUs 0-15 on node 1. Once that job has been suspended, its CPUs become available for another job to use. After job 10 is suspended, let's say that job 11 from the same partition gets allocated CPUs 0-7 and job 12 gets allocated CPUs 8-15. Let's then say a sysadmin resumes job 10. If gang scheduling is NOT configured, the kernel will allocate CPU time to the jobs' various (overlapping) processes, likely causing the jobs' run times to increase dramatically and quite possibly causing them to fail by reaching their time limits.
If gang scheduling is enabled, job 10 will time-slice with jobs 11 and 12 until either job 10 ends or jobs 11 and 12 both end. A job's time limit is based only upon the time it spends in the run state (not the time it is suspended by gang scheduling). If the applications have any real parallelism, gang scheduling will get them done faster. In any case, manually suspending and resuming jobs will result in oversubscribed CPUs, and gang scheduling is likely the best way to proceed.
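For context, the gang scheduling behavior discussed throughout this ticket is governed by a handful of slurm.conf settings. The fragment below is an illustrative sketch, not taken from the attached slurm.conf; the node and partition names are placeholders, and `Shared` was renamed `OverSubscribe` in later Slurm releases:

```
# Suspend preempted jobs; gang-schedule any jobs sharing resources
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Seconds each job runs before the gang scheduler rotates (default 30)
SchedulerTimeSlice=30

# Oversubscription must be permitted for time slicing within a partition;
# FORCE:2 lets up to two jobs per resource time-slice.
PartitionName=normal Nodes=node[01-04] Shared=FORCE:2 Default=YES
```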