Created attachment 21571 [details]
slurm.conf

Yesterday I discovered that SLURM does not process new jobs for scheduling while other jobs are in CG state because the epilogue on them is still running. sched.log shows entries like:

sched: [2021-09-30T16:50:31.364] schedule() returning, some job is still completing

Is this on purpose, a bug, or a configuration issue?
Michael,

This is related to CompleteWait, which you have set to 1200.

https://slurm.schedmd.com/slurm.conf.html#OPT_CompleteWait

-Scott
But why is SLURM waiting on nodes in CG state? Why does it not simply ignore those nodes and consider them busy until they are really free?
Michael,

CG stands for COMPLETING.

> Why does it not simply ignore those nodes and consider them busy until they are really free?

That is what it does by default. While a node is in the COMPLETING state, Slurm cannot allocate its resources to other jobs; the job must be fully completed first. CompleteWait holds all scheduling so that new jobs can have a blank slate to allocate resources from. (For example, if a job is completing and the epilog finishes on nodes 3 and 7 first, the next job would grab them instead of nodes 0 and 1, which are closer together.)

https://slurm.schedmd.com/slurm.conf.html#OPT_CompleteWait

If you want your jobs to start as fast as possible, I recommend turning CompleteWait off, which is the default.

Let me know if you have questions about this.

-Scott
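For illustration, a minimal slurm.conf sketch (the 1200-second value matches the attached config; 0 is the documented default):

    # current setting: hold scheduling for up to 20 minutes while any job is still completing
    CompleteWait=1200

    # default behaviour: do not hold scheduling while jobs complete
    #CompleteWait=0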
Thanks for pointing that out - but that behavior is not really logical. If I have 100 nodes, 20 busy with jobs and 15 busy with jobs in completing state, why does scheduling not do anything with the remaining 65? Is there no other way to prevent jobs from starting on a node in CG state than to block scheduling on ALL nodes?
Michael,

This flag will limit the wait to nodes in the same partitions as the completing job (or overlapping partitions):

https://slurm.schedmd.com/slurm.conf.html#OPT_reduce_completing_frag

What is your use case? Why do you need CompleteWait?

-Scott
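For illustration, the flag is one of the comma-separated values of SchedulerParameters in slurm.conf (a sketch; any parameters already set there would need to be kept in the same list):

    SchedulerParameters=reduce_completing_frag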
Our epilogue takes up to 15 minutes and must not overlap with the prologue of a newly starting job.
Michael,

As I said before, jobs will not be allocated to a node with an epilog running. This is default behavior and is not associated with CompleteWait.

Is there a reason you also need the CompleteWait feature as I described it, or as specified in the documentation? Unless there is, I would suggest turning the feature off by setting it to 0 (the default).

-Scott
And I said before that I have had instances of prologues starting while the epilogue was still running. I need to prevent that. Unfortunately, SLURM's design philosophy is at odds with a cluster that is built on unreliable hardware, needs 15-minute prologues/epilogues, and has all nodes in a job wait until the prologue/epilogue on the head node is complete. CompleteWait seemed to be an option, but it is completely at odds with throughput. Originally I understood it would only affect scheduling to the nodes of a single completing job. Since that is not the case, I will need to find an alternative solution - any ideas?
Michael,

Yes, I see we recommended CompleteWait in bug 12445; sorry for the misunderstanding. CompleteWait was designed for a different use case and has significant drawbacks in high-throughput environments, especially with the long wait times you need. I don't currently have a good solution for your use case. However, if you keep CompleteWait for now, I believe the "reduce_completing_frag" flag will reduce the effect to just the partitions with the completing job.

https://slurm.schedmd.com/slurm.conf.html#OPT_reduce_completing_frag

-Scott
That does not help. Most of the nodes are in multiple partitions, both for different priorities and for convenience (users should find all nodes they can access in one queue - some queues have access to additional nodes, but many nodes are shared between queues). What I need is a way to prevent nodes from being scheduled into a new job while the epilogue is still running. Should I take the nodes offline at the start of the epilogue?
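For illustration, a rough sketch of that idea (untested; assumes the epilogue runs on the compute node with SLURMD_NODENAME set and sufficient privileges to run scontrol, and that the actual 15-minute teardown lives in a separate, hypothetical script):

    #!/bin/bash
    # Drain the node before the real teardown starts so it cannot be handed
    # to a new job, then resume it once teardown has finished.
    scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN Reason="epilogue in progress"
    /path/to/real_epilogue_work.sh   # hypothetical placeholder for the actual teardown
    scontrol update NodeName="$SLURMD_NODENAME" State=RESUME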
Michael, What are you doing in your prolog and epilog? Are you messing with the node configuration or power state? May I see the scripts? -Scott
Michael,

It seems you have quite a complicated situation. Could you give us a list of requirements regarding your job setup and teardown? Perhaps with that we can think about which features would work best for you.

-Scott
The modules are for both checking and configuring items on the OS that are not available without root permissions. All in all it is about 3k lines of Python and shell code. I'm willing to share it under NDA, but not willing to put it on this open bug report.

check_downclock nologin clean_turbostat set_all_cpus userid watchdog cpufreq nm_and_chassis fix_permissions memkind clean_dev_shm getmeminfo logger set_numa_balancing getcfsinfo msr_safe sharpd rsyslog test_swap clean_sep_files tmp_free_size rogue_test hugepages ntpd fuse intel_gpu nvideacard thp cpuonoff fix_truescale dmidecode read_mce config_ofed dev_shm_huge check_mounts config_lustre
Trying to put everything together:

- long prologue/epilogue runtimes
- nodes can be reconfigured and rebooted during prologue/epilogue (changing BIOS options, booting a different kernel or different kernel options, ...)
- prologue/epilogue run from the head node only - currently the epilogues on the nodes are simple "sleep" commands to mimic the behavior SLURM expects
- only a single instance of our special prologue/epilogue script is allowed to run
- strictly serial prologue/job/epilogue behaviour
- nodes must not be reused in a new job before the previous job, including its epilogue, is completely done
- SLURM must ignore changes in #cpus/cores during a run (due to Intel Speed Select Technology)
- low scheduling rate: 15k jobs per month
- job size 1-350 nodes
- minimum job runtime 10 min
- exclusive nodes only
- scheduling only to constraints
- no scheduling for GPUs
- no scheduling to specific cores/threads - actually, almost all SLURM scheduling options are not used
- access regulated via partitions - some nodes are only in specific queues, many nodes are in at least 3 partitions
Michael,

I am resolving this issue as invalid, and I will explain why below. I also want to point out that any future bug you open trying to have us support your rebooting solution will be closed as invalid. The method you are using to reboot nodes is not supportable due to the many issues we have previously mentioned, and your use of CompleteWait with this configuration is not something it was designed for.

As mentioned previously, you need to migrate to the node features helper. This plugin is designed to have nodes reboot, where prolog and epilog are not. We have an outstanding ticket to document the node_features plugin; however, we do support it today, and you should be using it instead of your current solution.

https://bugs.schedmd.com/show_bug.cgi?id=12331

You can find an example of how to configure this, until we document it on the website, here:

https://bugs.schedmd.com/attachment.cgi?id=20965&action=diff
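For illustration, a hypothetical sketch of what a helpers-based setup can look like (syntax follows the helpers.conf format of later Slurm releases and may differ from the pre-release example linked above; the feature names and helper path are made up):

    # slurm.conf
    NodeFeaturesPlugins=node_features/helpers

    # helpers.conf
    # kernel_a/kernel_b are made-up features; the helper program is expected to
    # report the currently active feature when run with no arguments and to
    # reconfigure/reboot the node when asked for a different one.
    Feature=kernel_a,kernel_b Helper=/usr/local/sbin/kernel_helper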
a) "You are not authorized to access bug #12331" b) the second link does not explain how users are supposed to request a feature c) even without reboots prologs/epilogues take 30+ seconds and must be run exclusively. That has nothing to do with reboots and you consider that usage invalid as well? d) from the description: NOTE that any feature under the control of a Helper cannot use the more complex specification language. If any of those more complex specifications are specified by a job using constraints with "[]()!*" then the job will be rejected. that makes the nodefeature plugin already barely usable for my cluster. Is there any plan to change that in the future?
Michael, I plan to send you an update later today. I am catching up from the long holiday weekend. I wanted to at least let you know that I did see this update and intend to reply.
> a) "You are not authorized to access bug #12331" This is an internal bug/task to fully document this and is not meant to be viewed generally. b) the second link does not explain how users are supposed to request a feature I apologize I should have pointed to the parent bug tracking this. https://bugs.schedmd.com/show_bug.cgi?id=9567 This builds off KNL where users would use the -C or --constraints option of Slurm's job submission commands: salloc, sbatch, and srun. > And that's all, a key-value format handled by custom external binaries is the gist of this proposal. > To match the KNL plugin there are also configuration options to tweak boot time and node reboot weight, > but it's straightforward. > The user-facing API is no different than using static constraints declared at the node level: > $ srun --constraint nps=1 ... c) even without reboots prologs/epilogues take 30+ seconds and must be run exclusively. That has nothing to do with reboots and you consider that usage invalid as well? This has been covered already by Marcin and Scott. https://bugs.schedmd.com/show_bug.cgi?id=12102 https://bugs.schedmd.com/show_bug.cgi?id=12588 We do not have any further interest in devoting time into the subject. The way in which you are using these features is just something Slurm was not designed to do. I would suggest looking into other ways to obtain exclusivity on that node in a way that Slurm would expect. This could be moving tasks to a dependent job that runs directly after the primary job, putting a reservation on nodes that last the duration of that job+epilog. As mentioned by Scott in 12588 CompleteWait 'holds all scheduling so that new jobs can have a blank slate to allocate resources from.' this would not scale well in a HTC environment. > d) from the description: NOTE that any feature under the control of a Helper cannot use the more complex specification language. > If any of those more complex specifications are specified by a job using constraints with "[]()!*" then the job will be rejected. > that makes the nodefeature plugin already barely usable for my cluster. Is there any plan to change that in the future? We currently do not have plans to expand this.
So your answer essentially is: SLURM cannot do what you need it to do. We also have no interest in developing the changes you'd need. Please leave us alone and go back to LSF. I won't bother you with these problems anymore, but you can't expect me to be happy about it.
Hi Michael,

> So your answer essentially is: SLURM cannot do what you need it to do. We also have no interest in developing the changes you'd need. Please leave us alone and go back to LSF.
> I won't bother you with these problems anymore, but you can't expect me to be happy about it.

I can understand the frustration here. Our goal is to have you move onto a platform that we can support, such as node features. Locking a node to block jobs while the epilogue runs presents several issues for Slurm (which we have discussed), and I just do not see a solution based on the epilogue that would be supportable. I am aware of some other mechanisms that might do this; however, they too rely on flipping node states, and I am not keen on proposing them since they introduce other issues.

--Previous comment--
> d) From the description: "NOTE that any feature under the control of a Helper cannot use the more complex specification language.
> If any of those more complex specifications are specified by a job using constraints with "[]()!*" then the job will be rejected."
> That makes the node_features plugin barely usable for my cluster as it stands. Is there any plan to change that in the future?

Even though there are no plans to expand this now, I would still suggest proposing use cases that would make it more viable for you. Looking to the future, there are tasks underway that may make things easier for both us and you. Although I do not have the specifics to share right now, we do mention this in the roadmap (page 37, "Truly Dynamic Nodes"):

https://slurm.schedmd.com/SLUG21/Roadmap.pdf
OK, please keep me informed once the situation changes.