| Summary: | Scheduling issue using multiple partitions and node weights | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Lee Reynolds <Lee.Reynolds> |
| Component: | Scheduling | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | nate |
| Version: | 18.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8549 | ||
| Site: | ASU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sdiag1.zip | ||
|
Description
Lee Reynolds
2019-12-05 18:08:32 MST
Lee Reynolds:
The reason code for almost all of the pending jobs is (Priority). I'd also like to add that the nodes which are stuck idle are all parallel nodes. The serial-only nodes are not idle.

Nate:
Lee, please attach your slurm.conf, the slurmctld log, and the output of sdiag. Also, please make sure to change your Slurm database password, as it was included in the slurmdbd.conf. Thanks, --Nate

Felip Moll:
Hi Lee, I see your backfill scheduler is not getting to the end of the queue. Can you add the bf_continue parameter to SchedulerParameters to see if it helps?

I want to study your configuration, partitions, and sdiag output a bit to give you some recommendations on the scheduler, but to do so I'd also need:

squeue -o "%i %u %.11v %.10P %.10m %.10q %.10Q %.19V %.19S %.19e %.7T %.10E %.10n %r %N"
scontrol show nodes
scontrol show job

As you guessed, it is also possible that a larger job has reserved a big set of nodes, and lower-priority jobs may not be able to be scheduled on those nodes. The nodes appear as IDLE even when they are reserved by the scheduler. Multi-partition jobs can also contribute to this situation, where a job is evaluated in the first partition of the list but not in the second one.

Provide me with the remaining information mentioned above and I will analyze your specific case. Thanks.

Lee Reynolds:
Created attachment 12504 [details] sdiag1.zip

Here is the info you requested. I've also included the current version of the slurm.conf file and the logs for slurmctld and slurmdbd. I've added the bf_continue flag and also set the default log level to 4.

Lee Reynolds, Senior RC Architect, ASU Research Computing
T 480-965-9460 | E Lee.Reynolds@asu.edu
researchcomputing.asu.edu | research.asu.edu | rcstats.asu.edu
Lee Reynolds:
We’re now seeing an issue where the slurmctld daemon is getting stuck, with the following message in its log:

server_thread_count over limit (256), waiting

Restarting the service “fixes” this, but I fear it will continue happening. Are there any settings we should scale back?
Lee Reynolds:
Is there a phone number we can call on Monday?
I’m being asked to escalate this issue.
Felip Moll:
(In reply to Lee Reynolds from comment #10)
Hi Lee, unfortunately our support model does not include phone/videoconference support; we prefer to keep track of everything in Bugzilla, which also helps us better understand the evolution of the bug. In any case I am dedicating the day to this bug, so my intention is to be quite responsive.

I am already working on the issue, but I would need a few things from you:

1. Please stop using numeric debug levels if possible; the names are clearer and avoid confusion. The translation is:
0: quiet, 1: fatal, 2: error, 3: info, 4: verbose, 5: debug, 6: debug2, 7: debug3, 8: debug4, 9: debug5
Please set your slurmctld debug level to 'debug2'. I may ask you *later* to enable DebugFlags=agent,backfill, but let's start at the beginning.

2. Check your system tuning parameters: can you review this guide and see if your server parameters are properly sized? https://slurm.schedmd.com/high_throughput.html

3. Check the ctld server and network load, specifically the number of open files. Send me back the slurmctld.log so I can try to diagnose the RPC queue issues, plus a capture of sdiag every 10 seconds for 3 minutes. The issue may be a massive job submission combined with some server parameters set too low.

Let me also know whether the server is now stable or you're still having issues.
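The sdiag capture requested in item 3 can be scripted. This is a minimal sketch, assuming `sdiag` is on PATH; the function name and output file are illustrative, not part of the ticket:

```shell
#!/bin/sh
# Capture repeated sdiag snapshots so RPC-queue growth can be correlated
# with the slurmctld log. capture_sdiag is a hypothetical helper name.
capture_sdiag() {
    samples=$1; interval=$2; out=$3
    : > "$out"
    i=1
    while [ "$i" -le "$samples" ]; do
        printf '=== %s sample %s/%s ===\n' "$(date +%FT%T)" "$i" "$samples" >> "$out"
        # If sdiag is unavailable, record that instead of aborting the loop.
        sdiag >> "$out" 2>&1 || printf '(sdiag unavailable)\n' >> "$out"
        if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
        i=$((i + 1))
    done
}

# 18 samples, 10 seconds apart = the 3 minutes requested in the ticket:
# capture_sdiag 18 10 sdiag_capture.log
```

The resulting file can then be attached alongside slurmctld.log.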
Felip Moll:
Off topic, but: a) it would be nice to fix these two errors:
[2019-12-06T13:04:48.302] error: Ignoring invalid Allow/DenyQOS value: leereyno
[2019-12-06T13:04:48.441] error: read_slurm_conf: default partition not set.
b) The CacheGroups option has been deprecated since 16.05; you can remove that line from slurm.conf.

I also see this warning:
[2019-12-06T13:04:48.445] TOPOLOGY: warning -- no switch can reach all nodes through its descendants. Do not use route/topology
That means you have disjoint sets of nodes in topology.conf, which implies that one job cannot use nodes from both sets at the same time. Can you attach your topology.conf? Is it possible that the issue you were seeing was related to this, with nodes in one set idle while the job requested nodes in the other set?

Besides that, can you provide dmesg -T from the ctld server? Do you use NFS?

Just a note on comment 11: it would be very helpful to catch the logs while the RPC issue is happening, or right after it happened. I am also studying the sdiag output and backfill parameters for your initial problem.

Felip Moll:
Lee, I suggest replacing your slurm.conf SchedulerParameters line with this one:

SchedulerParameters=max_rpc_cnt=150,bf_yield_interval=1000000,bf_max_time=150,bf_continue,bf_resolution=300,sched_interval=45,default_queue_depth=300,preempt_reorder_count=3,max_switch_wait=86400,bf_window=20160

I worked out a set of settings for your site based on your situation, sdiag, and output files. I think this can help fix your issues; I explain the changed parameters below. Can you apply them and tell me how it goes? These parameters must be tuned empirically, so after applying them we can analyze sdiag outputs and make further adjustments. I will be waiting for your feedback on this comment and the previous ones. Let me know if you have any questions.
----
max_rpc_cnt=150: make backfill release its locks when this RPC count is reached, to avoid starving RPCs.

bf_yield_interval=1000000, bf_max_time=150: make backfill release its locks more frequently so RPCs are served more often.

bf_continue: backfill will continue evaluating jobs from where it left off in the next iteration after breaking out of the current loop, instead of starting from the top again.

bf_resolution=300: you currently have bf_resolution=30, i.e. 30 seconds. This can overload the scheduler and, as seen in your sdiag, cause backfill to be slow and evaluate very few jobs. Higher values give better responsiveness, though scheduling becomes a bit less precise. Read it together with bf_window as: look bf_window minutes into the future at bf_resolution-second granularity. Your bf_window of 20160 means looking 14 days into the future while tracking jobs finishing at 30-second granularity to find available gaps; that seems excessive, so I suggest raising bf_resolution to at least 5 or 10 minutes.

sched_interval=45: your sched_interval=10 makes the main scheduler (not backfill) run every 10 seconds, which can also cause performance issues. I suggest increasing this value considerably; backfill runs every 30 seconds by default and should cover most jobs. Let's set it to at least 45 seconds.

default_queue_depth=300: specifying a large value here hurts system responsiveness, since the scheduling logic will not release locks for other events to occur. You can keep this value, but you may want to lower it if RPC issues persist.

One more parameter to consider: sched_min_interval limits the minimum time between the end of one scheduling cycle and the beginning of the next.
Even with sched_interval set, an event like a job submission may trigger the main scheduler and make it run too often, causing lock starvation (and possibly RPC starvation). The default is 2 µs; if you observe the scheduler kicking in too often, consider increasing it, e.g. sched_min_interval=2000000 (2 seconds).

Lee Reynolds:
I’m going through your emails from this morning and will have more info later today. Just wanted to let you know that your suggestion about the backfill scheduling settings seems to have done the trick, as the cluster is now actively scheduling jobs. So the initial issue is solved, but I’d still like to ensure that the thread exhaustion problem is also dealt with.
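Folding the rationale above into comments, the recommended settings could be kept in slurm.conf like this (a sketch; the values are the ones suggested in this ticket, not general-purpose defaults):

```
# slurm.conf -- SchedulerParameters suggested in this ticket; tune empirically.
#   max_rpc_cnt=150          backfill yields its locks once ~150 RPCs are pending
#   bf_yield_interval=1000000, bf_max_time=150
#                            backfill yields locks more often so RPCs are served
#   bf_continue              resume the backfill queue instead of restarting it
#   bf_resolution=300        5-minute backfill granularity (was 30 s)
#   sched_interval=45        main scheduler every 45 s (was 10 s)
#   default_queue_depth=300  jobs considered per main-scheduler pass
#   bf_window=20160          look 14 days into the future
SchedulerParameters=max_rpc_cnt=150,bf_yield_interval=1000000,bf_max_time=150,bf_continue,bf_resolution=300,sched_interval=45,default_queue_depth=300,preempt_reorder_count=3,max_switch_wait=86400,bf_window=20160

# Named debug level, as requested in comment 11 (instead of a numeric level):
SlurmctldDebug=debug2
```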
Felip Moll:
(In reply to Lee Reynolds from comment #14) I'm glad it helped. I am pretty sure my suggestions in the last comment regarding backfill and slurm.conf will also help with the RPC issues. Keep me posted.

Lee Reynolds:
The scheduler is running well and our throughput has increased. However, just to be sure that everything is good, I’m sending you everything you asked for in the three emails you sent out on Monday.
Here are the current settings:

/proc/sys/fs/file-max: 5000000
/proc/sys/net/ipv4/tcp_max_syn_backlog: 256
/proc/sys/net/ipv4/tcp_syncookies: 1
/proc/sys/net/ipv4/tcp_synack_retries: 5
/proc/sys/net/core/somaxconn: 128  **** RAISING TO 4096 ****
/proc/sys/net/ipv4/ip_local_port_range: 32768 60999

Munge has now been configured to use 10 threads.

Here are the limits for slurmctld:

Limit                  Soft Limit  Hard Limit  Units
Max cpu time           unlimited   unlimited   seconds
Max file size          unlimited   unlimited   bytes
Max data size          unlimited   unlimited   bytes
Max stack size         unlimited   unlimited   bytes
Max core file size     unlimited   unlimited   bytes
Max resident set       unlimited   unlimited   bytes
Max processes          31155       31155       processes
Max open files         65536       65536       files
Max locked memory      65536       65536       bytes
Max address space      unlimited   unlimited   bytes
Max file locks         unlimited   unlimited   locks
Max pending signals    31155       31155       signals
Max msgqueue size      819200      819200      bytes
Max nice priority      0           0
Max realtime priority  0           0
Max realtime timeout   unlimited   unlimited   us

The slurmctld process currently has 74 open files.

Here’s a link to the documents you requested:
https://www.dropbox.com/sh/mn33yssnvuhtmpc/AAAzC0-OM0PDb1vCZhGLc9UTa?dl=0
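The somaxconn change flagged above can be made persistent with a sysctl drop-in. This is a sketch following the values discussed in this ticket; the file path is illustrative:

```
# /etc/sysctl.d/90-slurmctld.conf (illustrative path)
# Raise the listen backlog so bursts of connections to slurmctld are not
# dropped; 128 was the value in place, 4096 is the value being raised to.
net.core.somaxconn = 4096
```

Apply with `sysctl --system` (or `sysctl -w net.core.somaxconn=4096` for an immediate, non-persistent change).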
Lee Reynolds:
I’ve updated the slurm.conf file to remove the leereyno QOS, as it does not exist.

We do not use a default partition on the cluster; instead we use the job submission plugin to determine which partition a job should be sent to when the user does not specify one. We’ve been using this since 2017 and I do not believe it to be a factor in the issues we were having.

The topology setting you’re seeing is what SchedMD recommended to us as a solution to a problem we were having with jobs attempting to run on nodes connected to different Omni-Path switches; see bug # 7950 for more info. We’ve been running the scheduler with the switches “disconnected” for over a month, so I don’t think this was related to our issue either. I’ve included the topology.conf file in the dropbox link.

I’ve included “dmesg -T” as part of the dropbox share in my previous email. We do use NFS.
We have home directories on NFS, and our main software package repository is stored on NFS.
Lee Reynolds:
We implemented the changes to SchedulerParameters as recommended, and the cluster has been working great since then.

Does SchedMD have any documentation that would provide guidance on how to configure these sorts of settings? There is of course this page: https://slurm.schedmd.com/sched_config.html But it is pretty sparse on explanation. SchedMD should really look into creating a training and certification process, preferably with multiple levels. We’ve been studying the man pages and other documentation for years, but we would not have known to change the settings you recommended.
Comment #13 on bug 8191 from Felip Moll:

Lee, I suggest replacing the SchedulerParameters line in your slurm.conf with this one:

SchedulerParameters=max_rpc_cnt=150,bf_yield_interval=1000000,bf_max_time=150,bf_continue,bf_resolution=300,sched_interval=45,default_queue_depth=300,preempt_reorder_count=3,max_switch_wait=86400,bf_window=20160

I worked out this set of settings for your site based on your situation, your sdiag output, and the files you provided; I think it can help fix your issues. The changed parameters are explained below. Can you apply them and tell me how it goes?

These parameters must be tuned empirically, so after you apply them we can analyze new sdiag output and make further adjustments. I will be waiting for your feedback on this comment and the previous ones. Let me know if you have any questions.
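As a rough aid for that empirical tuning, one could compare backfill depth against queue length from sdiag output. This is only a sketch, and bf_depth_ratio is a made-up helper name, not a Slurm tool; the field labels match the sdiag excerpts quoted later in this ticket:

```shell
# Sketch of a helper that reads sdiag output on stdin and reports how deep
# backfill gets into the queue on average.
bf_depth_ratio() {
  awk -F': *' '
    /Depth Mean \(try depth\)/ {depth = $2}
    /Queue length mean/        {qlen  = $2}
    END { printf "depth=%d qlen=%d ratio=%.2f\n", depth, qlen, depth / qlen }
  '
}

# On a live system one would run:  sdiag | bf_depth_ratio
# A ratio well below 1.0 suggests backfill is not reaching the end of the queue.
```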
----

max_rpc_cnt=150
Makes backfill release its locks when this RPC count is reached, to avoid starving RPCs.

bf_yield_interval=1000000
bf_max_time=150
Make backfill release its locks more frequently, so that RPCs are served more often.

bf_continue
Backfill continues evaluating jobs in the list in the next iteration after breaking out of the current loop, instead of starting from the top again.

bf_resolution=300
You currently have bf_resolution=30, which means 30 seconds. This can overload the scheduler, so I suggest moving to a larger value; it can also make backfill slow and, as seen in sdiag, cause it to evaluate very few jobs. Higher values give better responsiveness, but scheduling can be slightly less precise. Read it as: look bf_window minutes into the future at bf_resolution-second granularity. Your bf_window is 20160, which means looking 14 days into the future while accounting for jobs finishing every 30 seconds to find an available gap. That seems excessive, so I suggest raising bf_resolution to at least 5 or 10 minutes.

sched_interval=45
Your sched_interval=10 means the main scheduler (not backfill) runs every 10 seconds, which can also cause performance issues. I suggest increasing this value considerably; backfill runs every 30 seconds by default and should cover most jobs. Let's set it to at least 45 seconds.

default_queue_depth=300
Specifying a large value here results in poor system responsiveness, since the scheduling logic will not release locks for other events to occur. You can keep this value, but you may want to lower it if RPC issues are still seen.

Another parameter to consider: to enforce a minimum time between the end of one scheduling cycle and the beginning of the next, use sched_min_interval.
Even if sched_interval is set, an event like a job submission may trigger the main scheduler and make it run too often, causing lock starvation (and possibly RPC starvation). The default is 2 microseconds; if you observe the scheduler kicking in too often, consider increasing it, for example:

sched_min_interval=2000000

(In reply to Lee Reynolds from comment #16)
> The scheduler is running well and our throughput has increased.
>
> However, just to be sure that everything is good, I'm sending you everything
> you asked for in the three emails you sent out on Monday.

I reviewed your parameters and I have two suggestions.

#1# You definitely want to increase somaxconn.

/proc/sys/net/core/somaxconn
Current value: 128
Recommended value: 4096

Explanation: this parameter limits the number of established connections and the backlog for a socket. 128 is far too low and may cause communication issues for clients or daemons, or even impede communication from slurmctld. As explained in the docs, a burst of 1024 RPC requests needs 1024 connections to succeed.

#2# You may want to increase tcp_max_syn_backlog too.

/proc/sys/net/ipv4/tcp_max_syn_backlog
Current value: 256
Recommended value: 1024

Explanation: this parameter limits the number of half-open connections queued on a socket. If the slurmctld socket receives massive numbers of requests, for example job completion RPCs, it may limit throughput.

From linux/net/core/request_sock.c:

/*
 * Maximum number of SYN_RECV sockets in queue per LISTEN socket.
 * One SYN_RECV socket costs about 80bytes on a 32bit machine
 * ....
 * The minimum value of it is 128. Experiments with real servers show that
 * it is absolutely not enough even at 100conn/sec. 256 cures most
 * of problems.
 * This value is adjusted to 128 for low memory machines,
 * and it will increase in proportion to the memory of machine.
 * Note : Dont forget somaxconn that may limit backlog too.
 */

Another suggestion: check your log generation, since slurmctld.log is huge for only a couple of days. I think you have modified job_submit_partition.c and job_submit_defaults.c, and both are generating tons of log output. Consider moving those messages to debug2 or debug3 so that they are enabled only when necessary; this volume of logging can cause a real performance issue in slurmctld.

I suspect the provided sdiag captures are from before applying all the backfill settings, since I see queue lengths of about 2400 jobs but the scheduler trying to schedule only ~250 jobs in each cycle. If so, you can now check how many jobs the main and backfill schedulers consider. For example, compare new sdiag data against the previously provided sdiag:

Main schedule statistics (microseconds):
...
Mean depth cycle: 521
...
Last queue length: 2342

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
...
Last depth cycle (try sched): 248
Depth Mean (try depth): 259
Last queue length: 2342
Queue length mean: 1388

The depth should match the queue length more closely. As for the other files, everything looks fine and nothing unusual stands out.

- About topology: OK, it is perfectly fine to have disconnected switches.
- About the default partition: it is OK, but if you want to get rid of the message, just set one partition as the default; you will be modifying the partition in the job submit plugin anyway, so it does no harm.
- About NFS: just keep in mind that a stuck NFS mount can stall slurmctld and cause timeouts or other issues. I am speaking in general.
- About backfill documentation: the best approach, whenever you have a scheduler issue, is to analyze sdiag (man sdiag) captures over a period of time. Look at the performance of the main and backfill schedulers and see whether the numbers make sense, both the number of considered jobs and the timings. Also look at your queue and see if there are bursts of jobs.
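The two kernel settings recommended above (somaxconn and tcp_max_syn_backlog) can be made persistent across reboots with a sysctl drop-in file. This is a minimal sketch assuming a systemd-style distribution; the filename is illustrative:

```
# /etc/sysctl.d/90-slurmctld.conf  (hypothetical path)
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 1024
```

Load it with sysctl --system and verify with sysctl net.core.somaxconn; writing directly to /proc/sys only lasts until the next reboot.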
Finally, you need to check whether massive numbers of RPCs are being sent, which is also visible in sdiag. All of this will help you quantify your issue, and then, by reading the descriptions of the scheduler parameters in 'man slurm.conf', you should be able to tune them.

This case was quite obvious: sdiag showed that the schedulers were not reaching the end of the job queue, so jobs went unscheduled even though nodes were idle. That is why I suggested bf_continue, which makes the next iteration resume from the job after the one the previous cycle stopped at, rather than starting from the beginning. After that, you had RPC issues, and we tuned some parameters to support more connections and RPCs.

There is no perfect guide for this, since every situation is different. Components like NFS, network switches, technologies and architectures, job submission patterns (bursty HTC vs. big jobs vs. small jobs...), and so on lead to very different situations. Now you need to see whether everything works as expected with my suggestions; if not, a second round of tuning may be needed.

> SchedMD should really look into creating a training and certification process, preferably with multiple levels.

A few months ago we created a new department at SchedMD dedicated exclusively to training. It provides training modules for different Slurm areas. If you are interested, you can contact jess@schedmd.com.

Hope everything is clear; if not, please just ask. Also let me know if everything is fixed for you and whether we can close this issue.

Regards,
Felip

Hi Lee,

Can you please confirm your issue has been solved and that we can close this bug?

Thanks

Yes, I believe this issue has been resolved. We'll open a new ticket if we have any more problems.
Thanks, closing it then!