| Summary: | add bf_max_time separate from bf_interval | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | Contributions | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | sts |
| Version: | 17.02.2 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 17.11.0-0pre1 |
| Attachments: | bf_max_time patch | ||
Description
Doug Jacobsen
2017-05-15 10:18:17 MDT
Doug - Just a gentle reminder, please send these as Sev 4 so we can do some up-front triage, and make sure things don't get misplaced. (I've adjusted our custom business logic plugin to re-mark these on future submissions automatically.)

FYI, this is in use on cori and edison: bf_interval=30 bf_max_time=600

Thanks Doug, this has been added in commit c8c9694f8ca8a and will be in 17.11.

FYI, I suggest also setting max_rpc_cnt=150 or so if using this capability.

Doug, why do you see this as being warranted? Just want to give a good description in the man page instead of random advice :).

Doug, ping?

Sorry, I've been on travel and have been unable to properly manage my bugs.

After deploying this patch and the continue-scheduling-with-completing-nodes modification, I found that slurmctld would sometimes spend large amounts of time _only_ scheduling, sometimes running the primary scheduler repeatedly in the gaps between backfill lock releases. This then caused RPCs to get starved. Basically, between this patch and the other, we spent _far_ more time scheduling than we used to, which is great for utilization, for starting user jobs, and for ensuring our whole workload is reviewed frequently.

I set max_rpc_cnt to 150 because it generally balanced cori's RPC load against making useful progress on scheduling. It needs to be high enough that scheduling isn't always disabled, and low enough that our interactive workload can get through in a reasonable period of time. It certainly needs to be below 256 (the default RPC thread limit).

Thanks Doug, I put a snip in commit 24c04bce06c6e.
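For reference, the settings mentioned in this thread could be combined in slurm.conf roughly as follows. This is only a sketch: it assumes the backfill scheduler is in use and that no other SchedulerParameters options are already set. The values are the ones reported above for cori and edison and will likely need tuning for other sites.

```
# Sketch only; values taken from this thread, not general recommendations.
SchedulerType=sched/backfill

# bf_interval=30   : seconds between backfill scheduling cycles
# bf_max_time=600  : upper bound, in seconds, on a single backfill cycle
# max_rpc_cnt=150  : defer scheduling while this many slurmctld RPC threads are active
#                    (kept below the default RPC thread limit of 256 noted above)
SchedulerParameters=bf_interval=30,bf_max_time=600,max_rpc_cnt=150
```

SchedulerParameters takes a single comma-separated list, so these options would need to be merged into any existing SchedulerParameters line rather than added as a second one.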