| Summary: | reconfigure occurrences | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Tony Racho <antonio-ii.racho> |
| Component: | Other | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | antonio-ii.racho |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Slinky Site: | --- |
| Cray Sites: | Cray Internal | DS9 clusters: | --- |
Description
Tony Racho
2021-05-16 22:29:16 MDT

> We just wanted to ask, when do these reconfigure messages appear in the logs?

A reconfigure happens when the controller receives a SIGHUP.
https://github.com/SchedMD/slurm/blob/slurm-20-11-7-1/src/slurmctld/controller.c#L1077

This is the result of "scontrol reconfig" being executed by an authorized user, e.g. SlurmUser or root.

> - manual scontrol reconfig?

Yes

> - other instances when this gets run automatically?

No, this is a manually triggered message from scontrol or from a "kill -1".

> - during a failover from active to standby slurm controller?

The backup does not send a SIGHUP.

Hi Jason:

I have a follow-up question on this.

From the logs:

_slurm_rpc_reconfigure_controller: usec=3255858 began=03:35:02.302
[2021-05-17T03:35:05.559] _slurm_rpc_reconfigure_controller: completed usec=3255858
[2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting

The 3rd line says the thread count is over the limit (256). Does the reconfigure use the threads that have gone over the limit? Or will the reconfigure cause the thread count to go over the limit? (especially in a very period?)

Also, when the server_thread_count limit is reached, what is the behaviour of the controller? Will it still accept requests and process commands (e.g. sinfo, squeue), or does it stop accepting them until the thread count drops back under the limit? What we saw was that the commands above hung, and after some time the hang cleared, so we assume the thread count had dropped back under the limit.

Thanks,
Tony

Tony - Please do open a new bug for other questions that fall outside the original request.

> [2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting

https://slurm.schedmd.com/sdiag.html#OPT_Server-thread-count

Server thread count
The number of current active slurmctld threads. A high number would mean a high load processing events like job submissions, jobs dispatching, jobs completing, etc. If this is often close to MAX_SERVER_THREADS it could point to a potential bottleneck.

> The 3rd line says the thread count is over the limit (256). Does the
> reconfigure use the threads that have gone over the limit? Or will the
> reconfigure cause the thread count to go over the limit?
> (especially in a very period?)

It can be normal to see spikes when launching or completing a lot of jobs. The important thing is that this value is not pinned at 256 all the time. If it is, and the slurmctld is slow to respond to job submissions or client commands, then please do open a new bug to discuss tuning your scheduler. sdiag is also a great tool for seeing what the server is doing and for determining whether there are users looping client commands. You can view that under "Remote Procedure Call statistics by user".

> Also, when the server_thread_count limit is reached, what is the
> behaviour of the controller? Will it still accept requests and process commands
> (e.g. sinfo, squeue), or does it stop accepting them until the thread count
> drops back under the limit? What we saw was that the commands above hung, and
> after some time the hang cleared, so we assume the thread count had dropped
> back under the limit.

This means that the server will not be able to reply to new messages until it clears out other pending tasks such as job submissions, launches, and completions, as mentioned above. If this is a pain point then I do highly suggest opening a new bug for us to investigate.
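For anyone who wants to check how often the thread-count message follows a reconfigure in their own slurmctld log, here is a minimal Python sketch. It assumes log lines carry the bracketed timestamp prefix shown in the excerpt above. The function names, the 60-second window, and the sample list are hypothetical and only for illustration. A match only shows the two events were close in time; it does not show the reconfigure caused the spike, which can equally come from job launches or completions.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g. [2021-05-17T03:35:05.559]
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]\s+(?P<msg>.*)$"
)


def parse_events(lines):
    """Return (reconfigure_completions, over_limit_events) as datetime lists."""
    reconfigs, over_limit = [], []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp prefix
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        msg = m.group("msg")
        if msg.startswith("_slurm_rpc_reconfigure_controller: completed"):
            reconfigs.append(ts)
        elif msg.startswith("server_thread_count over limit"):
            over_limit.append(ts)
    return reconfigs, over_limit


def over_limit_after_reconfig(lines, window_seconds=60.0):
    """Pair each over-limit event with the latest reconfigure completion
    that happened at most window_seconds before it (hypothetical window)."""
    reconfigs, over_limit = parse_events(lines)
    pairs = []
    for t in over_limit:
        earlier = [
            r for r in reconfigs
            if r <= t and (t - r).total_seconds() <= window_seconds
        ]
        if earlier:
            latest = max(earlier)
            pairs.append((latest, t, (t - latest).total_seconds()))
    return pairs


# The two timestamped lines from the log excerpt in this ticket.
SAMPLE = [
    "[2021-05-17T03:35:05.559] _slurm_rpc_reconfigure_controller: completed usec=3255858",
    "[2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting",
]

if __name__ == "__main__":
    for reconf, limit, gap in over_limit_after_reconfig(SAMPLE):
        print(f"over limit {gap:.3f}s after reconfigure completed at {reconf.isoformat()}")
```

To run it against a real log, pass the open file object in place of SAMPLE. If over-limit messages also appear with no reconfigure nearby, that points toward general load, and the sdiag statistics mentioned above are the next place to look.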