Ticket 11629

Summary:	reconfigure occurrences
Product:	Slurm	Reporter:	Tony Racho <antonio-ii.racho>
Component:	Other	Assignee:	Jason Booth <jbooth>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	antonio-ii.racho
Version:	20.11.4
Hardware:	Linux
OS:	Linux
Site:	CRAY	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	Cray Internal	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Tony Racho 2021-05-16 22:29:16 MDT

Hello:

_slurm_rpc_reconfigure_controller: usec=3255858 began=03:35:02.302
[2021-05-17T03:35:05.559] _slurm_rpc_reconfigure_controller: completed usec=3255858
[2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting

We just wanted to ask, when do these reconfigure messages appear in the logs?

When doing?

- manual scontrol reconfig?
- other instances when this gets run automatically?
- during a failover from active to standby slurm controller?

Cheers,
Tony

Comment 1 Jason Booth 2021-05-17 11:25:31 MDT

> We just wanted to ask, when do these reconfigure messages appear in the logs?

A reconfigure happens when the controller receives a SIGHUP.

https://github.com/SchedMD/slurm/blob/slurm-20-11-7-1/src/slurmctld/controller.c#L1077

This is the result of "scontrol reconfig" being executed by an authorized user, e.g. SlurmUser or root.


> - manual scontrol reconfig?
Yes
> - other instances when this gets run automatically?
No, this is a manually triggered message, either from scontrol or from a "kill -1".
> - during a failover from active to standby slurm controller?
The backup does not send a SIGHUP.
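
To check when reconfigures actually ran, the slurmctld log can be scanned for the "completed" message. The following is a minimal Python sketch, assuming the log line format shown in the description of this ticket; the function name find_reconfigures and the sample list are hypothetical and only for illustration.

```python
import re

# Matches the timestamped "completed" line quoted in this ticket, e.g.
# [2021-05-17T03:35:05.559] _slurm_rpc_reconfigure_controller: completed usec=3255858
RECONFIG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] _slurm_rpc_reconfigure_controller: "
    r"completed usec=(?P<usec>\d+)"
)


def find_reconfigures(lines):
    """Return a (timestamp, duration_in_seconds) pair per completed reconfigure."""
    events = []
    for line in lines:
        match = RECONFIG_RE.match(line)
        if match:
            seconds = int(match.group("usec")) / 1_000_000
            events.append((match.group("ts"), seconds))
    return events


# Sample input: the two timestamped lines from the description of this ticket.
sample = [
    "[2021-05-17T03:35:05.559] _slurm_rpc_reconfigure_controller: completed usec=3255858",
    "[2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting",
]

for timestamp, seconds in find_reconfigures(sample):
    print(f"reconfigure completed at {timestamp} after {seconds:.3f}s")
```

In practice the lines would come from the file named by SlurmctldLogFile in slurm.conf rather than from a hard-coded list.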

Comment 2 Tony Racho 2021-05-17 16:36:00 MDT

Hi Jason:

I have a follow-up question on this.

From the logs:

_slurm_rpc_reconfigure_controller: usec=3255858 began=03:35:02.302
[2021-05-17T03:35:05.559] _slurm_rpc_reconfigure_controller: completed usec=3255858
[2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting

The 3rd line says that the thread count is over the limit (256). Does the reconfigure use the thread that has gone over the limit? Or will a reconfigure cause the thread count to go over the limit (especially in a very period)?

Also, when the server_thread_count limit is reached, what is the behaviour of the controller? Will it still accept requests and process commands (e.g. sinfo, squeue), or will it stop accepting them until the thread count drops back under the limit? What we saw was that the commands above hung, and after some time the hang cleared; we assume that is when the thread count dropped back under the limit.

Thanks,
Tony

Comment 3 Jason Booth 2021-05-17 16:58:39 MDT

Tony - Please do open a new bug for other questions that fall outside the original request.

> [2021-05-17T03:35:17.213] server_thread_count over limit (256), waiting


https://slurm.schedmd.com/sdiag.html#OPT_Server-thread-count


Server thread count
The number of current active slurmctld threads. A high number would mean a high load processing events like job submissions, jobs dispatching, jobs completing, etc. If this is often close to MAX_SERVER_THREADS it could point to a potential bottleneck.

> The 3rd line says that the thread count is over the limit (256). Does the
> reconfigure use the thread that has gone over the limit? Or will a
> reconfigure cause the thread count to go over the limit
> (especially in a very period)?

It can be normal to see spikes when launching a lot of jobs or completing a lot of jobs.
The important thing is that this value is not pinned at 256 all the time.

If the value does stay pinned at 256, and the slurmctld is slow to respond to job submissions
or client commands, then please do open a new bug to discuss tuning your scheduler.

sdiag is also a great tool to see what the server is doing and to determine whether there
are users looping client commands. You can view that under
"Remote Procedure Call statistics by user".
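
One way to tell a short spike from a sustained condition is to sample the "Server thread count" value from sdiag over time. The following is a minimal Python sketch, assuming sdiag prints a line of the form "Server thread count: N"; the helper names and the sample values are hypothetical, and the limit of 256 is taken from the log message above.

```python
import re
import subprocess

# Assumed sdiag line format: "Server thread count:  3"
THREAD_RE = re.compile(r"^Server thread count:\s*(\d+)", re.MULTILINE)


def parse_thread_count(sdiag_text):
    """Extract the server thread count from sdiag output, or None if absent."""
    match = THREAD_RE.search(sdiag_text)
    return int(match.group(1)) if match else None


def read_thread_count():
    """Run sdiag once and return the current server thread count."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return parse_thread_count(result.stdout)


def fraction_at_limit(samples, limit=256):
    """Fraction of samples at or above the thread limit (0.0 for no samples)."""
    if not samples:
        return 0.0
    return sum(1 for count in samples if count >= limit) / len(samples)


# Hypothetical samples collected at intervals: one spike, otherwise low.
samples = [12, 40, 256, 35, 18]
print(f"{fraction_at_limit(samples):.0%} of samples were at the limit")
```

A value that is occasionally at the limit matches the spikes described above; a fraction close to 100% over a long window is the "pinned" case worth a new bug.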

> Also, when the server_thread_count limit is reached, what is the
> behaviour of the controller? Will it still accept requests and process commands
> (e.g. sinfo, squeue), or will it stop accepting them until the thread count
> drops back under the limit? What we saw was that the commands above hung, and
> after some time the hang cleared; we assume that is when the limit cleared.

While the count is over the limit, the server will not be able to reply to new messages
until it clears out other pending tasks such as job submissions, launches, and completions,
as mentioned above.
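
The waiting behaviour can be pictured with a simplified model: a counter of active threads, and new requests that block while the counter is at the limit. The following Python sketch is only an illustration under that assumption and is not slurmctld's actual implementation; the class name ThreadLimiter and its fields are hypothetical.

```python
import threading


class ThreadLimiter:
    """Toy model: at most `limit` requests are serviced at the same time."""

    def __init__(self, limit):
        self.limit = limit
        self.active = 0   # requests currently being serviced
        self.waited = 0   # requests that had to wait at least once
        self.cond = threading.Condition()

    def acquire(self):
        """Block until a slot is free, then take it."""
        with self.cond:
            if self.active >= self.limit:
                # Corresponds to the "over limit, waiting" situation in the log.
                self.waited += 1
            while self.active >= self.limit:
                self.cond.wait()
            self.active += 1

    def release(self):
        """Free a slot and wake one waiting request."""
        with self.cond:
            self.active -= 1
            self.cond.notify()


limiter = ThreadLimiter(limit=2)
limiter.acquire()
limiter.acquire()   # both slots are now taken; a third acquire() would block
limiter.release()   # finishing one request frees a slot
limiter.acquire()   # so this call returns immediately
print(f"active={limiter.active} waited={limiter.waited}")
```

In this model a client command such as sinfo corresponds to a caller stuck in acquire(): it appears to hang and then proceeds once earlier requests finish, which matches the hang described in the question.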


If this is a pain point then I do highly suggest opening a new bug for us to investigate.