| Summary: | RebootProgram - Slurm.conf | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Damien <damien.leong> |
| Component: | Configuration | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 14.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Monash University | ||
Description
Damien
2016-06-09 11:25:12 MDT
From the slurm.conf man page:
RebootProgram
Program to be executed on each compute node to reboot it.
Invoked on each node once it becomes idle after the command
"scontrol reboot_nodes" is executed by an authorized user or a
job is submitted with the "--reboot" option. After being
rebooting, the node is returned to normal use. NOTE: This
configuration option does not apply to IBM BlueGene systems.
From the scontrol man page:
reboot_nodes [NodeList]
Reboot all nodes in the system when they become idle using the
RebootProgram as configured in Slurm's slurm.conf file. Accepts
an option list of nodes to reboot. By default all nodes are
rebooted. NOTE: This command does not prevent additional jobs
from being scheduled on these nodes, so many jobs can be executed
on the nodes prior to them being rebooted. You can explicitly
drain the nodes in order to reboot nodes as soon as possible,
but the nodes must also explicitly be returned to service
after being rebooted. You can alternately create an advanced
reservation to prevent additional jobs from being initiated on
nodes to be rebooted. NOTE: Nodes will be placed in a state of
"MAINT" until rebooted and returned to service with a normal
state. Alternately the node's state "MAINT" may be cleared by
using the scontrol command to set the node state to "RESUME",
which clears the "MAINT" flag.
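The drain, reboot, and resume sequence described in the scontrol excerpt could be scripted roughly as follows. This is only a sketch: the function names, the `DRY_RUN` switch, the node list, and the `Reason` text are illustrative and do not come from the man pages. It assumes `RebootProgram` is already set in slurm.conf (for example to `/sbin/reboot`).

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN, and the Reason text are hypothetical.
# Assumes slurm.conf already sets RebootProgram (for example /sbin/reboot).

# Run scontrol, or just print the command when DRY_RUN=1.
run_scontrol() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scontrol $*"
    else
        scontrol "$@"
    fi
}

# Drain the nodes so no new jobs start on them, then request the reboot.
# Each node is rebooted once it becomes idle.
drain_and_reboot() {
    nodes="$1"
    run_scontrol update NodeName="$nodes" State=DRAIN Reason="reboot" &&
    run_scontrol reboot_nodes "$nodes"
}

# Explicitly drained nodes must be returned to service by hand.
resume_nodes() {
    run_scontrol update NodeName="$1" State=RESUME
}
```

With `DRY_RUN=1` set, calling `drain_and_reboot 'node[01-04]'` prints the two scontrol commands instead of running them, which is a cheap way to check the node list before touching a production cluster.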
One friendly reminder - SchedMD's Slurm support is offered as "level 3" support only; we expect sites to be comfortable looking through the documentation (web pages at http://slurm.schedmd.com and the various man pages) and testing out most operational changes before turning to us for help.
- Tim
Hi Tim,
Thanks for this reminder. I apologise for this. I am formerly from the corporate IT world (hope that explains). We will do more due diligence on our own before approaching SchedMD support.
Thanks for helping us.
Cheers,
Damien