Ticket 2811 - RebootProgram - Slurm.conf
Summary: RebootProgram - Slurm.conf
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 14.11.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2016-06-09 11:25 MDT by Damien
Modified: 2016-06-13 12:11 MDT
Site: Monash University


Description Damien 2016-06-09 11:25:12 MDT
Hi SLURM Support

This is not a bug report but an enquiry, and it is very low priority.


We are experimenting with the 'RebootProgram' parameter in our slurm.conf to try out Slurm's node reboot function.

Our slurm.conf contains:
-----------------

RebootProgram = "/sbin/shutdown -r now"

-----------------

Is this correct? How can we test it (something like "scontrol node test-hostname reboot")? Are there any other factors we should watch out for?


Cheers

Damien
Comment 1 Tim Wickberg 2016-06-09 18:50:35 MDT
From the slurm.conf man page:

       RebootProgram
              Program to be executed on each compute node to reboot it.
              Invoked on each node once it becomes idle after the command
              "scontrol reboot_nodes" is executed by an authorized user or a
              job is submitted with the "--reboot" option. After being
              rebooting, the node is returned to normal use. NOTE: This
              configuration option does not apply to IBM BlueGene systems.


From the scontrol man page:

       reboot_nodes [NodeList]
              Reboot all nodes in the system when they become idle using the
              RebootProgram as configured in Slurm's slurm.conf file. Accepts
              an option list of nodes to reboot. By default all nodes are
              rebooted. NOTE: This command does not prevent additional jobs
              from being scheduled on these nodes, so many jobs can be
              executed on the nodes prior to them being rebooted. You can
              explicitly drain the nodes in order to reboot nodes as soon as
              possible, but the nodes must also explicitly be returned to
              service after being rebooted. You can alternately create an
              advanced reservation to prevent additional jobs from being
              initiated on nodes to be rebooted. NOTE: Nodes will be placed
              in a state of "MAINT" until rebooted and returned to service
              with a normal state. Alternately the node's state "MAINT" may
              be cleared by using the scontrol command to set the node state
              to "RESUME", which clears the "MAINT" flag.
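
As a rough sketch of how the two excerpts above fit together for a single-node test: the script below prints (and, only if DRY_RUN=0, runs) the scontrol calls described in the man page text. The node name test-hostname comes from the question; the NODE and DRY_RUN variables and the run, request_reboot and clear_maint helpers are illustrative assumptions, not part of Slurm. The reboot_nodes subcommand is the one quoted for this Slurm version and may be named differently in other releases.

```shell
#!/bin/sh
# Hypothetical helper script: dry-run by default, so nothing is rebooted
# unless DRY_RUN=0 is set explicitly.

NODE="${NODE:-test-hostname}"   # node to test (name taken from the question)
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

request_reboot() {
    # Ask the controller to reboot the node once it becomes idle.
    run scontrol reboot_nodes "$NODE"
    # Show the node record; per the man page it stays in MAINT until rebooted.
    run scontrol show node "$NODE"
}

clear_maint() {
    # Clear the MAINT flag by setting the node state to RESUME.
    run scontrol update NodeName="$NODE" State=RESUME
}

request_reboot
```

Running it unchanged only prints the commands. clear_maint is deliberately not called automatically, since the man page describes RESUME as a way to clear the "MAINT" flag, so it is left as a separate manual step.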


One friendly reminder - SchedMD's Slurm support is offered as "level 3" support only; we expect sites to be comfortable looking through the documentation (web pages at http://slurm.schedmd.com and the various man pages) and testing out most operational changes before turning to us for help.

- Tim
Comment 2 Damien 2016-06-13 12:11:37 MDT
Hi Tim

Thanks for this reminder.

I apologise for this. I come from the corporate IT world (I hope that explains it).

We will do more due diligence on our own before approaching SchedMD support.


Thanks for helping us.


Cheers

Damien