Ticket 5505

Summary: sinfo doesn't include "reboot" in node state
Product: Slurm
Reporter: Phil Schwan <phils>
Component: slurmctld
Assignee: Brian Christiansen <brian>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.7   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions
Version Fixed: 17.11.9-2

Description Phil Schwan 2018-07-31 04:14:17 MDT
$ sinfo --version
slurm 17.11.7

One can't rely on sinfo to tell which nodes are scheduled to reboot or are in the process of rebooting:

> $ sinfo -p all
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all        down   infinite     55   drng lnod[0001-0006,0008,0010-0011,0013-0017,0019,0022,0024-0027,0041-0050,0052-0057,0059,0061-0074,0077-0080]
> all        down   infinite     25  drain lnod[0007,0009,0012,0018,0020-0021,0023,0028-0040,0051,0058,0060,0075-0076]

lnod0007, for example, is rebooting, but you'd never know it from sinfo, which lists it as plain "drain":

> $ scontrol show node lnod0007
> NodeName=lnod0007 Arch=x86_64 CoresPerSocket=64
>    State=REBOOT+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A

This is especially surprising since you've already designated an abbreviation for rebooting, @.  So drng@ and drain@ would seem fair if the full state name is too long.
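
Until sinfo shows the flag, the full state can be recovered from the scontrol output quoted above. The following is a minimal sketch, not anything shipped with Slurm: it assumes the key=value layout seen in that excerpt, and the function name node_state_flags is made up for illustration. Splitting on whitespace is a simplification that would mis-parse values containing spaces.

```python
def node_state_flags(scontrol_text):
    """Map each NodeName to the set of flags found in its State= field.

    Assumes whitespace-separated key=value tokens, as in the excerpt above.
    """
    states = {}
    current = None
    for token in scontrol_text.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # not a key=value token
        if key == "NodeName":
            current = value
        elif key == "State" and current is not None:
            # "REBOOT+DRAIN" becomes {"REBOOT", "DRAIN"}
            states[current] = set(value.split("+"))
    return states


# Sample text copied from the "scontrol show node lnod0007" excerpt.
sample = """\
NodeName=lnod0007 Arch=x86_64 CoresPerSocket=64
   State=REBOOT+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A
"""

flags = node_state_flags(sample)
rebooting = sorted(name for name, f in flags.items() if "REBOOT" in f)
print(rebooting)  # ['lnod0007']
```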

That also leads to this situation:

> $ sinfo -p all
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all        down   infinite     25   drng lnod[0002-0006,0010-0011,0013-0015,0017,0019,0042,0045,0050,0053-0056,0062,0065,0068,0074,0077,0079]
> all        down   infinite     21  drain lnod[0001,0016,0022,0024-0027,0047-0049,0057,0059,0063-0064,0067,0069,0071-0073,0078,0080]
> all        down   infinite     34  drain lnod[0007-0009,0012,0018,0020-0021,0023,0028-0041,0043-0044,0046,0051-0052,0058,0060-0061,0066,0070,0075-0076]

Someone looking at this might ask why there are two "drain" lines.  It's because one set of nodes is REBOOT+DRAIN and the other is DOWN+DRAIN.
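
The duplicate line can be reproduced with a small model. This is only a sketch of the behaviour described here, not Slurm's actual code: it assumes rows are grouped by full node state while the printed label drops the REBOOT flag. The node names n1 to n4 and the helpers short_label and rows are hypothetical.

```python
from collections import defaultdict


def short_label(full_state, show_reboot):
    """Return a compact label for a state such as "REBOOT+DRAIN"."""
    flags = full_state.split("+")
    label = "drain" if "DRAIN" in flags else flags[0].lower()
    if show_reboot and "REBOOT" in flags:
        label += "@"  # the suffix proposed in this ticket
    return label


def rows(node_states, show_reboot):
    """Group nodes by full state, then label each group for display."""
    groups = defaultdict(list)
    for node, state in node_states.items():
        groups[state].append(node)
    return [
        (short_label(state, show_reboot), sorted(members))
        for state, members in sorted(groups.items())
    ]


# Hypothetical nodes, two in each of the states named above.
nodes = {
    "n1": "DOWN+DRAIN",
    "n2": "REBOOT+DRAIN",
    "n3": "DOWN+DRAIN",
    "n4": "REBOOT+DRAIN",
}

print(rows(nodes, show_reboot=False))
# [('drain', ['n1', 'n3']), ('drain', ['n2', 'n4'])]
print(rows(nodes, show_reboot=True))
# [('drain', ['n1', 'n3']), ('drain@', ['n2', 'n4'])]
```

With the flag hidden, two distinct groups print under the same label; with the suffix shown, the rows become distinguishable.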

Cheers,

-Phil
Comment 1 Brian Christiansen 2018-07-31 14:47:58 MDT
Thanks for pointing that out. I'll look into it and get back with you.

Thanks,
Brian
Comment 8 Brian Christiansen 2018-08-10 16:40:09 MDT
This is fixed in the following commits:
https://github.com/SchedMD/slurm/commit/bf569fef2f8928594dc87ebdf6aa0659c10479fa
https://github.com/SchedMD/slurm/commit/f23411bc96be8055e3f295270c4a73709ce574b4

e.g.
brian@lappy:~/slurm/17.11/lappy$ sinfo -p debug -o %N,%T,%t
NODELIST,STATE,STATE
lappy[1-10],idle,idle

brian@lappy:~/slurm/17.11/lappy$ sbatch --wrap="sleep 600" -wlappy2
Submitted batch job 215543
brian@lappy:~/slurm/17.11/lappy$ sbatch --wrap="sleep 600" -wlappy3
Submitted batch job 215544

brian@lappy:~/slurm/17.11/lappy$ scontrol reboot lappy1
brian@lappy:~/slurm/17.11/lappy$ scontrol reboot lappy2
brian@lappy:~/slurm/17.11/lappy$ scontrol reboot asap lappy3

brian@lappy:~/slurm/17.11/lappy$ sinfo -p debug -o %N,%T,%t
NODELIST,STATE,STATE
lappy3,draining@,drng@
lappy2,mixed@,mix@
lappy1,reboot,boot
lappy[4-10],idle,idle


Thanks,
Brian
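
For anyone scripting against the fixed output, here is a minimal sketch of picking out the nodes that now carry reboot information. It assumes the exact "sinfo -o %N,%T,%t" layout shown in the example above; the function name reboot_markers and the category names are made up for illustration. Fields are split from the right because a bracketed node list can itself contain commas.

```python
def reboot_markers(sinfo_output):
    """Classify each row of 'sinfo -o %N,%T,%t' output by its long state."""
    result = {}
    lines = sinfo_output.strip().splitlines()
    for line in lines[1:]:  # skip the NODELIST,STATE,STATE header
        # Split from the right: node lists such as a[1-2,4] contain commas.
        nodelist, long_state, _short_state = line.rsplit(",", 2)
        if long_state.endswith("@"):
            result[nodelist] = "flagged"  # state carries the @ suffix
        elif long_state == "reboot":
            result[nodelist] = "reboot"   # state itself is reported as reboot
        else:
            result[nodelist] = "none"
    return result


# Sample copied from the post-fix sinfo output in comment 8.
sample = """\
NODELIST,STATE,STATE
lappy3,draining@,drng@
lappy2,mixed@,mix@
lappy1,reboot,boot
lappy[4-10],idle,idle
"""

print(reboot_markers(sample))
```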