Ticket 5505 - sinfo doesn't include "reboot" in node state
Summary: sinfo doesn't include "reboot" in node state
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.7
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Brian Christiansen
Reported: 2018-07-31 04:14 MDT by Phil Schwan
Modified: 2018-08-10 16:40 MDT
Site: DownUnder GeoSolutions
Version Fixed: 17.11.9-2
Description Phil Schwan 2018-07-31 04:14:17 MDT
$ sinfo --version
slurm 17.11.7

One can't rely on sinfo to tell which nodes are scheduled to reboot or are in the process of rebooting:

> $ sinfo -p all
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all        down   infinite     55   drng lnod[0001-0006,0008,0010-0011,0013-0017,0019,0022,0024-0027,0041-0050,0052-0057,0059,0061-0074,0077-0080]
> all        down   infinite     25  drain lnod[0007,0009,0012,0018,0020-0021,0023,0028-0040,0051,0058,0060,0075-0076]

lnod0007, for example, is marked for reboot, but you'd never know it from sinfo:

> $ scontrol show node lnod0007
> NodeName=lnod0007 Arch=x86_64 CoresPerSocket=64
>    State=REBOOT+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A

This is especially surprising since you've already designated an abbreviation for rebooting, @.  So drng@ and drain@ would seem fair, if the full state name is too long.

That also leads to this situation:

> $ sinfo -p all
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all        down   infinite     25   drng lnod[0002-0006,0010-0011,0013-0015,0017,0019,0042,0045,0050,0053-0056,0062,0065,0068,0074,0077,0079]
> all        down   infinite     21  drain lnod[0001,0016,0022,0024-0027,0047-0049,0057,0059,0063-0064,0067,0069,0071-0073,0078,0080]
> all        down   infinite     34  drain lnod[0007-0009,0012,0018,0020-0021,0023,0028-0041,0043-0044,0046,0051-0052,0058,0060-0061,0066,0070,0075-0076]

Someone looking at this might ask why there are two "drain" lines.  It's because one set of nodes is REBOOT+DRAIN, and the other is DOWN+DRAIN.
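
Until sinfo shows the reboot flag, one possible workaround is to read the State= field from `scontrol show node` output directly. The following is a minimal sketch, not part of Slurm: the function name `nodes_with_flag` is hypothetical, the second node in the sample (`examplenode`) is invented for illustration, and the parser assumes the `NodeName=... State=A+B` layout shown above.

```python
def nodes_with_flag(scontrol_text, flag="REBOOT"):
    """Return node names whose State= field contains the given flag.

    Assumes the text looks like `scontrol show node` output, i.e. a
    NodeName=<name> token followed later by a State=<A+B+...> token.
    """
    matches = []
    current = None
    for token in scontrol_text.split():
        if token.startswith("NodeName="):
            current = token.split("=", 1)[1]
        elif token.startswith("State=") and current is not None:
            # Drop a possible "*" (not responding) marker before splitting.
            states = token.split("=", 1)[1].replace("*", "").split("+")
            if flag in states:
                matches.append(current)
    return matches


# First record is taken from the output above; "examplenode" is a
# hypothetical DOWN+DRAIN node added for contrast.
sample = """NodeName=lnod0007 Arch=x86_64 CoresPerSocket=64
   State=REBOOT+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A
NodeName=examplenode Arch=x86_64 CoresPerSocket=64
   State=DOWN+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A
"""

print(nodes_with_flag(sample))
```

Calling `nodes_with_flag(text, "DOWN")` on the same text would separate the two groups that sinfo collapses into identical "drain" lines.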

Cheers,

-Phil
Comment 1 Brian Christiansen 2018-07-31 14:47:58 MDT
Thanks for pointing that out. I'll look into it and get back to you.

Thanks,
Brian
Comment 8 Brian Christiansen 2018-08-10 16:40:09 MDT
This is fixed in the following commits:
https://github.com/SchedMD/slurm/commit/bf569fef2f8928594dc87ebdf6aa0659c10479fa
https://github.com/SchedMD/slurm/commit/f23411bc96be8055e3f295270c4a73709ce574b4

e.g.
brian@lappy:~/slurm/17.11/lappy$ sinfo -p debug -o %N,%T,%t
NODELIST,STATE,STATE
lappy[1-10],idle,idle

brian@lappy:~/slurm/17.11/lappy$ sbatch --wrap="sleep 600" -wlappy2
Submitted batch job 215543
brian@lappy:~/slurm/17.11/lappy$ sbatch --wrap="sleep 600" -wlappy3
Submitted batch job 215544

brian@lappy:~/slurm/17.11/lappy$ scontrol reboot lappy1
brian@lappy:~/slurm/17.11/lappy$ scontrol reboot lappy2
brian@lappy:~/slurm/17.11/lappy$ scontrol reboot asap lappy3

brian@lappy:~/slurm/17.11/lappy$ sinfo -p debug -o %N,%T,%t
NODELIST,STATE,STATE
lappy3,draining@,drng@
lappy2,mixed@,mix@
lappy1,reboot,boot
lappy[4-10],idle,idle
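
For scripts that consume this output, a rough sketch of parsing the `%N,%T,%t` format follows. It is an illustration only: the function name `parse_sinfo_csv` and the `reboot` key are hypothetical, and treating both a trailing `@` and the plain `reboot` state as reboot-related is an assumption based on the example output above. Only the `@` suffix is handled; any other state suffixes are left in place.

```python
def parse_sinfo_csv(text):
    """Parse `sinfo -o %N,%T,%t` output into a list of dicts.

    Node lists can themselves contain commas (e.g. "lnod[0001-0006,0008]"),
    so each line is split from the right: the last two fields are the
    long and short state, everything before them is the node list.
    """
    rows = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the NODELIST,STATE,STATE header
        nodelist, long_state, short_state = line.rsplit(",", 2)
        base_state = long_state.rstrip("@")
        rows.append({
            "nodes": nodelist,
            "state": base_state,
            # Assumption: "@" suffix or a bare "reboot" state both mean
            # the node is reboot-related.
            "reboot": long_state.endswith("@") or base_state == "reboot",
        })
    return rows


sample = """NODELIST,STATE,STATE
lappy3,draining@,drng@
lappy2,mixed@,mix@
lappy1,reboot,boot
lappy[4-10],idle,idle
"""

for row in parse_sinfo_csv(sample):
    print(row["nodes"], row["state"], row["reboot"])
```

Splitting from the right keeps bracketed node ranges intact without needing a different field separator in the sinfo format string.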


Thanks,
Brian