5505 – sinfo doesn't include "reboot" in node state

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	17.11.7
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Brian Christiansen

Reported:	2018-07-31 04:14 MDT by Phil Schwan
Modified:	2018-08-10 16:40 MDT
CC List:	0 users

Site:	DownUnder GeoSolutions
Version Fixed:	17.11.9-2

Description Phil Schwan 2018-07-31 04:14:17 MDT

$ sinfo --version
slurm 17.11.7

One can't rely on sinfo to show which nodes are scheduled to reboot or are in the process of rebooting:

> $ sinfo -p all
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all        down   infinite     55   drng lnod[0001-0006,0008,0010-0011,0013-0017,0019,0022,0024-0027,0041-0050,0052-0057,0059,0061-0074,0077-0080]
> all        down   infinite     25  drain lnod[0007,0009,0012,0018,0020-0021,0023,0028-0040,0051,0058,0060,0075-0076]

lnod0007, for example, is flagged for reboot, but you'd never know it from sinfo:

> $ scontrol show node lnod0007
> NodeName=lnod0007 Arch=x86_64 CoresPerSocket=64
>    State=REBOOT+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A

This is especially surprising since you've already designated an abbreviation for rebooting, @.  So drng@ and drain@ would seem fair, if the full state name is too long.
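
Until sinfo reports the flag, one workaround is to read it from scontrol instead. The following is only a minimal sketch: the function name nodes_with_flag is hypothetical, and it assumes the State= field is a '+'-joined list of flags, as in the REBOOT+DRAIN example above. It does not handle Reason= text that happens to contain key=value tokens.

```python
def nodes_with_flag(scontrol_output: str, flag: str = "REBOOT") -> list:
    """Return node names whose State= field contains the given flag.

    Assumes `scontrol show node` prints whitespace-separated key=value
    tokens and that NodeName= precedes State= within each record.
    """
    found = []
    name = None
    for token in scontrol_output.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # not a key=value token
        if key == "NodeName":
            name = value
        elif key == "State" and name is not None:
            # State is assumed to look like REBOOT+DRAIN.
            if flag in value.split("+"):
                found.append(name)
    return found


# Sample taken from the scontrol output quoted in this ticket.
sample = """\
NodeName=lnod0007 Arch=x86_64 CoresPerSocket=64
   State=REBOOT+DRAIN ThreadsPerCore=4 TmpDisk=137881 Weight=1 Owner=N/A MCS_label=N/A
"""
print(nodes_with_flag(sample))  # ['lnod0007']
```

In practice the input would come from running scontrol show node and capturing its output; the sample string above stands in for that.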

That also leads to this situation:

> $ sinfo -p all
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all        down   infinite     25   drng lnod[0002-0006,0010-0011,0013-0015,0017,0019,0042,0045,0050,0053-0056,0062,0065,0068,0074,0077,0079]
> all        down   infinite     21  drain lnod[0001,0016,0022,0024-0027,0047-0049,0057,0059,0063-0064,0067,0069,0071-0073,0078,0080]
> all        down   infinite     34  drain lnod[0007-0009,0012,0018,0020-0021,0023,0028-0041,0043-0044,0046,0051-0052,0058,0060-0061,0066,0070,0075-0076]

Someone looking at this might ask why there are two "drain" lines.  It's because one set of nodes is REBOOT+DRAIN, and the other is DOWN+DRAIN.
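
The symptom can be spotted mechanically by looking for a state label that appears on more than one row of the same partition. This is a rough sketch, not part of Slurm: the helper name repeated_state_rows is hypothetical, and it assumes the default sinfo column layout shown above with no spaces inside any field.

```python
from collections import Counter


def repeated_state_rows(sinfo_output: str) -> dict:
    """Map (partition, state) to its row count, for labels seen more than once."""
    lines = sinfo_output.strip().splitlines()
    header = lines[0].split()
    part_col = header.index("PARTITION")
    state_col = header.index("STATE")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        counts[(fields[part_col], fields[state_col])] += 1
    return {key: n for key, n in counts.items() if n > 1}


# The second sinfo listing from this ticket, without the quote markers.
sample = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all        down   infinite     25   drng lnod[0002-0006,0010-0011,0013-0015,0017,0019,0042,0045,0050,0053-0056,0062,0065,0068,0074,0077,0079]
all        down   infinite     21  drain lnod[0001,0016,0022,0024-0027,0047-0049,0057,0059,0063-0064,0067,0069,0071-0073,0078,0080]
all        down   infinite     34  drain lnod[0007-0009,0012,0018,0020-0021,0023,0028-0041,0043-0044,0046,0051-0052,0058,0060-0061,0066,0070,0075-0076]
"""
print(repeated_state_rows(sample))  # {('all', 'drain'): 2}
```

A repeated label means sinfo is distinguishing the rows internally by something it does not print, which is exactly the confusion described here.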

Cheers,

-Phil

Comment 1 Brian Christiansen 2018-07-31 14:47:58 MDT

Thanks for pointing that out. I'll look into it and get back to you.

Thanks,
Brian

Comment 8 Brian Christiansen 2018-08-10 16:40:09 MDT

This is fixed in the following commits:
https://github.com/SchedMD/slurm/commit/bf569fef2f8928594dc87ebdf6aa0659c10479fa
https://github.com/SchedMD/slurm/commit/f23411bc96be8055e3f295270c4a73709ce574b4

e.g.
brian@lappy:~/slurm/17.11/lappy$ sinfo -p debug -o %N,%T,%t
NODELIST,STATE,STATE
lappy[1-10],idle,idle

brian@lappy:~/slurm/17.11/lappy$ sbatch --wrap="sleep 600" -wlappy2
Submitted batch job 215543
brian@lappy:~/slurm/17.11/lappy$ sbatch --wrap="sleep 600" -wlappy3
Submitted batch job 215544

brian@lappy:~/slurm/17.11/lappy$ scontrol reboot lappy1
brian@lappy:~/slurm/17.11/lappy$ scontrol reboot lappy2
brian@lappy:~/slurm/17.11/lappy$ scontrol reboot asap lappy3

brian@lappy:~/slurm/17.11/lappy$ sinfo -p debug -o %N,%T,%t
NODELIST,STATE,STATE
lappy3,draining@,drng@
lappy2,mixed@,mix@
lappy1,reboot,boot
lappy[4-10],idle,idle
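
With the fix in place, the comma-separated output above can be filtered for reboot-related nodes. The sketch below is illustrative only: reboot_related is a hypothetical helper, and it assumes the exact -o %N,%T,%t format used in this example. Because a node list can itself contain commas inside brackets, each line is split from the right. How the "@" suffix and the plain "reboot" state should be interpreted is left to the reader; the code only sorts rows by what is printed.

```python
def reboot_related(sinfo_csv: str) -> dict:
    """Group node lists by whether the long state ends in '@' or is 'reboot'."""
    result = {"at_suffix": [], "reboot_state": []}
    # Skip the NODELIST,STATE,STATE header line.
    for line in sinfo_csv.strip().splitlines()[1:]:
        # Split from the right so commas inside NODELIST brackets survive.
        nodelist, state, _short = line.rsplit(",", 2)
        if state.endswith("@"):
            result["at_suffix"].append(nodelist)
        elif state == "reboot":
            result["reboot_state"].append(nodelist)
    return result


# Output copied from the example session in this comment.
sample = """\
NODELIST,STATE,STATE
lappy3,draining@,drng@
lappy2,mixed@,mix@
lappy1,reboot,boot
lappy[4-10],idle,idle
"""
print(reboot_related(sample))
# {'at_suffix': ['lappy3', 'lappy2'], 'reboot_state': ['lappy1']}
```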


Thanks,
Brian