Ticket 10541

Summary: reservation doesn't list all the nodes
Product: Slurm Reporter: Yann <yann.sagon>
Component: reservations    Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Version: 20.02.4   
Hardware: Linux   
OS: Linux   
Site: Université de Genève
Attachments: slurm.d/nodes.conf

Description Yann 2021-01-04 08:49:31 MST
Dear team, I wish you a happy new year and thanks for your work.

I've created a reservation with MAINT flag for 11 nodes.

```
[root@admin1 slurm-20.11.2]# scontrol show reservation temperature_idle_node
ReservationName=temperature_idle_node StartTime=2020-12-11T14:00:21 EndTime=2021-04-10T15:00:21 Duration=120-00:00:00
   Nodes=cpu[008,010,028-029,041,044,054,057,073,080] NodeCnt=11 CoreCnt=396 Features=(null) PartitionName=public-cpu Flags=MAINT,IGNORE_JOBS
   TRES=cpu=396
```

You can see that NodeCnt=11, but the Nodes= field only lists 10 nodes.
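
As a sanity check, the bracketed list can be expanded and counted. The sketch below is for illustration only: its minimal parser is an assumption that handles just this simple one-level bracket form, not Slurm's full hostlist syntax.

```python
import re

def expand(hostlist):
    """Expand a simple one-level bracketed hostlist, e.g. 'cpu[008,010,028-029]'.

    Illustration only: Slurm's real hostlist syntax is richer than this.
    """
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", hostlist)
    if not m:
        return [hostlist]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            # zero-pad to the same width as the range start, e.g. 008
            names.extend(f"{prefix}{i:0{len(lo)}d}" for i in range(int(lo), int(hi) + 1))
        else:
            names.append(prefix + part)
    return names

nodes = expand("cpu[008,010,028-029,041,044,054,057,073,080]")
print(len(nodes))  # prints 10, while the reservation reports NodeCnt=11
```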

I *guess* the missing node is this one:

```
[root@admin1 slurm-20.11.2]# scontrol show node cpu020
NodeName=cpu020 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=36 CPULoad=0.01
   AvailableFeatures=XEON_GOLD_6240,V7_INTEL
   ActiveFeatures=XEON_GOLD_6240,V7_INTEL
   Gres=(null)
   NodeAddr=cpu020 NodeHostName=cpu020 Version=20.02.4
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=190000 AllocMem=0 FreeMem=185176 Sockets=2 Boards=1
   State=MAINT+DRAIN ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=public-cpu,shared-cpu
   BootTime=2020-11-09T21:08:57 SlurmdStartTime=2021-01-04T11:04:37
   CfgTRES=cpu=36,mem=190000M,billing=36
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=#430 [root@2020-12-17T16:37:45]
```

This is what appears in the slurmctld log:

```
[2021-01-04T16:45:15.859] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[008,010,028-029,041,044,054,057,073,080]
```

How can we see all the nodes that belong to a given reservation?
Why is this log line appearing very frequently (~3 times per minute)?

We are using select/cons_tres.

Best
Comment 1 Scott Hilton 2021-01-07 15:45:38 MST
Yann, 

Could you send me the slurm.conf and slurmctld log for that day?

What was the command you used to create the reservation?

Do you have a way to reproduce this?

-Scott
Comment 2 Yann 2021-01-08 06:28:17 MST
Hello,

I think I created the reservation like this:

```
scontrol create \
    Reservation=temperature_idle_node \
    StartTime=NOW \
    Duration=120-0 \
    Users=root,$(getent group hpc_admin | sed -e 's/^hpc_admin:.*[0-9]://g') \
    Flags=maint,ignore_jobs \
    NodeCnt=10 \
    PartitionName=public-cpu
```

I don't remember when or if we added the extra node later.

According to the log, we first created the reservation, deleted it, re-created it, and then updated it.

Please see a log extract below (obtained with `zgrep temperature_idle_node slurmctld.log-202012* > res`):

```
slurmctld.log-20201212.gz:[2020-12-11T13:55:50.243] _slurm_rpc_resv_create reservation=temperature_idle_node: Requested nodes are busy
slurmctld.log-20201212.gz:[2020-12-11T13:59:27.512] sched: Created reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[008-009,013-014,018-021,027-030,032,035-037,040-042,044,048,050-054,057,060,067-068,070,072-074,076,081] cores=1296 licenses=(null) tres=cpu=1296 watts=4294967294 start=2020-12-11T13:59:27 end=2021-04-10T14:59:27 MaxStartDelay=
slurmctld.log-20201212.gz:[2020-12-11T14:00:12.630] _slurm_rpc_delete_reservation complete for temperature_idle_node usec=87
slurmctld.log-20201212.gz:[2020-12-11T14:00:21.213] sched: Created reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[028-029,041,044,053-054,057,067,074,076] cores=360 licenses=(null) tres=cpu=360 watts=4294967294 start=2020-12-11T14:00:21 end=2021-04-10T15:00:21 MaxStartDelay=
slurmctld.log-20201212.gz:[2020-12-11T14:17:33.011] sched: Updated reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[028-029,041,044,053-054,057,067-068,074,076] cores=396 licenses=(null) tres=cpu=396 watts=4294967294 start=2020-12-11T14:00:21 end=2021-04-10T15:00:21 MaxStartDelay=
slurmctld.log-20201212.gz:[2020-12-11T16:22:16.761] Recovered state of reservation temperature_idle_node
slurmctld.log-20201212.gz:[2020-12-11T17:47:29.850] Recovered state of reservation temperature_idle_node
slurmctld.log-20201212.gz:[2020-12-11T21:53:54.880] Recovered state of reservation temperature_idle_node
slurmctld.log-20201212.gz:[2020-12-11T22:32:07.853] Recovered state of reservation temperature_idle_node
slurmctld.log-20201213.gz:[2020-12-12T18:31:37.944] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[020,028-029,041,044,053-054,057,068,074,076]
```


It is as if node cpu067 was removed from the reservation and replaced by node cpu020?
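
Diffing the two node lists from the log bears this out. A sketch using the same simplified hostlist expansion as before (an illustrative assumption, not Slurm's real parser):

```python
import re

def expand(hostlist):
    """Expand a simple one-level bracketed hostlist (illustration only)."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", hostlist)
    if not m:
        return [hostlist]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            names.extend(f"{prefix}{i:0{len(lo)}d}" for i in range(int(lo), int(hi) + 1))
        else:
            names.append(prefix + part)
    return names

# Node lists from the "Updated reservation" and "modified reservation" log lines
before = set(expand("cpu[028-029,041,044,053-054,057,067-068,074,076]"))
after = set(expand("cpu[020,028-029,041,044,053-054,057,068,074,076]"))
print("removed:", sorted(before - after))  # removed: ['cpu067']
print("added:  ", sorted(after - before))  # added:   ['cpu020']
```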

Could this happen because the nodes weren't idle when we created the reservation, given that we asked for 10 nodes rather than specific ones?

I'm attaching nodes.conf, which is a subset of our slurm.conf. Let me know if you need more information.
Comment 3 Yann 2021-01-08 06:28:56 MST
Created attachment 17393 [details]
slurm.d/nodes.conf
Comment 4 Scott Hilton 2021-01-08 09:12:09 MST
Yann,

It looks like it was behaving correctly in the instance you show in comment 2, assuming node 67 was a DOWN, DRAINED/DRAINING, FAILING or NO_RESPOND node. Did something like this happen to node 67 at this time?

In the original instance I am concerned that this line keeps repeating ~3 times per minute. This shows it was unable to modify itself properly and kept retrying.
>[2021-01-04T16:45:15.859] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[008,010,028-029,041,044,054,057,073,080]
I am also concerned that the list of Nodes and NodeCnt disagreed.

-Scott
Comment 5 Yann 2021-01-11 08:23:40 MST
Hi,

I'm not sure I understood correctly: 

We created the reservation "temperature_idle_node" in two steps (as listed in comment 2).

The final step was this one:

```
slurmctld.log-20201212.gz:[2020-12-11T14:17:33.011] sched: Updated reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[028-029,041,044,053-054,057,067-068,074,076] cores=396 licenses=(null) tres=cpu=396 watts=4294967294 start=2020-12-11T14:00:21 end=2021-04-10T15:00:21
```

And then this appears many times in the log:

```
slurmctld.log-20201213.gz:[2020-12-12T18:31:37.944] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[020,028-029,041,044,053-054,057,068,074,076]
```

Please note that node 020 wasn't present before and node 067 is no longer present. My question is: is it normal that a node can be added automatically to the reservation when a node in the reservation fails? I would consider this problematic, as this reservation wasn't meant to be used for jobs; it was set with the MAINT flag so that we could work on the nodes (stopping, rebooting, reinstalling them).

It does seem that node cpu067 had an issue:

```
[2020-12-12T18:32:54.385] Node cpu067 now responding
```

But I don't see in the log when it stopped responding. Maybe it was only a transient error, but it appears to have happened at more or less the same time as the first "modified reservation" message in the log.

I also see that we had issues with inconsistent slurm.conf files during this period :(

Maybe it isn't worth investigating further; I can reopen the case if we face this issue again.

Best
Comment 6 Scott Hilton 2021-01-11 15:26:48 MST
Yann,

> My question is: is this normal that a node can be added
> automagically to the the reservation if a node from this reservation fails?
> I would consider this problematic as this reservation wasn't meant to be
> used, but was set with the MAINT flag for the purpose of working (stoping,
> rebooting, re installing) the nodes.
Yes, this is the correct behavior when you ask for NodeCnt=<num>. 

If you use the STATIC_ALLOC flag, or if you ask for specific nodes, this will not happen.
>scontrol create Reservation=<name> NodeCnt=<num> Flags=STATIC_ALLOC
>or
>scontrol create Reservation=<name> Nodes=<list of nodes>

-Scott
Comment 7 Yann 2021-01-14 02:12:30 MST
Hello,

Thanks for your answer. I wasn't aware of the STATIC_ALLOC flag!

I'm not quite sure why STATIC_ALLOC isn't implied by the MAINT flag, as I don't see a use case that would need MAINT without STATIC_ALLOC.

It seems I'm able to reproduce the issue:

```
[root@admin1 ~]# scontrol create Reservation=test_reservation \
> StartTime=NOW \
> Duration=120-0 \
> Users=root,$(getent group hpc_admin | sed -e 's/^hpc_admin:.*[0-9]://g') \
> Flags=maint,ignore_jobs \
> NodeCnt=3 \
> PartitionName=shared-bigmem
Reservation created: test_reservation

[root@admin1 ~]# scontrol show reservation test_reservation
ReservationName=test_reservation StartTime=2021-01-14T10:02:22 EndTime=2021-05-14T11:02:22 Duration=120-00:00:00
   Nodes=cpu[120-122] NodeCnt=3 CoreCnt=108 Features=(null) PartitionName=shared-bigmem Flags=MAINT,IGNORE_JOBS
   TRES=cpu=108
   Users=root,falcone,sagon,ressegai,brero,capello Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

[root@admin1 ~]# scontrol update node=cpu122 state=drain reason=test

[root@admin1 ~]# scontrol show reservation test_reservation
ReservationName=test_reservation StartTime=2021-01-14T10:02:22 EndTime=2021-05-14T11:02:22 Duration=120-00:00:00
   Nodes=cpu[120-121] NodeCnt=3 CoreCnt=108 Features=(null) PartitionName=shared-bigmem Flags=MAINT,IGNORE_JOBS
   TRES=cpu=108
   Users=root,falcone,sagon,ressegai,brero,capello Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
```

In the last output, you can see that node cpu122 disappeared from the reservation and only two nodes remain, but NodeCnt is still 3.
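
A disagreement like this can be flagged by comparing NodeCnt against the expanded Nodes= list. A hedged sketch: the regex-based parsing and the simplified expand() helper below are illustrative assumptions, since scontrol's text output is not a stable machine interface.

```python
import re

def expand(hostlist):
    """Expand a simple one-level bracketed hostlist (illustration only)."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", hostlist)
    if not m:
        return [hostlist]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            names.extend(f"{prefix}{i:0{len(lo)}d}" for i in range(int(lo), int(hi) + 1))
        else:
            names.append(prefix + part)
    return names

def check(scontrol_text):
    """Return (reported NodeCnt, actual length of the Nodes= list)."""
    nodes = re.search(r"Nodes=(\S+)", scontrol_text).group(1)
    nodecnt = int(re.search(r"NodeCnt=(\d+)", scontrol_text).group(1))
    return nodecnt, len(expand(nodes))

sample = "Nodes=cpu[120-121] NodeCnt=3 CoreCnt=108"
print(check(sample))  # prints (3, 2): the two fields disagree
```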

Then, in the slurmctld log, ~3 times per minute:

```
[2021-01-14T10:09:32.743] modified reservation test_reservation due to unusable nodes, new nodes: cpu[120-121]
```

In this case I don't have any other idle nodes in this partition, but I do have other allocated nodes in this partition.

PS: is there a way to properly format code in the issue tracker (here)?

Best

Yann
Comment 8 Scott Hilton 2021-01-15 10:48:06 MST
Yann,

I am trying to reproduce the issue where it prints:
>[2021-01-14T10:09:32.743] modified reservation test_reservation due to unusable nodes, new nodes: cpu[120-121]
but then evidently fails and keeps retrying over and over.

Could I get the rest of your slurm.conf to help me reproduce your environment? Does this happen if you have idle nodes available? 

-Scott
Comment 9 Scott Hilton 2021-02-24 10:21:41 MST
Yann,

It looks like you have a suitable solution. I will go ahead and close this ticket.

If you want us to look at the issue where it fails to modify the reservation properly and keeps retrying, feel free to reopen the ticket.

-Scott