| Summary: | reservation doesn't list all the nodes | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Yann <yann.sagon> |
| Component: | reservations | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED INFOGIVEN | Severity: | 4 - Minor Issue |
| Version: | 20.02.4 | Hardware: | Linux |
| OS: | Linux | Site: | Université de Genève |
| Attachments: | slurm.d/nodes.conf | | |
Description

Yann 2021-01-04 08:49:31 MST

Scott:
Yann,
Could you send me the slurm.conf and the slurmctld log for that day? What was the command you used to create the reservation? Do you have a way to reproduce this?
-Scott

Yann:
Hello,
I think I created the reservation like this:

scontrol create Reservation=temperature_idle_node StartTime=NOW Duration=120-0 Users=root,$(getent group hpc_admin | sed -e 's/^hpc_admin:.*[0-9]://g') Flags=maint,ignore_jobs NodeCnt=10 PartitionName=public-cpu

I don't remember when, or if, we added the extra node later. According to the log, we first created the reservation, deleted it, re-created it, and then updated it. Please see this log extract (zgrep temperature_idle_node slurmctld.log-202012* > res):

slurmctld.log-20201212.gz:[2020-12-11T13:55:50.243] _slurm_rpc_resv_create reservation=temperature_idle_node: Requested nodes are busy
slurmctld.log-20201212.gz:[2020-12-11T13:59:27.512] sched: Created reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[008-009,013-014,018-021,027-030,032,035-037,040-042,044,048,050-054,057,060,067-068,070,072-074,076,081] cores=1296 licenses=(null) tres=cpu=1296 watts=4294967294 start=2020-12-11T13:59:27 end=2021-04-10T14:59:27 MaxStartDelay=
slurmctld.log-20201212.gz:[2020-12-11T14:00:12.630] _slurm_rpc_delete_reservation complete for temperature_idle_node usec=87
slurmctld.log-20201212.gz:[2020-12-11T14:00:21.213] sched: Created reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[028-029,041,044,053-054,057,067,074,076] cores=360 licenses=(null) tres=cpu=360 watts=4294967294 start=2020-12-11T14:00:21 end=2021-04-10T15:00:21 MaxStartDelay=
slurmctld.log-20201212.gz:[2020-12-11T14:17:33.011] sched: Updated reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[028-029,041,044,053-054,057,067-068,074,076] cores=396 licenses=(null) tres=cpu=396 watts=4294967294 start=2020-12-11T14:00:21 end=2021-04-10T15:00:21 MaxStartDelay=
slurmctld.log-20201212.gz:[2020-12-11T16:22:16.761] Recovered state of reservation temperature_idle_node
slurmctld.log-20201212.gz:[2020-12-11T17:47:29.850] Recovered state of reservation temperature_idle_node
slurmctld.log-20201212.gz:[2020-12-11T21:53:54.880] Recovered state of reservation temperature_idle_node
slurmctld.log-20201212.gz:[2020-12-11T22:32:07.853] Recovered state of reservation temperature_idle_node
slurmctld.log-20201213.gz:[2020-12-12T18:31:37.944] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[020,028-029,041,044,053-054,057,068,074,076]

It is as if node 067 was removed from the reservation and replaced by node 020. Is this possible given that the nodes weren't idle when we created the reservation and we asked for 10 nodes rather than specific ones? I'm attaching nodes.conf, which is a subset of our slurm.conf. Let me know if you need more information.

Created attachment 17393 [details]
slurm.d/nodes.conf

Scott:
Yann,
It looks like it was behaving correctly in the instance you show in comment 2, assuming node 67 was a DOWN, DRAINED/DRAINING, FAILING or NO_RESPOND node at the time. Did something like that happen to node 67?
In the original instance I am concerned that the following line keeps repeating roughly three times per minute. It shows the controller was unable to modify the reservation properly and kept retrying:
>[2021-01-04T16:45:15.859] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[008,010,028-029,041,044,054,057,073,080]
I am also concerned that the list of Nodes and the NodeCnt disagreed.
-Scott

Yann:
Hi,
I'm not sure I understood correctly. We created the reservation "temperature_idle_node" in two steps (as listed in my comment 2). The final step was this one:
slurmctld.log-20201212.gz:[2020-12-11T14:17:33.011] sched: Updated reservation=temperature_idle_node users=root,falcone,sagon,ressegai,brero,capello nodes=cpu[028-029,041,044,053-054,057,067-068,074,076] cores=396 licenses=(null) tres=cpu=396 watts=4294967294 start=2020-12-11T14:00:21 end=2021-04-10T15:00:21
And then this appears many times in the log:
slurmctld.log-20201213.gz:[2020-12-12T18:31:37.944] modified reservation temperature_idle_node due to unusable nodes, new nodes: cpu[020,028-029,041,044,053-054,057,068,074,076]
Please note that node 020 wasn't present before and node 067 is no longer present.
My question is: is it normal that a node can be added automagically to the reservation when a node in the reservation fails? I would consider this problematic, since this reservation wasn't meant to be used: it was set with the MAINT flag for the purpose of working on (stopping, rebooting, reinstalling) the nodes.
It seems node cpu067 did indeed have an issue:
[2020-12-12T18:32:54.385] Node cpu067 now responding
But I don't see in the log when it stopped responding. Maybe it was only a transient error, but it appears to have happened at more or less the same time as the first "modified reservation" entry in the log. I also see we had issues with inconsistent slurm.conf files during this period :( Maybe it isn't worth investigating further, and I can reopen the case if we face this issue again.
Best

Scott:
Yann,
> My question is: is it normal that a node can be added automagically to the reservation when a node in the reservation fails? I would consider this problematic, since this reservation wasn't meant to be used: it was set with the MAINT flag for the purpose of working on (stopping, rebooting, reinstalling) the nodes.
Yes, this is the correct behavior when you ask for NodeCnt=<num>. If you use the STATIC_ALLOC flag, or if you ask for specific nodes, this will not happen:
>scontrol create Reservation=<name> NodeCnt=<num> Flags=STATIC_ALLOC
>or
>scontrol create Reservation=<name> Nodes=<list of nodes>
-Scott

Yann:
Hello,
thanks for your answer. I wasn't aware of the STATIC_ALLOC flag!
I'm not quite sure why STATIC_ALLOC isn't implied by the MAINT flag, as I don't see a use case that would need MAINT without STATIC_ALLOC. (A sketch of our command with the flag added follows below.)
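For future readers, here is a sketch of the original command with the suggested flag added. This assumes the flags can be combined in one comma-separated list (this exact combination wasn't retested here), and the Flags+= update syntax is likewise an assumption to verify against the scontrol man page:

# re-create the reservation with STATIC_ALLOC so that drained/down
# nodes are not silently replaced by other nodes
scontrol create Reservation=temperature_idle_node \
    StartTime=NOW \
    Duration=120-0 \
    Users=root,$(getent group hpc_admin | sed -e 's/^hpc_admin:.*[0-9]://g') \
    Flags=maint,ignore_jobs,static_alloc \
    NodeCnt=10 \
    PartitionName=public-cpu

# or, presumably, add the flag to an existing reservation in place
scontrol update ReservationName=temperature_idle_node Flags+=STATIC_ALLOC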
It seems I'm able to reproduce the issue:
[root@admin1 ~]# scontrol create Reservation=test_reservation \
> StartTime=NOW \
> Duration=120-0 \
> Users=root,$(getent group hpc_admin | sed -e 's/^hpc_admin:.*[0-9]://g') \
> Flags=maint,ignore_jobs \
> NodeCnt=3 \
> PartitionName=shared-bigmem
Reservation created: test_reservation
[root@admin1 ~]# scontrol show reservation test_reservation
ReservationName=test_reservation StartTime=2021-01-14T10:02:22 EndTime=2021-05-14T11:02:22 Duration=120-00:00:00
Nodes=cpu[120-122] NodeCnt=3 CoreCnt=108 Features=(null) PartitionName=shared-bigmem Flags=MAINT,IGNORE_JOBS
TRES=cpu=108
Users=root,falcone,sagon,ressegai,brero,capello Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
[root@admin1 ~]# scontrol update node=cpu122 state=drain reason=test
[root@admin1 ~]# scontrol show reservation test_reservation
ReservationName=test_reservation StartTime=2021-01-14T10:02:22 EndTime=2021-05-14T11:02:22 Duration=120-00:00:00
Nodes=cpu[120-121] NodeCnt=3 CoreCnt=108 Features=(null) PartitionName=shared-bigmem Flags=MAINT,IGNORE_JOBS
TRES=cpu=108
Users=root,falcone,sagon,ressegai,brero,capello Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
In the last output, you can see that node cpu122 disappeared from the reservation and only two nodes remain, but NodeCnt is still 3.
Then, roughly three times per minute, the slurmctld log shows:
[2021-01-14T10:09:32.743] modified reservation test_reservation due to unusable nodes, new nodes: cpu[120-121]
In this case I don't have any other idle nodes in this partition, but I do have other allocated nodes there.
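To undo the test above, the standard scontrol calls should suffice (a sketch, not retested in this session):

# return the drained node to service
scontrol update node=cpu122 state=resume
# drop the test reservation
scontrol delete ReservationName=test_reservation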
PS: is there a way to properly format code in the issue tracker (here)?
Best
Yann
Yann,
I am trying to reproduce the issue where slurmctld prints:
>[2021-01-14T10:09:32.743] modified reservation test_reservation due to unusable nodes, new nodes: cpu[120-121]
but evidently fails to apply the modification and keeps retrying over and over.
Could I get the rest of your slurm.conf to help me reproduce your environment? Does this happen if you have idle nodes available?
-Scott
Scott:
Yann,
It looks like you have a suitable solution, so I will go ahead and close this ticket. If you want us to look into the issue where the controller fails to modify the reservation properly and keeps retrying, feel free to reopen the ticket.
-Scott
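For anyone hitting the same behavior later, a quick way to check whether a reservation is dropping nodes is to compare its Nodes list against NodeCnt and to watch the controller log for the retry message (the log path below is an assumption; adjust to your setup):

# Nodes, NodeCnt and Flags appear on the same output line
scontrol show reservation temperature_idle_node | grep -E 'Nodes|Flags'
# watch for the repeating retry message
grep 'modified reservation' /var/log/slurmctld.log | tail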