Ticket 9513 - scontrol reboot - node stays in drained state after reboot
Summary: scontrol reboot - node stays in drained state after reboot
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact: Brian Christiansen
Duplicates: 9514
Reported: 2020-08-04 14:32 MDT by lhuang
Modified: 2020-09-09 04:25 MDT

Site: NY Genome
Version Fixed: 20.02.5, 20.11.0pre1


Attachments
slurm conf (4.54 KB, text/plain) - 2020-08-04 14:32 MDT, lhuang
v1 (1.03 KB, patch) - 2020-08-05 06:32 MDT, Marcin Stolarek

Description lhuang 2020-08-04 14:32:19 MDT
Created attachment 15308
slurm conf

On our dev cluster, we are testing an upgrade from 19.05 to 20.02.3. After the upgrade we rebooted the nodes, but they did not come back from the drained state. I've attached our slurm.conf.

scontrol reboot devslurmvm0[1-2]
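
For reference, scontrol reboot on 20.02 also accepts an ASAP flag and a nextstate option per the scontrol man page. Explicitly requesting that nodes return to service after the reboot may serve as a workaround, though this was not tested in the ticket:

scontrol reboot nextstate=resume devslurmvm0[1-2]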

[2020-08-04T16:25:02.755] Set debug level to 6
[2020-08-04T16:25:14.852] debug2: Processing RPC: REQUEST_REBOOT_NODES from uid=0
[2020-08-04T16:25:14.852] reboot request queued for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:15.178] debug:  Queuing reboot request for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Performing purge of old job records
[2020-08-04T16:25:15.178] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-08-04T16:25:15.178] debug:  sched: Running job scheduler
[2020-08-04T16:25:15.178] debug2: Spawning RPC agent for msg_type REQUEST_REBOOT_NODES
[2020-08-04T16:25:15.306] debug:  backfill: beginning
[2020-08-04T16:25:15.306] debug:  backfill: no jobs to backfill
[2020-08-04T16:25:16.179] debug:  slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.179] agent/is_node_resp: node:devslurmvm02 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.179] debug:  slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.180] agent/is_node_resp: node:devslurmvm01 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.454] debug:  node_not_resp: node devslurmvm01 responded since msg sent
[2020-08-04T16:25:16.454] debug:  node_not_resp: node devslurmvm02 responded since msg sent
[2020-08-04T16:25:45.207] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:45.307] debug:  backfill: beginning
[2020-08-04T16:25:45.307] debug:  backfill: no jobs to backfill
[2020-08-04T16:25:51.937] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:51.937] Node devslurmvm02 rebooted 7 secs ago
[2020-08-04T16:25:51.937] node devslurmvm02 returned to service
[2020-08-04T16:25:51.937] debug2: _slurm_rpc_node_registration complete for devslurmvm02 usec=93
[2020-08-04T16:25:55.582] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:55.582] node devslurmvm01 returned to service
[2020-08-04T16:25:55.582] debug2: _slurm_rpc_node_registration complete for devslurmvm01 usec=61
[2020-08-04T16:26:07.086] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:07.086] debug2: _slurm_rpc_dump_partitions, size=198 usec=55
[2020-08-04T16:26:15.236] debug2: Testing job time limits and checkpoints
[2020-08-04T16:26:15.236] debug2: Performing purge of old job records
[2020-08-04T16:26:15.236] debug:  sched: Running job scheduler
[2020-08-04T16:26:15.307] debug:  backfill: beginning
[2020-08-04T16:26:15.307] debug:  backfill: no jobs to backfill
[2020-08-04T16:26:21.410] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:21.410] debug2: _slurm_rpc_dump_partitions, size=198 usec=45
[2020-08-04T16:26:45.266] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Performing purge of old job records
[2020-08-04T16:27:15.298] debug:  sched: Running job scheduler
[2020-08-04T16:27:15.298] debug2: Performing full system state save
[2020-08-04T16:27:15.303] debug2: Sending tres '1=9,2=6001,3=0,4=3,5=9,6=0,7=0,8=0' for cluster


[root@dev-slurm01 ~]# sinfo -lN
Tue Aug 04 16:29:08 2020
NODELIST      NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON               
devslurmvm01      1      dev*     drained 4       4:1:1   3000        0      1       v1 Reboot ASAP : reboot 
devslurmvm02      1      dev*     drained 4       4:1:1   3000        0      1       v2 Reboot ASAP : reboot
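
Until the fix is in place, the stuck drain can presumably be cleared by hand with a standard scontrol call; this is a general workaround, not a step taken from this ticket, and assumes the nodes are otherwise healthy after the reboot:

scontrol update NodeName=devslurmvm[01-02] State=RESUME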
Comment 1 Jason Booth 2020-08-04 14:41:05 MDT
*** Ticket 9514 has been marked as a duplicate of this ticket. ***
Comment 4 Marcin Stolarek 2020-08-06 03:49:47 MDT
I can confirm that this is a regression in the Slurm 20.02 release.

I have a patch that I'm moving to the review queue. Let me know if you want to give it a try locally before it passes our quality assurance process.

cheers,
Marcin
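
For anyone wanting to test ahead of the release, a minimal sketch of applying the v1 patch to a 20.02.3 source tree; the filename, install prefix, and -p level here are assumptions, not details from the ticket:

# Save attachment 15319 as v1.patch next to the source tree (name assumed).
cd slurm-20.02.3
patch -p1 < v1.patch                      # -p1 assumes a git-style diff
./configure --prefix=/opt/slurm && make   # rebuild with your usual options
sudo make install
sudo systemctl restart slurmctld          # the fix is controller-side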
Comment 5 lhuang 2020-08-06 08:38:32 MDT
Sure, please send it over and I'll give it a test.
Comment 6 Marcin Stolarek 2020-08-06 09:02:44 MDT
Comment on attachment 15319
v1

Switching the patch to public mode so Luis can verify it.
Comment 7 lhuang 2020-08-06 11:23:25 MDT
Can confirm the patch works. Do we know if this will be included in the next minor release?
Comment 8 Marcin Stolarek 2020-08-07 01:55:06 MDT
The patch is under review. I cannot guarantee that it will be the final solution, but I'll keep you posted.

cheers,
Marcin
Comment 13 Marcin Stolarek 2020-09-09 04:25:02 MDT
The reported issue was fixed by commit d423a64367e446d [1], which will be included in the upcoming Slurm 20.02.5 release.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/d423a64367e446d4fe42633eeaa75070b4220064
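
For sites that need the fix before 20.02.5 ships, a sketch of backporting the commit onto a local source checkout; the branch name follows SchedMD's usual convention but is an assumption here:

git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-20.02                                   # 20.02 maintenance branch (assumed name)
git cherry-pick d423a64367e446d4fe42633eeaa75070b4220064   # the commit from [1]
# Then rebuild and reinstall slurmctld as with any source build.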