Created attachment 15308 [details]
slurm.conf

On our dev cluster we are testing an upgrade from Slurm 19.05 to 20.02.3. After the upgrade we rebooted the nodes, but they did not come back from the drained state. I've attached the slurm.conf.

scontrol reboot devslurmvm0[1-2]

[2020-08-04T16:25:02.755] Set debug level to 6
[2020-08-04T16:25:14.852] debug2: Processing RPC: REQUEST_REBOOT_NODES from uid=0
[2020-08-04T16:25:14.852] reboot request queued for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:15.178] debug: Queuing reboot request for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Performing purge of old job records
[2020-08-04T16:25:15.178] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-08-04T16:25:15.178] debug: sched: Running job scheduler
[2020-08-04T16:25:15.178] debug2: Spawning RPC agent for msg_type REQUEST_REBOOT_NODES
[2020-08-04T16:25:15.306] debug: backfill: beginning
[2020-08-04T16:25:15.306] debug: backfill: no jobs to backfill
[2020-08-04T16:25:16.179] debug: slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.179] agent/is_node_resp: node:devslurmvm02 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.179] debug: slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.180] agent/is_node_resp: node:devslurmvm01 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.454] debug: node_not_resp: node devslurmvm01 responded since msg sent
[2020-08-04T16:25:16.454] debug: node_not_resp: node devslurmvm02 responded since msg sent
[2020-08-04T16:25:45.207] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:45.307] debug: backfill: beginning
[2020-08-04T16:25:45.307] debug: backfill: no jobs to backfill
[2020-08-04T16:25:51.937] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:51.937] Node devslurmvm02 rebooted 7 secs ago
[2020-08-04T16:25:51.937] node devslurmvm02 returned to service
[2020-08-04T16:25:51.937] debug2: _slurm_rpc_node_registration complete for devslurmvm02 usec=93
[2020-08-04T16:25:55.582] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:55.582] node devslurmvm01 returned to service
[2020-08-04T16:25:55.582] debug2: _slurm_rpc_node_registration complete for devslurmvm01 usec=61
[2020-08-04T16:26:07.086] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:07.086] debug2: _slurm_rpc_dump_partitions, size=198 usec=55
[2020-08-04T16:26:15.236] debug2: Testing job time limits and checkpoints
[2020-08-04T16:26:15.236] debug2: Performing purge of old job records
[2020-08-04T16:26:15.236] debug: sched: Running job scheduler
[2020-08-04T16:26:15.307] debug: backfill: beginning
[2020-08-04T16:26:15.307] debug: backfill: no jobs to backfill
[2020-08-04T16:26:21.410] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:21.410] debug2: _slurm_rpc_dump_partitions, size=198 usec=45
[2020-08-04T16:26:45.266] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Performing purge of old job records
[2020-08-04T16:27:15.298] debug: sched: Running job scheduler
[2020-08-04T16:27:15.298] debug2: Performing full system state save
[2020-08-04T16:27:15.303] debug2: Sending tres '1=9,2=6001,3=0,4=3,5=9,6=0,7=0,8=0' for cluster

[root@dev-slurm01 ~]# sinfo -lN
Tue Aug 04 16:29:08 2020
NODELIST      NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
devslurmvm01      1 dev*      drained     4 4:1:1    3000        0      1 v1       Reboot ASAP : reboot
devslurmvm02      1 dev*      drained     4 4:1:1    3000        0      1 v2       Reboot ASAP : reboot
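For anyone hitting the same regression before a fixed release is deployed: the stuck DRAIN flag can normally be cleared by hand with standard scontrol/sinfo commands. This is a sketch of a manual workaround, not part of the fix itself; the node names match the cluster above and would need adjusting elsewhere.

```shell
# Confirm the nodes are stuck drained with the reboot reason
# (REASON shows "Reboot ASAP : reboot" even though they already registered)
sinfo -lN -n devslurmvm[01-02]

# Manually return the nodes to service; State=RESUME clears the DRAIN flag
# on nodes that are otherwise healthy and registered with slurmctld
scontrol update NodeName=devslurmvm[01-02] State=RESUME

# Verify the nodes are back to idle
sinfo -N -n devslurmvm[01-02] -o "%N %T"
```

This only clears the symptom; until the patch is applied, each `scontrol reboot` will leave the nodes drained again.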
*** Ticket 9514 has been marked as a duplicate of this ticket. ***
I can confirm that it's a regression in the Slurm 20.02 release. I have a patch that I'm moving to the review queue. Let me know if you'd like to try it locally before it passes our quality assurance process.

cheers,
Marcin
Sure, please send it over and I'll give it a test.
Comment on attachment 15319 [details]
v1

Switching the patch to public mode so Luis can verify it.
Can confirm the patch works. Do we know if this will be included in the next minor release?
The patch is under review. I cannot guarantee that it will be a final solution, but I'll keep you posted. cheers, Marcin
The reported issue was fixed by commit d423a64367e446d[1], which will be included in the upcoming Slurm 20.02.5 release.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/d423a64367e446d4fe42633eeaa75070b4220064