On our dev cluster, we are testing an upgrade from 19.05 to 20.02.3. After the upgrade we rebooted the nodes, but they did not come back from the drained state. I've attached the slurm.conf.

scontrol reboot devslurmvm0[1-2]

[2020-08-04T16:25:02.755] Set debug level to 6
[2020-08-04T16:25:14.852] debug2: Processing RPC: REQUEST_REBOOT_NODES from uid=0
[2020-08-04T16:25:14.852] reboot request queued for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:15.178] debug: Queuing reboot request for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Performing purge of old job records
[2020-08-04T16:25:15.178] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-08-04T16:25:15.178] debug: sched: Running job scheduler
[2020-08-04T16:25:15.178] debug2: Spawning RPC agent for msg_type REQUEST_REBOOT_NODES
[2020-08-04T16:25:15.306] debug: backfill: beginning
[2020-08-04T16:25:15.306] debug: backfill: no jobs to backfill
[2020-08-04T16:25:16.179] debug: slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.179] agent/is_node_resp: node:devslurmvm02 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.179] debug: slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.180] agent/is_node_resp: node:devslurmvm01 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.454] debug: node_not_resp: node devslurmvm01 responded since msg sent
[2020-08-04T16:25:16.454] debug: node_not_resp: node devslurmvm02 responded since msg sent
[2020-08-04T16:25:45.207] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:45.307] debug: backfill: beginning
[2020-08-04T16:25:45.307] debug: backfill: no jobs to backfill
[2020-08-04T16:25:51.937] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:51.937] Node devslurmvm02 rebooted 7 secs ago
[2020-08-04T16:25:51.937] node devslurmvm02 returned to service
[2020-08-04T16:25:51.937] debug2: _slurm_rpc_node_registration complete for devslurmvm02 usec=93
[2020-08-04T16:25:55.582] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:55.582] node devslurmvm01 returned to service
[2020-08-04T16:25:55.582] debug2: _slurm_rpc_node_registration complete for devslurmvm01 usec=61
[2020-08-04T16:26:07.086] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:07.086] debug2: _slurm_rpc_dump_partitions, size=198 usec=55
[2020-08-04T16:26:15.236] debug2: Testing job time limits and checkpoints
[2020-08-04T16:26:15.236] debug2: Performing purge of old job records
[2020-08-04T16:26:15.236] debug: sched: Running job scheduler
[2020-08-04T16:26:15.307] debug: backfill: beginning
[2020-08-04T16:26:15.307] debug: backfill: no jobs to backfill
[2020-08-04T16:26:21.410] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:21.410] debug2: _slurm_rpc_dump_partitions, size=198 usec=45
[2020-08-04T16:26:45.266] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Performing purge of old job records
[2020-08-04T16:27:15.298] debug: sched: Running job scheduler
[2020-08-04T16:27:15.298] debug2: Performing full system state save
[2020-08-04T16:27:15.303] debug2: Sending tres '1=9,2=6001,3=0,4=3,5=9,6=0,7=0,8=0' for cluster

[root@dev-slurm01 ~]# sinfo -lN
Tue Aug 04 16:29:08 2020
NODELIST      NODES PARTITION  STATE   CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
devslurmvm01      1 dev*       drained    4  4:1:1    3000         0       1  v1        Reboot ASAP : reboot
devslurmvm02      1 dev*       drained    4  4:1:1    3000         0       1  v2        Reboot ASAP : reboot
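For anyone hitting the same state before the duplicate ticket's fix lands: the nodes register fine after the reboot but keep the "Reboot ASAP : reboot" drain reason. As a workaround (a sketch only, not the official resolution of this ticket), the reboot can be issued with an explicit next state, or an already-stuck node can be resumed by hand:

scontrol reboot ASAP nextstate=RESUME devslurmvm0[1-2]

# For nodes already stuck drained after the reboot completed:
scontrol update NodeName=devslurmvm[01-02] State=RESUME

Both `nextstate=` on `scontrol reboot` and `State=RESUME` on `scontrol update` are standard scontrol options; whether they are sufficient here depends on the root cause tracked in the duplicate ticket.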
Closing duplicate. *** This ticket has been marked as a duplicate of ticket 9513 ***