Ticket 9514

Summary: scontrol reboot - node stays in drained state after reboot
Product: Slurm
Reporter: lhuang
Component: slurmctld
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Version: 20.02.3
Hardware: Linux
OS: Linux
Site: NY Genome

Description lhuang 2020-08-04 14:32:35 MDT
On our dev cluster, we are testing the upgrade from 19.05 to 20.02.3. After the upgrade we rebooted the nodes, but they did not come back from the drained state. I've attached the slurm.conf.

scontrol reboot devslurmvm0[1-2]
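
For reference, scontrol reboot in this release also accepts a nextstate option. Assuming it is honored here, an equivalent request that should return the nodes to service once the reboot completes would be:

scontrol reboot nextstate=resume devslurmvm0[1-2]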

[2020-08-04T16:25:02.755] Set debug level to 6
[2020-08-04T16:25:14.852] debug2: Processing RPC: REQUEST_REBOOT_NODES from uid=0
[2020-08-04T16:25:14.852] reboot request queued for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:15.178] debug:  Queuing reboot request for nodes devslurmvm[01-02]
[2020-08-04T16:25:15.178] debug2: Performing purge of old job records
[2020-08-04T16:25:15.178] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-08-04T16:25:15.178] debug:  sched: Running job scheduler
[2020-08-04T16:25:15.178] debug2: Spawning RPC agent for msg_type REQUEST_REBOOT_NODES
[2020-08-04T16:25:15.306] debug:  backfill: beginning
[2020-08-04T16:25:15.306] debug:  backfill: no jobs to backfill
[2020-08-04T16:25:16.179] debug:  slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.179] agent/is_node_resp: node:devslurmvm02 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.179] debug:  slurm_send_only_node_msg: poll timed out with 0 outstanding: Resource temporarily unavailable
[2020-08-04T16:25:16.180] agent/is_node_resp: node:devslurmvm01 RPC:REQUEST_REBOOT_NODES : Resource temporarily unavailable
[2020-08-04T16:25:16.454] debug:  node_not_resp: node devslurmvm01 responded since msg sent
[2020-08-04T16:25:16.454] debug:  node_not_resp: node devslurmvm02 responded since msg sent
[2020-08-04T16:25:45.207] debug2: Testing job time limits and checkpoints
[2020-08-04T16:25:45.307] debug:  backfill: beginning
[2020-08-04T16:25:45.307] debug:  backfill: no jobs to backfill
[2020-08-04T16:25:51.937] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:51.937] Node devslurmvm02 rebooted 7 secs ago
[2020-08-04T16:25:51.937] node devslurmvm02 returned to service
[2020-08-04T16:25:51.937] debug2: _slurm_rpc_node_registration complete for devslurmvm02 usec=93
[2020-08-04T16:25:55.582] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-08-04T16:25:55.582] node devslurmvm01 returned to service
[2020-08-04T16:25:55.582] debug2: _slurm_rpc_node_registration complete for devslurmvm01 usec=61
[2020-08-04T16:26:07.086] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:07.086] debug2: _slurm_rpc_dump_partitions, size=198 usec=55
[2020-08-04T16:26:15.236] debug2: Testing job time limits and checkpoints
[2020-08-04T16:26:15.236] debug2: Performing purge of old job records
[2020-08-04T16:26:15.236] debug:  sched: Running job scheduler
[2020-08-04T16:26:15.307] debug:  backfill: beginning
[2020-08-04T16:26:15.307] debug:  backfill: no jobs to backfill
[2020-08-04T16:26:21.410] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-08-04T16:26:21.410] debug2: _slurm_rpc_dump_partitions, size=198 usec=45
[2020-08-04T16:26:45.266] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Testing job time limits and checkpoints
[2020-08-04T16:27:15.298] debug2: Performing purge of old job records
[2020-08-04T16:27:15.298] debug:  sched: Running job scheduler
[2020-08-04T16:27:15.298] debug2: Performing full system state save
[2020-08-04T16:27:15.303] debug2: Sending tres '1=9,2=6001,3=0,4=3,5=9,6=0,7=0,8=0' for cluster


[root@dev-slurm01 ~]# sinfo -lN
Tue Aug 04 16:29:08 2020
NODELIST      NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON               
devslurmvm01      1      dev*     drained 4       4:1:1   3000        0      1       v1 Reboot ASAP : reboot 
devslurmvm02      1      dev*     drained 4       4:1:1   3000        0      1       v2 Reboot ASAP : reboot
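
The drain reason shown above is the queued "Reboot ASAP : reboot" request itself. Assuming the reboots completed cleanly (as the registration messages in the slurmctld log suggest), the nodes can likely be returned to service manually with:

scontrol update NodeName=devslurmvm[01-02] State=RESUME
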
Comment 1 Jason Booth 2020-08-04 14:41:05 MDT
Closing as a duplicate.

*** This ticket has been marked as a duplicate of ticket 9513 ***