Ticket 14473

Summary: What is the correct way to bring down a Slurm cluster?
Product: Slurm    Reporter: Openfive Support <it_support>
Component: slurmctld    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED    QA Contact:
Severity: 3 - Medium Impact
Priority: ---    CC: bart, naresh.midatha
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Version Fixed: 20.11.8

Description Openfive Support 2022-07-05 00:42:56 MDT

    
Comment 1 Openfive Support 2022-07-05 00:45:00 MDT
Hi Team,

We have the following in our Slurm cluster:

1. primary controller
2. secondary / backup controller
3. Slurmdb server 
4. compute nodes

We want to upgrade the server firmware and hence need to take down the cluster.

What is the correct sequence to take the machines down, and to bring them back up?

Regards,
Anurag
Comment 2 Dominik Bartkiewicz 2022-07-05 03:36:18 MDT
Hi

-1) drain nodes or create maintenance reservation
0) wait until all jobs completed
1) compute nodes
2) secondary / backup controller
3) primary controller
4) Slurmdb server

Steps -1) and 0) are recommended for accounting clarity.
Cancelling or requeueing the running jobs can speed up step 0).
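
As a sketch, steps -1) and 0) could look like the following. The node range, partition, and reservation parameters are placeholders, and the `run` wrapper only prints each command (drop the `echo` to execute for real):

```shell
# Dry-run sketch of steps -1) and 0); adapt node names/times to your site.
run() { echo "+ $*"; }   # prints the command instead of executing it

# -1) Drain all nodes...
run scontrol update 'nodename=node[01-99]' state=DRAIN reason="firmware upgrade"
# ...or, alternatively, create a maintenance reservation over everything:
run scontrol create reservation reservationname=maint starttime=now \
    duration=240 users=root flags=maint,ignore_jobs nodes=ALL

# 0) Wait until the queue is empty (uncomment on a live cluster):
# while [ -n "$(squeue -h)" ]; do sleep 60; done
```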

Dominik
Comment 3 Openfive Support 2022-07-05 05:04:26 MDT
Hi Dominik,

0) wait until all jobs completed: we want to cancel all PD/R jobs at once. Is there a command for that?

1. in compute nodes: systemctl stop slurmd
2. secondary / backup controller: systemctl stop slurmctld
3. primary controller: systemctl stop slurmctld
4. Slurmdb server: systemctl stop slurmdbd

Is this sufficient to take down the cluster?

Regards,
Anurag
Comment 4 Dominik Bartkiewicz 2022-07-05 05:52:30 MDT
Hi

I suggest swapping 2) and 3); this prevents the backup controller from taking over.
If you don't have too many jobs and the cluster is relatively low-loaded, you can cancel each job individually:

e.g.: "squeue -h -O jobid | xargs -n 1 scancel"

Otherwise, you can restart the primary slurmctld with "-c" if you don't care about the jobs. This will cancel and delete all of them.
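
The cancel pipeline can be tried end to end without a live cluster by stubbing `squeue` and `scancel`; the stubs below (and their three fake job ids) are purely illustrative:

```shell
# Stub squeue/scancel so the pipeline runs anywhere; on a real cluster the
# real binaries are used and these stubs simply don't exist.
bindir=$(mktemp -d)
cat > "$bindir/squeue" <<'EOF'
#!/bin/sh
# pretend three jobs are queued (flags like -h -O jobid are ignored here)
printf '101\n102\n103\n'
EOF
cat > "$bindir/scancel" <<'EOF'
#!/bin/sh
echo "would cancel job $1"
EOF
chmod +x "$bindir/squeue" "$bindir/scancel"
PATH="$bindir:$PATH"

# The pipeline from the ticket: list every job id, cancel them one by one
squeue -h -O jobid | xargs -n 1 scancel
```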

Dominik
Comment 5 Openfive Support 2022-07-05 23:43:14 MDT
ok Dominik,

Thanks. One more query is 

"squeue -h -O jobid |xargs -n 1 scancel:

Will the above command keep the job in sacct output . It will be good if we can keep those details. 

Regards,
Anurag
Comment 6 Dominik Bartkiewicz 2022-07-06 02:44:43 MDT
Hi

Yes, scancel doesn't remove any information from the database.

Both running and pending jobs will be marked as CANCELLED in the database, but all available information will be there.
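
For illustration, the cancelled jobs stay queryable with `sacct` (real options; the date and field list below are only examples, and the stub output stands in for a live slurmdbd):

```shell
# On a real cluster you would query accounting directly, e.g.:
#   sacct --starttime=2022-07-05 --state=CANCELLED --format=JobID,JobName,State,Elapsed
# Stubbed sacct below so the idea can be shown without a database:
sacct() {
    printf '101  train  CANCELLED  00:12:34\n'
    printf '102  eval   CANCELLED  00:00:00\n'
}
sacct --starttime=2022-07-05 --state=CANCELLED | awk '{print $1, $3}'
```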

Dominik
Comment 7 Dominik Bartkiewicz 2022-07-12 05:29:21 MDT
Hi

Did I answer your questions?
Do you have any additional questions?

Dominik
Comment 8 Openfive Support 2022-07-12 07:02:29 MDT
Hi Dominik,

Sorry for the delayed response. Let me rephrase the steps.

0) drain nodes or create a maintenance reservation
1) scancel all jobs, as we want to cancel everything (squeue -h -O jobid | xargs -n 1 scancel)
2) shut down the primary controller
3) shut down the secondary / backup controller
4) shut down the slurmdbd server

Regards,
Anurag
Comment 9 Dominik Bartkiewicz 2022-07-12 09:05:25 MDT
Hi 

This shouldn't change much, but I suggest stopping the backup controller first.
In step 0) you can also block submission of new jobs by setting partitions to an INACTIVE state (scontrol update partition=<part_name> state=INACTIVE).

0) drain nodes or create a maintenance reservation, and set all partitions to INACTIVE
1) scancel all jobs, as we want to cancel everything (squeue -h -O jobid | xargs -n 1 scancel)
2) now you can safely stop slurmd and reboot / power off the nodes
3) shut down the secondary / backup controller
4) shut down the primary controller
5) shut down the slurmdbd server
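
Put together as one dry-run script, the sequence might look like this. Hostnames, the node range, and the partition name are placeholders, and `run` only prints each command (change it to `"$@"` to execute for real):

```shell
# Dry-run sketch of the full shutdown sequence; adapt names to your site.
run() { echo "+ $*"; }   # prints instead of executing

# 0) block new submissions and drain the nodes
run scontrol update partition=batch state=INACTIVE
run scontrol update 'nodename=node[01-99]' state=DRAIN reason="maintenance"
# 1) cancel everything still queued or running
run sh -c 'squeue -h -O jobid | xargs -n 1 scancel'
# 2) stop slurmd on the compute nodes, then reboot/power them off
run ssh node01 systemctl stop slurmd
# 3) backup controller first, so it cannot take over
run ssh backup-ctl systemctl stop slurmctld
# 4) then the primary controller
run ssh primary-ctl systemctl stop slurmctld
# 5) finally the database daemon
run ssh dbd-host systemctl stop slurmdbd
```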

Dominik
Comment 10 Dominik Bartkiewicz 2022-07-28 05:28:00 MDT
Hi

Let me know if you have any questions.

Dominik
Comment 11 Openfive Support 2022-08-01 06:57:16 MDT
Thanks, closing it.