| Summary: | what is the correct way to bring down a SLURM cluster | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Openfive Support <it_support> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart, naresh.midatha |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Alphawave | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 20.11.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Openfive Support
2022-07-05 00:42:56 MDT
Hi Team,

We have the following in our Slurm cluster:
1. primary controller
2. secondary / backup controller
3. Slurmdb server
4. compute nodes

We want to upgrade the server firmware, so we need to take down the cluster. What is the correct sequence for taking the machines down, and vice versa for bringing them back up?

Regards, Anurag

Hi,

-1) drain nodes or create a maintenance reservation
0) wait until all jobs have completed
1) compute nodes
2) secondary / backup controller
3) primary controller
4) Slurmdb server

Points -1) and 0) are recommended for accounting clarity. Canceling or requeueing running jobs can speed up point 0).

Dominik

Hi Dominik,

0) wait until all jobs have completed: we want to cancel all PD / R jobs at once. Is there a command for that?
1. compute nodes: systemctl stop slurmd
2. secondary / backup controller: systemctl stop slurmctld
3. primary controller: systemctl stop slurmctld
4. Slurmdb server: systemctl stop slurmdbd

Is this sufficient to take down the cluster?

Regards, Anurag

Hi,

I suggest swapping 2) and 3); this prevents a takeover by the backup controller. Depending on how many jobs you have and how busy your cluster is: if it is relatively lightly loaded, you can cancel each job individually, e.g. "squeue -h -O jobid | xargs -n 1 scancel". Otherwise, you can restart the primary slurmctld with "-c" if you don't care about jobs; this will cancel and delete all jobs.

Dominik

OK Dominik, thanks. One more query about "squeue -h -O jobid | xargs -n 1 scancel": will the above command keep the jobs in the sacct output? It would be good if we could keep those details.

Regards, Anurag

Hi,

Yes, scancel doesn't remove any information from the database. Both running and pending jobs will be marked as CANCELLED in the database, but all available information will still be there.

Dominik

Hi,

Did I answer your questions? Do you have any additional questions?

Dominik

Hi Dominik,

Sorry for the delayed response. Let me rephrase the steps:
0) drain nodes or create a maintenance reservation
1) scancel all jobs, since we want to cancel everything ("squeue -h -O jobid | xargs -n 1 scancel")
2) shut down the primary controller
3) shut down the secondary / backup controller
4) shut down the Slurmdb server

Regards, Anurag

Hi,

This shouldn't change much, but I suggest stopping the backup controller first. In step 0) you can also block submission of new jobs by setting the partitions to an INACTIVE state (scontrol update partition=<part_name> state=INACTIVE).

0) drain nodes or create a maintenance reservation, and set all partitions to INACTIVE
1) scancel all jobs, since we want to cancel everything ("squeue -h -O jobid | xargs -n 1 scancel")
2) now you can safely stop slurmd and reboot / power off the nodes
3) shut down the secondary / backup controller
4) shut down the primary controller
5) shut down the Slurmdb server

Dominik

Hi,

Let me know if you have any questions.

Dominik

Thanks, closing it.
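For reference, the final sequence agreed in this thread could be sketched as a shell script. This is a sketch under assumptions, not anything from the ticket itself: the partition name, node names, and controller hostnames are placeholders, and with DRY_RUN=1 (the default) the script only prints each command instead of executing it, so the order can be reviewed before running it for real.

```shell
#!/bin/sh
# Sketch of the shutdown order from this ticket. All partition, node,
# and host names are placeholders. DRY_RUN=1 prints each command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 0) block new submissions and drain the cluster
run scontrol update partition=batch state=INACTIVE

# 1) cancel everything still pending or running
run sh -c "squeue -h -O jobid | xargs -n 1 scancel"

# 2) stop slurmd on the compute nodes, then reboot / power them off
for node in node01 node02; do
    run ssh "$node" systemctl stop slurmd
done

# 3) backup controller first, so it cannot take over from the primary...
run ssh backup-ctl systemctl stop slurmctld
# 4) ...then the primary controller
run ssh primary-ctl systemctl stop slurmctld
# 5) finally the accounting daemon
run ssh dbd-host systemctl stop slurmdbd
```

Bringing the cluster back up would be roughly the reverse: slurmdbd, then the primary controller, then the backup, then slurmd on the nodes, and finally setting the partitions back to state=UP.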
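Dominik's point that scancel leaves the accounting records intact can be spot-checked with sacct, run before step 5) or after the cluster is back up (slurmdbd must be reachable). A minimal sketch: the start time below is just the ticket date, and the snippet is guarded so it degrades to a message on a machine without Slurm installed.

```shell
#!/bin/sh
# Cancelled jobs stay in slurmdbd; they are simply marked CANCELLED.
# Guarded so the snippet is a harmless no-op where sacct is missing.
check_cancelled() {
    if command -v sacct >/dev/null 2>&1; then
        # List everything cancelled since the (placeholder) maintenance start.
        sacct --starttime=2022-07-05 --state=CANCELLED \
              --format=JobID,JobName,Partition,State,Elapsed
    else
        echo "sacct not available on this host"
    fi
}
check_cancelled
```

Each cancelled job should still appear with its full record, confirming that nothing was dropped from the database.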