| Summary: | what is the correct way to bring down a SLURM cluster | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Openfive Support <it_support> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart, naresh.midatha |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Alphawave | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 20.11.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Openfive Support
2022-07-05 00:42:56 MDT
Hi Team,

We have the following in our Slurm cluster:
1. primary controller
2. secondary / backup controller
3. Slurmdb server
4. compute nodes

We want to upgrade the server firmware, so we need to take down the cluster. What is the correct sequence for taking the machines down, and vice versa for bringing them back up?

Regards, Anurag

Hi,

-1) drain nodes or create a maintenance reservation
0) wait until all jobs have completed
1) compute nodes
2) secondary / backup controller
3) primary controller
4) Slurmdb server

Points -1) and 0) are recommended for accounting clarity. Canceling or requeueing running jobs can speed up point 0).

Dominik

Hi Dominik,

0) wait until all jobs have completed: we want to cancel all PD / R jobs at once. Is there a command for that?
1. compute nodes: systemctl stop slurmd
2. secondary / backup controller: systemctl stop slurmctld
3. primary controller: systemctl stop slurmctld
4. Slurmdb server: systemctl stop slurmdbd

Is this sufficient to take down the cluster?

Regards, Anurag

Hi,

I suggest swapping 2) and 3); this prevents a takeover by the backup controller. Depending on how many jobs you have and how busy your cluster is: if it is relatively lightly loaded, you can cancel each job individually, e.g. "squeue -h -O jobid | xargs -n 1 scancel". Otherwise, you can restart the primary slurmctld with "-c" if you don't care about jobs; this will cancel and delete all jobs.

Dominik

OK Dominik, thanks. One more query about "squeue -h -O jobid | xargs -n 1 scancel": will the above command keep the jobs in the sacct output? It would be good if we could keep those details.

Regards, Anurag

Hi,

Yes, scancel doesn't remove any information from the database. Both running and pending jobs will be marked as CANCELLED in the database, but all available information will still be there.

Dominik

Hi,

Did I answer your questions? Do you have any additional questions?

Dominik

Hi Dominik,

Sorry for the delayed response. Let me rephrase the steps:
0) drain nodes or create a maintenance reservation
1) scancel all jobs, since we want to cancel everything ("squeue -h -O jobid | xargs -n 1 scancel")
2) shut down the primary controller
3) shut down the secondary / backup controller
4) shut down the Slurmdb server

Regards, Anurag

Hi,

This shouldn't change much, but I suggest stopping the backup controller first. In step 0) you can also block submission of new jobs by setting the partitions to an INACTIVE state (scontrol update partition=<part_name> state=INACTIVE).

0) drain nodes or create a maintenance reservation, and set all partitions to INACTIVE
1) scancel all jobs, since we want to cancel everything ("squeue -h -O jobid | xargs -n 1 scancel")
2) now you can safely stop slurmd and reboot / power off the nodes
3) shut down the secondary / backup controller
4) shut down the primary controller
5) shut down the Slurmdb server

Dominik

Hi,

Let me know if you have any questions.

Dominik

Thanks, closing it.
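For reference, the final sequence agreed in this thread could be sketched as a shell script. This is a sketch under assumptions, not anything from the ticket itself: the partition name, node names, and controller hostnames are placeholders, and with DRY_RUN=1 (the default) the script only prints each command instead of executing it, so the order can be reviewed before running it for real.

```shell
#!/bin/sh
# Sketch of the shutdown order from this ticket. All partition, node,
# and host names are placeholders. DRY_RUN=1 prints each command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 0) block new submissions and drain the cluster
run scontrol update partition=batch state=INACTIVE

# 1) cancel everything still pending or running
run sh -c "squeue -h -O jobid | xargs -n 1 scancel"

# 2) stop slurmd on the compute nodes, then reboot / power them off
for node in node01 node02; do
    run ssh "$node" systemctl stop slurmd
done

# 3) backup controller first, so it cannot take over from the primary...
run ssh backup-ctl systemctl stop slurmctld
# 4) ...then the primary controller
run ssh primary-ctl systemctl stop slurmctld
# 5) finally the accounting daemon
run ssh dbd-host systemctl stop slurmdbd
```

Bringing the cluster back up would be roughly the reverse: slurmdbd, then the primary controller, then the backup, then slurmd on the nodes, and finally setting the partitions back to state=UP.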
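Dominik's point that scancel leaves the accounting records intact can be spot-checked with sacct, run before step 5) or after the cluster is back up (slurmdbd must be reachable). A minimal sketch: the start time below is just the ticket date, and the snippet is guarded so it degrades to a message on a machine without Slurm installed.

```shell
#!/bin/sh
# Cancelled jobs stay in slurmdbd; they are simply marked CANCELLED.
# Guarded so the snippet is a harmless no-op where sacct is missing.
check_cancelled() {
    if command -v sacct >/dev/null 2>&1; then
        # List everything cancelled since the (placeholder) maintenance start.
        sacct --starttime=2022-07-05 --state=CANCELLED \
              --format=JobID,JobName,Partition,State,Elapsed
    else
        echo "sacct not available on this host"
    fi
}
check_cancelled
```

Each cancelled job should still appear with its full record, confirming that nothing was dropped from the database.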