Ticket 2970 - Purge queue after shutdown
Summary: Purge queue after shutdown
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.11
Hardware: Linux
Severity: 2 - High Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-08-04 07:41 MDT by Davide Vanzo
Modified: 2016-08-04 16:34 MDT

See Also:
Site: Vanderbilt
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Davide Vanzo 2016-08-04 07:41:08 MDT
Hello guys,
we recently had a cluster downtime and before shutting down Slurm we were not able to cancel all jobs left in the queue. Since we are now ready to bring Slurm back, we want to be sure that the queue will be completely empty, so that Slurm will not try to restart jobs across the cluster.
How can we do this before starting slurmctld?
Thanks!

Best,
Davide
Comment 1 Dominik Bartkiewicz 2016-08-04 07:48:34 MDT
Hi

You can use the -c option on the initial start of slurmctld.

http://slurm.schedmd.com/slurmctld.html:

-c
    Clear all previous slurmctld state from its last checkpoint. With this option, all jobs, including both running and queued, and all node states, will be deleted. Without this option, previously running jobs will be preserved along with node State of DOWN, DRAINED and DRAINING nodes and the associated Reason field for those nodes. NOTE: It is rare you would ever want to use this in production as all jobs will be killed. 



Dominik
Comment 2 Davide Vanzo 2016-08-04 07:54:01 MDT
Hi Dominik,
thank you for your quick reply.
If I understand correctly, that would not only purge the job queue but also delete all node states. Am I right? That is something we don't want since that would release nodes that have been set as DOWN for multiple reasons.


DV
Comment 3 Moe Jette 2016-08-04 08:27:45 MDT
(In reply to Davide Vanzo from comment #2)
> Hi Dominik,
> thank you for your quick reply.
> If I understand correctly, that would not only purge the job queue but also
> delete all node states. Am I right? That is something we don't want since
> that would release nodes that have been set as DOWN for multiple reasons.
> 
> 
> DV

You are correct.

You will need to delete the "job_state" and "job_state.old" files from your configured "StateSaveLocation" directory in order to delete the jobs, but retain all other state information.
Comment 4 Davide Vanzo 2016-08-04 09:11:41 MDT
Moe,
that is what I thought but I preferred to get confirmation from you before touching stuff I should not touch.
A couple of related questions.

1) I also have a job_state.new file in the same folder. What is that? Should I delete that one too?

2) Will Slurm update the database accordingly by setting the purged jobs as "CANCELLED", or will they remain in their old state in the database?

DV


Comment 5 Dominik Bartkiewicz 2016-08-04 09:48:10 MDT
You can safely remove that file too; it is a temporary file used while the job_state file is being updated.
Dominik
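Putting comments 3 and 5 together, the purge amounts to removing the three job-state files while leaving everything else in StateSaveLocation intact. A minimal sketch, with a throwaway directory standing in for the real StateSaveLocation (check slurm.conf for the actual path, and stop slurmctld before touching it):

```shell
# Stand-in for the directory named by StateSaveLocation in slurm.conf;
# created here only so the sketch is self-contained.
STATE_DIR="$(mktemp -d)"
touch "$STATE_DIR/job_state" "$STATE_DIR/job_state.old" "$STATE_DIR/job_state.new"
touch "$STATE_DIR/node_state"   # node state must survive the purge

# Remove only the job-state files. job_state.new is a temporary used while
# job_state is rewritten, so it is safe to delete as well (comment 5).
rm -f "$STATE_DIR/job_state" "$STATE_DIR/job_state.old" "$STATE_DIR/job_state.new"

ls "$STATE_DIR"   # only node_state remains
```

With the job-state files gone, slurmctld can then be started normally (without -c), so DOWN/DRAINED node states and their Reason fields are preserved.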
Comment 6 Dominik Bartkiewicz 2016-08-04 10:07:43 MDT
And yes, all info in the database will be updated correctly.

Dominik
Comment 7 Davide Vanzo 2016-08-04 10:09:42 MDT
Great, thank you Dominik.
You can close this ticket now.

Have a great day,
DV


Comment 8 Dominik Bartkiewicz 2016-08-04 10:21:05 MDT
Closing as resolved/infogiven.
Please reopen if there's anything else I can address.

Dominik
Comment 9 Davide Vanzo 2016-08-04 15:27:34 MDT
Dominik,
your solution worked well, but it had the unexpected and undesired side effect of resetting the job ID counter, so new jobs now start from 1. Is there a way to set the counter to a different value?

Davide
Comment 10 Danny Auble 2016-08-04 15:30:49 MDT
FirstJobId

http://slurm.schedmd.com/slurm.conf.html#OPT_FirstJobId
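The parameter goes in slurm.conf; the value below is only an example (pick something past the highest job ID already recorded, and restart slurmctld for it to take effect):

```
# slurm.conf (example value, not from this ticket)
FirstJobId=1000001
```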
Comment 13 Davide Vanzo 2016-08-04 16:28:19 MDT
That did the trick.
Thanks again. You can re-close the ticket now.

Davide
