| Summary: | Purge queue after shutdown | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Davide Vanzo <davide.vanzo> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | bart |
| Version: | 15.08.11 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Vanderbilt | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Davide Vanzo
2016-08-04 07:41:08 MDT
Comment 1
Dominik Bartkiewicz

Hi,

You can use the -c option on the initial start of slurmctld (http://slurm.schedmd.com/slurmctld.html):

> -c  Clear all previous slurmctld state from its last checkpoint. With this option, all jobs, including both running and queued, and all node states, will be deleted. Without this option, previously running jobs will be preserved along with node State of DOWN, DRAINED and DRAINING nodes and the associated Reason field for those nodes.

NOTE: It is rare you would ever want to use this in production as all jobs will be killed.

Dominik

Comment 2
Davide Vanzo

Hi Dominik,
Thank you for your quick reply. If I understand correctly, that would not only purge the job queue but also delete all node states. Am I right? That is something we don't want, since it would release nodes that have been set to DOWN for multiple reasons.

DV

Comment 3
Moe Jette

(In reply to Davide Vanzo from comment #2)
> If I understand correctly, that would not only purge the job queue but also
> delete all node states. Am I right? That is something we don't want since
> that would release nodes that have been set as DOWN for multiple reasons.

You are correct. You will need to delete the "job_state" and "job_state.old" files from your configured "StateSaveLocation" directory in order to delete the jobs, but retain all other state information.

Comment 4
Davide Vanzo

Moe, that is what I thought, but I preferred to get confirmation from you before touching anything I should not touch. A couple of related questions:

1) I also have a job_state.new file in the same folder. What is that? Should I delete that one too?
2) Will Slurm update the database accordingly by setting the purged jobs to "CANCELLED", or will they remain in their old state in the database?
DV

Comment 5
Dominik Bartkiewicz

You can safely remove that file too; it is used while the job_state file is being updated.

Dominik

Comment 6
Dominik Bartkiewicz

And yes, all info in the database will be correctly updated.

Dominik

Comment 7
Davide Vanzo

Great, thank you Dominik. You can close this ticket now.

Have a great day,
DV

Comment 8
Dominik Bartkiewicz

Closing as resolved/infogiven. Please reopen if there's anything else I can address.

Dominik

Comment 9
Davide Vanzo

Dominik, your solution worked well, but it had the unexpected and undesired effect of resetting the JOBID counter, so new jobs now start from 1. Is there a way to set the counter to a different value?

Davide

Comment 10
Danny Auble

FirstJobId
http://slurm.schedmd.com/slurm.conf.html#OPT_FirstJobId

Comment 11
Davide Vanzo

That did the trick. Thanks again. You can re-close the ticket now.

Davide
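Danny's pointer is a slurm.conf parameter. A sketch of the fragment, assuming you want new jobs to continue from roughly where the old counter left off (the value below is purely illustrative):

```
# slurm.conf (fragment) -- restore the job ID counter after a state purge.
# FirstJobId is the ID assigned to the next newly submitted job; pick a
# value above your last pre-purge job ID. Restart slurmctld to apply.
FirstJobId=100000
```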
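The state-file purge Moe describes can be sketched as a shell session. This is a minimal demo on a throw-away directory standing in for StateSaveLocation (the real path is whatever your slurm.conf configures; the file names are the ones named in the thread), and slurmctld must be stopped before touching the files:

```shell
# Demo of purging job state while preserving node state.
# A temporary directory stands in for the real StateSaveLocation
# (check slurm.conf for the actual path). On a real system, stop the
# controller first, e.g. "systemctl stop slurmctld".
STATE_SAVE_DIR=$(mktemp -d)

# Fake checkpoint files as they would appear on a live controller.
touch "$STATE_SAVE_DIR/job_state" "$STATE_SAVE_DIR/job_state.old" \
      "$STATE_SAVE_DIR/job_state.new" "$STATE_SAVE_DIR/node_state"

# Remove only the job checkpoints. Per comment 5, job_state.new is a
# scratch file used while job_state is rewritten, so it may go as well.
rm -f "$STATE_SAVE_DIR/job_state" \
      "$STATE_SAVE_DIR/job_state.old" \
      "$STATE_SAVE_DIR/job_state.new"

# node_state survives, so DOWN/DRAINED nodes keep their state and Reason.
ls "$STATE_SAVE_DIR"
```

On restart, slurmctld finds no job checkpoint and comes up with an empty queue, while node state is read back from the untouched node_state file.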