Ticket 5116

Summary: New (and recurring) Jobid reuse behavior
Product: Slurm
Reporter: rl303f
Component: Other
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact    
Priority: --- CC: btmiller, hooverdm, sfellini, susanc, wresch
Version: 17.02.9   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4538
Site: NIH

Description rl303f 2018-05-01 07:09:04 MDT
Good morning,

We would like to report some new and, we think, unusual
behavior in our Slurm cluster.  We have been running Slurm
for some time now, and we recently rolled over from
MaxJobId=67043328 to the default FirstJobId=1.  Now, whenever
we restart Slurm for a configuration change or maintenance,
it starts out re-using somewhat random older job IDs for
newly submitted jobs.
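For reference, the two slurm.conf parameters involved (both are
real Slurm options; the values shown are simply the ones described
in this report) would look something like this:

```
# slurm.conf (relevant fragment)
FirstJobId=1          # default; IDs restart here after rollover
MaxJobId=67043328     # upper bound before job IDs wrap to FirstJobId
```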

For example, since we have already exceeded MaxJobId and rolled
over to 1, Slurm has already incremented job IDs up to 9xx,xxx.
However, if we restart Slurm, it starts handing out job IDs in
the 6x,xxx,xxx range, scattered here and there, as if it were
randomly "filling in holes".  It doesn't pick up where it left
off and simply increment from there.

This has happened twice now and is new behavior that, if
possible, we would like to suppress.  We would like job numbers
to remain continuous and incremental, re-using job IDs only
when we reach MaxJobId.
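The expected allocation policy described above can be sketched as
follows (a hypothetical illustration, not Slurm's actual code):
increment monotonically and wrap to FirstJobId only after MaxJobId,
rather than back-filling freed older IDs after a restart.

```python
# Hypothetical sketch of the expected job-ID behavior: IDs climb
# monotonically and only wrap to FirstJobId once MaxJobId is
# exceeded -- older freed IDs are never back-filled on restart.

FIRST_JOB_ID = 1
MAX_JOB_ID = 67043328  # value from this site's configuration

def next_job_id(last_id, in_use):
    """Return the next job ID after last_id, skipping IDs in use."""
    candidate = last_id
    while True:
        candidate = candidate + 1 if candidate < MAX_JOB_ID else FIRST_JOB_ID
        if candidate not in in_use:
            return candidate
```

Under this policy, a restart would continue from the highest
previously assigned ID (e.g. 9xx,xxx) instead of jumping back into
the 6x,xxx,xxx range.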

Have you heard of this strange job-numbering behavior before?
Or is there something in our configuration that is causing it?
Do you know of any way we can keep this from happening?

Thanks for any help you can provide!
Comment 3 Dominik Bartkiewicz 2018-05-02 04:09:44 MDT
Hi

I suspect this is a duplicate of bug 4538.
In bug 4538, comment 10, Kolbeinn described a workaround.
This should be fixed in 17.11.1 and above.

Dominik
Comment 4 rl303f 2018-05-02 07:02:33 MDT
Thanks, Dominik.

Bug 4538 sure sounds like what we stumbled upon.  We'll likely
apply the patch or just upgrade to v17.11.5, as the workaround
would be too much trouble.

Thanks again for your help!