1447 – Job Id resetting

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	14.03.11
Hardware:	Linux Linux

Severity:	2 - High Impact
Assignee:	David Bigagli

Reported:	2015-02-09 21:08 MST by Akmal Madzlan
Modified:	2015-02-11 15:28 MST
Site:	DownUnder GeoSolutions

Description Akmal Madzlan 2015-02-09 21:08:16 MST

Hi,
Do you have any idea why this could happen?


[2015-01-30T19:13:31.303] debug2:
[2015-02-06T20:20:14.733] licenses: init_license=ioother total=10 used=0
[2015-02-06T20:20:14.747] licenses: init_license=ioh1 total=22 used=0
[2015-02-06T20:20:14.747] slurmctld version 14.03.11 started on cluster houston
[2015-02-06T20:20:15.091] error: ***********************************************
[2015-02-06T20:20:15.092] error: Can not recover assoc_mgr state, incompatible version, got 29290 need > 11 <= 6912
[2015-02-06T20:20:15.092] error: ***********************************************
[2015-02-06T20:20:15.175] error: read_slurm_conf: default partition not set.
[2015-02-06T20:20:15.300] error: Incomplete node data checkpoint file
[2015-02-06T20:20:15.300] Recovered state of 0 nodes
[2015-02-06T20:20:15.320] error: ***********************************************
[2015-02-06T20:20:15.320] error: Can not recover job state, incompatible version
[2015-02-06T20:20:15.321] error: ***********************************************
[2015-02-06T20:20:15.321] cons_res: select_p_node_init
[2015-02-06T20:20:15.321] cons_res: preparing for 18 partitions
[2015-02-06T20:20:15.505] licenses: update_license=ioother total=10 used=0
[2015-02-06T20:20:15.505] licenses: update_license=ioh1 total=22 used=0
[2015-02-06T20:20:15.876] error: ************************************************************
[2015-02-06T20:20:15.876] error: Can not recover reservation state, data version incompatible
[2015-02-06T20:20:15.876] error: ************************************************************
[2015-02-06T20:20:15.876] error: Incomplete trigger data checkpoint file
[2015-02-06T20:20:15.876] read_slurm_conf: backup_controller not specified.
[2015-02-06T20:20:15.876] Reinitializing job accounting state
[2015-02-06T20:20:15.876] Ending any jobs in accounting that were running when controller went down on
[2015-02-06T20:20:15.877] cons_res: select_p_reconfigure

Comment 1 David Bigagli 2015-02-10 06:36:13 MST

Hi,
What is happening is that Slurm reads the state files in the StateSaveLocation,
but those files appear to be corrupt, or perhaps the file system was full, since
the data read is in an unexpected format. The first 2 bytes encode the Slurm
version, which is 6912 (27 << 8) for your version, but a completely different
number, 29290, was read instead. Is there any problem with the file system?

The message "Can not recover assoc_mgr state" also indicates that slurmctld
started when the database daemon was down, but that should not be a problem.

David
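
As a rough illustration of the version check described above, the sketch below decodes the first two bytes of a state file and compares the result with the range printed in the log ("need > 11 <= 6912"). It assumes the 16-bit value is stored in network (big-endian) byte order; the function names and constants are hypothetical and are not taken from the Slurm source.

```python
import struct

# Bounds copied from the log line "got 29290 need > 11 <= 6912".
MIN_VERSION_EXCLUSIVE = 11
MAX_VERSION_INCLUSIVE = 27 << 8  # 6912, the value quoted for 14.03


def read_header_version(data: bytes) -> int:
    """Return the 16-bit value held in the first two bytes of `data`.

    Assumption: the value is packed big-endian (network byte order).
    """
    if len(data) < 2:
        raise ValueError("state data too short to contain a version header")
    (version,) = struct.unpack("!H", data[:2])
    return version


def is_compatible(data: bytes) -> bool:
    """True if the header value falls inside the range shown in the log."""
    version = read_header_version(data)
    return MIN_VERSION_EXCLUSIVE < version <= MAX_VERSION_INCLUSIVE
```

Under the big-endian assumption, a header of bytes 0x72 0x6A decodes to 29290, the value reported in the error, and is rejected by `is_compatible`.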

Comment 2 Akmal Madzlan 2015-02-10 11:47:13 MST

Thanks David.
Is it possible that restarting slurmctld caused the state files to become corrupted?

Akmal

Comment 3 Moe Jette 2015-02-10 12:11:41 MST

Not possible. New files are written each time, then the file names are changed. Worst case is you'll have old state information.
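
The write-then-rename approach mentioned here is a common pattern for saving state safely. The following is a minimal generic sketch of that pattern, not slurmctld's actual code; the `.new` suffix and the function name are assumptions made for illustration.

```python
import os


def save_state_atomically(path: str, payload: bytes) -> None:
    """Write `payload` to a new file, then rename it over `path`.

    A reader sees either the complete old file or the complete new file.
    """
    new_path = path + ".new"  # hypothetical temporary name
    with open(new_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push the data to disk before renaming
    os.replace(new_path, path)  # rename over the previous state file
```

If the process stops before the rename, the previous file at `path` is left untouched, which matches the "worst case is old state" behavior described above.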

Comment 4 Akmal Madzlan 2015-02-10 12:15:44 MST

Thanks Moe.
Another question: will this affect the accounting data in some way?

Akmal

Comment 5 Moe Jette 2015-02-10 12:30:11 MST

As I recall, all state was lost, so the job ID would start over, but accounting tracks each job based upon its ID plus submit time, so there should be no confusion between jobs.
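
To show why a reused job ID need not collide in accounting, here is a small sketch that keys records on the pair (job ID, submit time). The record layout and function name are hypothetical and do not reflect the real accounting schema.

```python
from typing import Dict, Tuple

# Key: (job_id, submit_time as Unix seconds). Hypothetical layout.
JobKey = Tuple[int, int]


def record_job(records: Dict[JobKey, str], job_id: int,
               submit_time: int, name: str) -> None:
    """Store a job under its (job_id, submit_time) key."""
    key = (job_id, submit_time)
    if key in records:
        raise KeyError(f"duplicate accounting record for {key}")
    records[key] = name
```

Two jobs that share an ID but were submitted at different times, for example before and after the counter reset, end up under different keys.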

Comment 6 Akmal Madzlan 2015-02-10 12:38:21 MST

Thanks Moe
I'll close this

Akmal

Comment 7 Akmal Madzlan 2015-02-10 18:52:10 MST

Hi,
Sorry to open this again.
Is there any way to make the job ID start from a certain number?

I would like the job ID to continue from the last ID before it got reset.

Akmal

Comment 8 Brian Christiansen 2015-02-11 02:56:16 MST

One way you can do this is by submitting a job that requests a specific job ID and then restarting the controller. The controller will read in the saved job ID and increment the IDs from there.

ex.
brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 111006
brian@compy:~/slurm/14.11/compy$ sbatch --jobid=120000 --wrap=hostname
Submitted batch job 120000

restart controller

brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 120002
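
The submission step in the example above could be scripted along these lines. This sketch only builds the `sbatch` argument list, using the `--jobid` and `--wrap` options shown in the example; the helper name and its sanity check are assumptions, and restarting the controller remains a separate manual step.

```python
from typing import List


def build_jobid_bump_command(last_job_id: int, target_job_id: int) -> List[str]:
    """Build the sbatch arguments for a job that requests `target_job_id`.

    Hypothetical helper: it refuses a target that would not move the
    job ID counter forward.
    """
    if target_job_id <= last_job_id:
        raise ValueError("target job ID must be greater than the last job ID")
    return ["sbatch", f"--jobid={target_job_id}", "--wrap=hostname"]
```

The returned list can be passed to `subprocess.run` on a host where `sbatch` is available.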

Comment 9 Akmal Madzlan 2015-02-11 15:28:14 MST

Thanks Brian

Akmal