Ticket 1447 - Job Id resetting
Summary: Job Id resetting
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.03.11
Hardware: Linux Linux
: 2 - High Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-02-09 21:08 MST by Akmal Madzlan
Modified: 2015-02-11 15:28 MST

See Also:
Site: DownUnder GeoSolutions


Description Akmal Madzlan 2015-02-09 21:08:16 MST
Hi,
Do you have any idea why this could happen?


[2015-01-30T19:13:31.303] debug2: [2015-02-06T20:20:14.733] licenses: init_license=ioother total=10 used=0
[2015-02-06T20:20:14.747] licenses: init_license=ioh1 total=22 used=0
[2015-02-06T20:20:14.747] slurmctld version 14.03.11 started on cluster houston
[2015-02-06T20:20:15.091] error: ***********************************************
[2015-02-06T20:20:15.092] error: Can not recover assoc_mgr state, incompatible version, got 29290 need > 11 <= 6912
[2015-02-06T20:20:15.092] error: ***********************************************
[2015-02-06T20:20:15.175] error: read_slurm_conf: default partition not set.
[2015-02-06T20:20:15.300] error: Incomplete node data checkpoint file
[2015-02-06T20:20:15.300] Recovered state of 0 nodes
[2015-02-06T20:20:15.320] error: ***********************************************
[2015-02-06T20:20:15.320] error: Can not recover job state, incompatible version
[2015-02-06T20:20:15.321] error: ***********************************************
[2015-02-06T20:20:15.321] cons_res: select_p_node_init
[2015-02-06T20:20:15.321] cons_res: preparing for 18 partitions
[2015-02-06T20:20:15.505] licenses: update_license=ioother total=10 used=0
[2015-02-06T20:20:15.505] licenses: update_license=ioh1 total=22 used=0
[2015-02-06T20:20:15.876] error: ************************************************************
[2015-02-06T20:20:15.876] error: Can not recover reservation state, data version incompatible
[2015-02-06T20:20:15.876] error: ************************************************************
[2015-02-06T20:20:15.876] error: Incomplete trigger data checkpoint file
[2015-02-06T20:20:15.876] read_slurm_conf: backup_controller not specified.
[2015-02-06T20:20:15.876] Reinitializing job accounting state
[2015-02-06T20:20:15.876] Ending any jobs in accounting that were running when controller went down on
[2015-02-06T20:20:15.877] cons_res: select_p_reconfigure
Comment 1 David Bigagli 2015-02-10 06:36:13 MST
Hi,
   what is happening is that Slurm reads the state files in the StateSaveLocation,
but those files appear to be corrupt, or perhaps the file system was full, since the
data read is in an unexpected format. The first 2 bytes encode the Slurm
version, which is 6912 (27 << 8) for your version, but a completely
different number, 29290, was read instead. Is there any problem with the file system?
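
The version check described above can be sketched roughly as follows. This is a minimal illustration, not Slurm's actual code: it assumes the two version bytes are stored in big-endian (network) byte order, and the names `EXPECTED_VERSION`, `read_state_version` and `is_compatible` are hypothetical. The accepted range is taken from the log message "need > 11 <= 6912".

```python
import struct

# Hypothetical constant: the value quoted in this ticket for 14.03 (27 << 8).
EXPECTED_VERSION = 27 << 8  # 6912


def read_state_version(header: bytes) -> int:
    """Decode the first two bytes of a state file as an unsigned 16-bit
    integer, assuming big-endian (network) byte order."""
    (version,) = struct.unpack("!H", header[:2])
    return version


def is_compatible(version: int, minimum: int = 11,
                  maximum: int = EXPECTED_VERSION) -> bool:
    """Apply the range from the log message: need > 11 and <= 6912."""
    return minimum < version <= maximum


good_header = struct.pack("!H", 27 << 8)
bad_header = struct.pack("!H", 29290)

print(read_state_version(good_header), is_compatible(read_state_version(good_header)))
print(read_state_version(bad_header), is_compatible(read_state_version(bad_header)))
print(bad_header)  # the raw bytes that decode to 29290
```

Under the big-endian assumption, 29290 corresponds to the bytes 0x72 0x6A, which are the ASCII characters "rj". That may hint that the file began with text rather than a packed version header, but the ticket does not confirm this.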

The message "Can not recover assoc_mgr state" also indicates that slurmctld started
while the database daemon was down, but that should not be a problem.

David
Comment 2 Akmal Madzlan 2015-02-10 11:47:13 MST
Thanks David.
Is it possible that restarting slurmctld caused the state files to become corrupted?

Akmal
Comment 3 Moe Jette 2015-02-10 12:11:41 MST
Not possible. New files are written each time, and then the file names are changed. The worst case is that you'll have old state information.
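
The write-then-rename pattern described above can be sketched as follows. This is a generic illustration of the technique, not Slurm's implementation: the function name `save_state_atomically`, the `.new` suffix and the `job_state` file name are assumptions made for the example.

```python
import os
import tempfile


def save_state_atomically(path: str, payload: bytes) -> None:
    """Write the new state to a separate file, then rename it over the old
    one. A reader sees either the complete old file or the complete new
    file, never a partially written one."""
    new_path = path + ".new"  # hypothetical temporary name
    with open(new_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before the rename
    os.replace(new_path, path)  # atomic when both names are on one file system


with tempfile.TemporaryDirectory() as state_dir:
    state_file = os.path.join(state_dir, "job_state")
    save_state_atomically(state_file, b"generation 1")
    save_state_atomically(state_file, b"generation 2")
    with open(state_file, "rb") as f:
        print(f.read())
```

If the process is interrupted before the rename, the previous file is left untouched, which matches the "worst case is old state information" behavior described above. The pattern cannot protect against a failing or full file system underneath it.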

Comment 4 Akmal Madzlan 2015-02-10 12:15:44 MST
Thanks Moe.
Another question: is this going to affect the accounting data in some way?

Akmal
Comment 5 Moe Jette 2015-02-10 12:30:11 MST
As I recall, all state was lost, so the job IDs would start over. Accounting tracks each job by its ID plus its submit time, so the jobs should not be confused with one another.
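
The idea of identifying a job by its ID plus its submit time can be illustrated with a small sketch. This is only a model of the principle, not the accounting database schema: the `records` dictionary, the `record_job` helper, the job ID and the timestamps are all made up for the example.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for accounting records,
# keyed by (job_id, submit_time) rather than by job_id alone.
records = {}


def record_job(job_id: int, submit_time: datetime, label: str) -> None:
    """Store a job under a composite key so a reused ID does not collide."""
    records[(job_id, submit_time)] = label


# The same job ID is issued twice: once before the reset, once after.
record_job(1001, datetime(2015, 1, 20, 9, 0), "submitted before the reset")
record_job(1001, datetime(2015, 2, 7, 14, 30), "submitted after the reset")

for (job_id, submit_time), label in sorted(records.items()):
    print(job_id, submit_time.isoformat(), label)
```

Both entries survive because the submit times differ, even though the job ID is the same.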

Comment 6 Akmal Madzlan 2015-02-10 12:38:21 MST
Thanks Moe
I'll close this

Akmal
Comment 7 Akmal Madzlan 2015-02-10 18:52:10 MST
Hi,
Sorry to open this again.
Is there any way to make the job ID start from a certain number?

I would like the job IDs to continue from the last ID used before the reset.

Akmal
Comment 8 Brian Christiansen 2015-02-11 02:56:16 MST
One way you can do this is by submitting a job that requests a specific job ID and then restarting the controller. The controller will read in the saved job ID and increment the IDs from there.

ex.
brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 111006
brian@compy:~/slurm/14.11/compy$ sbatch --jobid=120000 --wrap=hostname
Submitted batch job 120000

restart controller

brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 120002
Comment 9 Akmal Madzlan 2015-02-11 15:28:14 MST
Thanks Brian

Akmal