Hi,

Do you have any idea why this could happen?

[2015-01-30T19:13:31.303] debug2:
[2015-02-06T20:20:14.733] licenses: init_license=ioother total=10 used=0
[2015-02-06T20:20:14.747] licenses: init_license=ioh1 total=22 used=0
[2015-02-06T20:20:14.747] slurmctld version 14.03.11 started on cluster houston
[2015-02-06T20:20:15.091] error: ***********************************************
[2015-02-06T20:20:15.092] error: Can not recover assoc_mgr state, incompatible version, got 29290 need > 11 <= 6912
[2015-02-06T20:20:15.092] error: ***********************************************
[2015-02-06T20:20:15.175] error: read_slurm_conf: default partition not set.
[2015-02-06T20:20:15.300] error: Incomplete node data checkpoint file
[2015-02-06T20:20:15.300] Recovered state of 0 nodes
[2015-02-06T20:20:15.320] error: ***********************************************
[2015-02-06T20:20:15.320] error: Can not recover job state, incompatible version
[2015-02-06T20:20:15.321] error: ***********************************************
[2015-02-06T20:20:15.321] cons_res: select_p_node_init
[2015-02-06T20:20:15.321] cons_res: preparing for 18 partitions
[2015-02-06T20:20:15.505] licenses: update_license=ioother total=10 used=0
[2015-02-06T20:20:15.505] licenses: update_license=ioh1 total=22 used=0
[2015-02-06T20:20:15.876] error: ************************************************************
[2015-02-06T20:20:15.876] error: Can not recover reservation state, data version incompatible
[2015-02-06T20:20:15.876] error: ************************************************************
[2015-02-06T20:20:15.876] error: Incomplete trigger data checkpoint file
[2015-02-06T20:20:15.876] read_slurm_conf: backup_controller not specified.
[2015-02-06T20:20:15.876] Reinitializing job accounting state
[2015-02-06T20:20:15.876] Ending any jobs in accounting that were running when controller went down on
[2015-02-06T20:20:15.877] cons_res: select_p_reconfigure
Hi,

What is happening is that Slurm reads the state files in the StateSaveLocation, but those files appear to be corrupt (or perhaps the file system was full), since the data read is in an unexpected format. The first 2 bytes encode the Slurm version, which is 6912 (27 << 8) for your version, but instead a completely different number was read: 29290. Is there any problem with the file system? The message "Can not recover assoc_mgr state" could also indicate that slurmctld started while the database daemon was down, but that should not be a problem.

David
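To illustrate the arithmetic in the comment above, here is a minimal sketch of decoding a 2-byte big-endian version field from the start of a state file. This is not Slurm's actual on-disk reading code, just an illustration of how 6912 and 29290 relate to the raw bytes; notably, 29290 is 0x726A, the ASCII bytes "rj", which would be consistent with the file containing text rather than binary state data.

```python
import struct

def decode_version(header: bytes) -> int:
    """Interpret the first two bytes as a big-endian unsigned 16-bit int.

    Illustrative only; the real Slurm state-file layout may differ.
    """
    (version,) = struct.unpack(">H", header[:2])
    return version

# Expected value for this release: 27 << 8 == 6912 (bytes 0x1B 0x00).
print(decode_version(bytes([27, 0])))   # 6912

# The value actually read, 29290, is 0x726A -- ASCII "rj".
print(decode_version(b"rj"))            # 29290
```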
Thanks David.

Is it possible that restarting slurmctld caused the state files to become corrupted?

Akmal
Not possible. New files are written each time, then the file names are changed. Worst case is that you'll have old state information.
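The pattern Moe describes can be sketched as write-then-rename: the new state goes to a temporary file first and is then atomically renamed over the old one, so a crash mid-write leaves the previous state intact. This is a generic illustration of the technique, not Slurm's actual code; the file names and helper are hypothetical.

```python
import os
import tempfile

def save_state(path: str, data: bytes) -> None:
    """Write state atomically: new file first, then rename over the old one."""
    dirname = os.path.dirname(path) or "."
    # The temp file must live in the same directory: rename is only
    # atomic when it does not cross file systems.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure data hits disk before the rename
        os.replace(tmp, path)     # atomic replacement on POSIX
    except BaseException:
        os.unlink(tmp)            # clean up the partial file on failure
        raise
```

With this scheme, a reader always sees either the complete old file or the complete new one, which is why the worst case after a crash is stale (not corrupt) state.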
Thanks Moe.

Another question: is this going to affect the accounting data in some way?

Akmal
As I recall, all state was lost, so the job IDs would start over; but accounting tracks each job based upon its ID plus submit time, so there should be no confusion between jobs.
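Moe's point can be illustrated with a composite key: if accounting records are keyed by (job ID, submit time) rather than job ID alone, a reused ID after a state reset does not collide with the earlier job's record. The dictionary and values below are purely illustrative, not Slurm's accounting schema.

```python
# Hypothetical accounting store keyed by (job_id, submit_time).
records: dict[tuple[int, int], str] = {}

def record_job(job_id: int, submit_time: int, name: str) -> None:
    # The composite key keeps a reused job ID distinct from the original.
    records[(job_id, submit_time)] = name

record_job(100, 1423000000, "job before state loss")
record_job(100, 1423600000, "job after IDs restarted")

print(len(records))  # 2 -- same ID, two distinct accounting records
```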
Thanks Moe. I'll close this.

Akmal
Hi,

Sorry to open this again. Is there any way to make the job IDs start from a certain number? I would like the job IDs to continue from the last ID before they were reset.

Akmal
One way you can do this is by submitting a job that requests a specific job ID and then restarting the controller. The controller will read in the saved job ID and increment IDs from there. For example:

brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 111006
brian@compy:~/slurm/14.11/compy$ sbatch --jobid=120000 --wrap=hostname
Submitted batch job 120000

(restart the controller)

brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 120002
Thanks Brian.

Akmal