| Summary: | Job Id resetting | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.03.11 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | ||
|
Description
Akmal Madzlan
2015-02-09 21:08:16 MST
Hi,
what is happening is that Slurm reads the state files in the StateSaveLocation, but those files appear to be corrupt (or perhaps the file system was full), since the data read is in an unexpected format. The first 2 bytes encode the Slurm version, which is 6912 (27 << 8) for your version, but a completely different number, 29290, was read instead. Is there any problem with the file system? The message "Can not recover assoc_mgr state" also indicates that slurmctld started while the database daemon was down, but that should not be a problem.

David

Thanks David.
Is it possible that restarting slurmctld caused the state files to become corrupted?

Akmal

Not possible. New files are written each time, then the file names are changed. Worst case is you'll have old state information.

Thanks Moe.
Another question: is this going to affect the accounting data in some way?

Akmal

As I recall, all state was lost, so the job ID would start over, but the accounting tracks each job based upon its ID plus submit time, so there should be no confusion between the jobs.

Thanks Moe,
I'll close this.

Akmal

Hi,
Sorry to open this again. Is there any way to make the job ID start from a certain number?
I would like the job ID to continue from the last ID before it got reset.

Akmal

One way you can do this is by submitting a job that requests a specific job ID and then restarting the controller. The controller will read in the saved job ID and increment the IDs from there.

ex.
brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 111006
brian@compy:~/slurm/14.11/compy$ sbatch --jobid=120000 --wrap=hostname
Submitted batch job 120000
restart controller
brian@compy:~/slurm/14.11/compy$ sbatch --wrap=hostname
Submitted batch job 120002

Thanks Brian

Akmal |
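The header check David describes can be illustrated with a small shell sketch. It assumes, as stated in the thread, that the first 2 bytes of the file hold the version, and additionally assumes they are stored as an unsigned 16-bit big-endian integer. The function names `read_header` and `check_header` are hypothetical helpers for illustration, not part of Slurm.

```shell
#!/bin/sh
# Sketch only: assumes the version is stored as an unsigned 16-bit
# big-endian integer in the first two bytes of the file.

# Expected header value for the version discussed above: 27 << 8 = 6912.
EXPECTED_VERSION=$((27 << 8))

# read_header FILE: print the first two bytes of FILE as one integer.
read_header() {
    # od prints the two bytes as unsigned decimals, e.g. " 27 0".
    set -- $(od -An -tu1 -N2 "$1")
    [ $# -eq 2 ] || return 1        # missing file or fewer than two bytes
    echo $(( $1 * 256 + $2 ))
}

# check_header FILE: succeed only if the header equals EXPECTED_VERSION.
check_header() {
    value=$(read_header "$1") || return 1
    [ "$value" -eq "$EXPECTED_VERSION" ]
}
```

Running `read_header` on a copy of a state file (not the live one) would print the decoded value, which can then be compared with 6912. Which file under StateSaveLocation to inspect is site-specific and not stated in the thread.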
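The reason a restart should not corrupt state files ("new files written each time, then file names changed") is the general write-new-then-rename pattern. The following is a minimal shell sketch of that pattern only, not Slurm's actual implementation; the helper name `save_state` and the `.new` suffix are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: generic write-new-then-rename, not Slurm's actual code.

# save_state FILE: write stdin to FILE.new, then rename it over FILE.
save_state() {
    target=$1
    new="$target.new"               # hypothetical temporary name
    # Write the complete new contents to a separate file first.
    cat > "$new" || { rm -f "$new"; return 1; }
    # Rename over the target. On the same file system this is a single
    # rename, so a reader sees either the old file or the new one.
    mv -f "$new" "$target"
}
```

If the writer is interrupted before the rename, the previous file is left untouched, which matches the "worst case is old state information" remark above.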