Created attachment 5772 [details] Slurm config file

We have a severe issue with JobIds after MaxJobId was reached some weeks ago. This is causing our pipeline automation to fail, as it depends on the status of older jobs. In slurm.conf we have MaxJobId=67108863. It seems the JobId rolls over again and again at intervals of a few days:

sacct -j 1 -D -o JobID,Submit,Start,End
       JobID              Submit               Start                 End
------------ ------------------- ------------------- -------------------
1            2017-09-13T09:03:32 2017-09-13T09:03:32 2017-09-14T09:03:34
1            2017-12-06T02:14:55 2017-12-06T02:14:55 2017-12-06T02:14:57
1.batch      2017-12-06T02:14:55 2017-12-06T02:14:55 2017-12-06T02:14:57
1            2017-12-11T19:14:38 2017-12-11T19:14:48 2017-12-11T20:18:53
1.batch      2017-12-11T19:14:48 2017-12-11T19:14:48 2017-12-11T20:18:53
1            2017-12-16T01:57:53 2017-12-16T07:23:29 2017-12-16T07:23:30
1.batch      2017-12-16T07:23:29 2017-12-16T07:23:29 2017-12-16T07:23:30

Here we can see the JobId jump from 290051 to 67080368 for some unknown reason, which looks like a bug to us:

       JobID              Submit
------------ -------------------
290040       2017-12-15T15:29:04
290041       2017-12-15T15:29:04
290042       2017-12-15T15:29:21
290043       2017-12-15T15:29:21
290044       2017-12-15T15:29:21
290045       2017-12-15T15:29:21
290046       2017-12-15T15:29:21
290047       2017-12-15T15:29:21
290048       2017-12-15T15:29:21
290049       2017-12-15T15:29:21
290050       2017-12-15T15:29:21
290051       2017-12-15T15:29:21
67080368     2017-12-15T15:33:35
67080369     2017-12-15T15:34:05
67080370     2017-12-15T15:34:05
67080371     2017-12-15T15:34:06
67080372     2017-12-15T15:34:06
67080373     2017-12-15T15:34:06
67080374     2017-12-15T15:34:06
67080375     2017-12-15T15:34:06
67080376     2017-12-15T15:34:06
67080377     2017-12-15T15:34:06
67080378     2017-12-15T15:34:06

Later it exceeds MaxJobId and starts again at JobId 1, as expected:

       JobID              Submit
------------ -------------------
67108853     2017-12-16T01:57:52
67108854     2017-12-16T01:57:52
67108855     2017-12-16T01:57:52
67108856     2017-12-16T01:57:52
67108857     2017-12-16T01:57:52
67108858     2017-12-16T01:57:52
67108859     2017-12-16T01:57:52
67108860     2017-12-16T01:57:52
67108861     2017-12-16T01:57:53
67108862     2017-12-16T01:57:53
1            2017-12-16T01:57:53
2            2017-12-16T01:57:53
3            2017-12-16T01:57:54
4            2017-12-16T01:57:54
5            2017-12-16T01:57:54
6            2017-12-16T01:57:54
7            2017-12-16T01:58:16
8            2017-12-16T01:58:16
9            2017-12-16T01:58:16
10           2017-12-16T01:58:16

I will upload the sacct output (sacct -S 12/15-00:00 -o Submit,JobID > slurm-sacct.txt) and the slurmctld.log covering the timespan where this issue can be seen, along with slurm.conf.

Our quick assumption is that this happens each time slurmctld is restarted; we restarted the service at 2017-12-15T15:33:27 as part of adding new nodes to Slurm. The JobIds then seem to jump back to the 67xxxxxx range (this could be related to older jobs still running with JobIds in the 67xxxxxx range).
Created attachment 5773 [details] slurmctld log file
Created attachment 5774 [details] sacct output file
Hey Kolbeinn,

I'm able to reproduce the situation you are seeing. Basically it happens on a restart when the large job IDs are still in the system. When restarting, Slurm sets its next job ID to hand out to the highest job ID in the system + 1. So on your restart it started near the top of the range again and then rolled over again. I'm looking into a possible solution to prevent this from happening and will let you know what we find.

Just as a side note: jobs in the database are unique even though the JobIds may roll over. They can be distinguished by their submission time. By default sacct will only show the most recent job, but the duplicates can be displayed using the --duplicates/-D option.

Thanks,
Brian

e.g.

brian@lappy:~/slurm/17.02/lappy$ scontrol show config | grep MaxJobId
MaxJobId                = 67043328
brian@lappy:~/slurm/17.02/lappy$ squeue
   JOBID PARTITION  NAME  USER ST  TIME CPUS NODELIST(REASON)
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043321
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043322
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043323
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043324
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043325
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043326
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043327
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 1
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 2
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 3
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 4
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 5
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 6
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
   JOBID PARTITION  NAME  USER ST  TIME CPUS NODELIST(REASON)
       1     debug  wrap brian  R  0:05    2 lappy2
       2     debug  wrap brian  R  0:05    2 lappy2
       3     debug  wrap brian  R  0:05    2 lappy3
       4     debug  wrap brian  R  0:05    2 lappy3
       5     debug  wrap brian  R  0:05    2 lappy3
       6     debug  wrap brian  R  0:02    2 lappy4
67043321     debug  wrap brian  R  0:11    2 lappy1
67043322     debug  wrap brian  R  0:08    2 lappy1
67043323     debug  wrap brian  R  0:08    2 lappy1
67043324     debug  wrap brian  R  0:08    2 lappy1
67043325     debug  wrap brian  R  0:08    2 lappy2
67043326     debug  wrap brian  R  0:08    2 lappy2
67043327     debug  wrap brian  R  0:05    2 lappy3
brian@lappy:~/slurm/17.02/lappy$ for i in `seq 2 7`; do scancel 6704332$i; done;
brian@lappy:~/slurm/17.02/lappy$ for i in `seq 1 3`; do scancel $i; done;
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
   JOBID PARTITION  NAME  USER ST  TIME CPUS NODELIST(REASON)
       4     debug  wrap brian  R  0:22    2 lappy3
       5     debug  wrap brian  R  0:22    2 lappy3
       6     debug  wrap brian  R  0:19    2 lappy4
67043321     debug  wrap brian  R  0:28    2 lappy1
brian@lappy:~/slurm/17.02/lappy$ scontrol show jobs | grep JobId
JobId=67043321 JobName=wrap
JobId=4 JobName=wrap
JobId=5 JobName=wrap
JobId=6 JobName=wrap
brian@lappy:~/slurm/17.02/lappy$ echo restart
restart
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043323
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043324
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043325
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043326
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043327
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 1
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 2
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 3
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 7
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 8
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
   JOBID PARTITION  NAME  USER ST  TIME CPUS NODELIST(REASON)
       1     debug  wrap brian  R  0:04    2 lappy1
       2     debug  wrap brian  R  0:04    2 lappy2
       3     debug  wrap brian  R  0:04    2 lappy2
       4     debug  wrap brian  R  1:25    2 lappy3
       5     debug  wrap brian  R  1:25    2 lappy3
       6     debug  wrap brian  R  1:22    2 lappy4
       7     debug  wrap brian  R  0:04    2 lappy2
       8     debug  wrap brian  R  0:04    2 lappy2
67043321     debug  wrap brian  R  1:31    2 lappy1
67043323     debug  wrap brian  R  0:10    2 lappy3
67043324     debug  wrap brian  R  0:07    2 lappy3
67043325     debug  wrap brian  R  0:07    2 lappy1
67043326     debug  wrap brian  R  0:07    2 lappy1
67043327     debug  wrap brian  R  0:04    2 lappy4
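The behavior Brian reproduces above can be condensed into a toy shell sketch of the pre-fix job-ID selection: on restart, the next ID handed out is the highest ID still in the system + 1, wrapping back to FirstJobId at MaxJobId. This is an illustration, not Slurm source code; the function name `next_job_id` and the exact wrap condition are assumptions based on the observed output.

```shell
# Toy sketch (not Slurm source) of the pre-fix next-job-id selection.
FIRST_JOB_ID=1
MAX_JOB_ID=67043328   # value from the `scontrol show config` output above

next_job_id() {       # $@ = job ids still present in the system
    max=0
    for id in "$@"; do
        [ "$id" -gt "$max" ] && max=$id
    done
    next=$((max + 1))
    # Wrap around once the ceiling is reached (assumed condition).
    [ "$next" -ge "$MAX_JOB_ID" ] && next=$FIRST_JOB_ID
    echo "$next"
}

# Jobs 4, 5, 6 and 67043321 survived the restart, so instead of continuing
# from the low range, the counter jumps straight back to the high range:
next_job_id 4 5 6 67043321    # prints 67043322
```

This mirrors why the restart at 15:33:27 made the site's JobIds jump from 290051 back up to 67080368: a handful of long-running jobs in the 67xxxxxx range pulled the counter back to the top.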
The following patch has been added in 17.11.1:

https://github.com/SchedMD/slurm/commit/7d83f77d64fcdf4ea2fb0670ffb8fdbba7f461a6

With this patch, the next JobId is no longer set to the highest JobId in the system; instead, Slurm continues where it left off before being restarted. If you would like, you can apply this patch to 17.02.

Let us know if you have any questions.

Thanks,
Brian
Hi Brian,

Many thanks for your quick response and findings. Since you have confirmed this bug only affects us when we restart slurmctld, we can stay calm :)

We are discussing whether to patch the current version, upgrade to 17.11.1, or possibly just cancel all jobs with the high IDs and upgrade later.

Cheers,
Kolbeinn
FYI, we used the following workaround:

- Set FirstJobId higher than the last submitted job
- Set MaxJobId lower than the old jobs still in the queue
- Restarted the slurmctld service

In our case:

FirstJobId=1000001
MaxJobId=60000000

Results:

[2017-12-19T14:15:41.443] _slurm_rpc_submit_batch_job JobId=327382 usec=4635
[2017-12-19T14:15:41.444] _slurm_rpc_submit_batch_job JobId=327383 usec=193
[2017-12-19T14:15:42.018] _slurm_rpc_submit_batch_job JobId=327384 usec=164
[2017-12-19T14:15:42.509] Terminate signal (SIGINT or SIGTERM) received
### SLURMCTLD RESTART ###
[2017-12-19T14:15:55.340] slurmctld version 17.02.7 started on cluster lhpc
[2017-12-19T14:15:57.776] _slurm_rpc_submit_batch_job JobId=1000001 usec=299
[2017-12-19T14:15:58.788] _slurm_rpc_submit_batch_job JobId=1000002 usec=271
[2017-12-19T14:15:59.364] _slurm_rpc_submit_batch_job JobId=1000003 usec=783
[2017-12-19T14:16:00.266] _slurm_rpc_submit_batch_job JobId=1000004 usec=271

We will upgrade to the patched version at our earliest convenience and change the JobId config back to normal.
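The workaround above only holds if the two new limits bracket the existing job-ID ranges correctly, so a quick sanity check before restarting slurmctld could look like the sketch below. The variable names are hypothetical; the concrete numbers are taken from the log excerpts in this ticket.

```shell
# Sanity-check the workaround values before restarting slurmctld (sketch).
# FirstJobId must sit above the last normally-submitted JobId, and
# MaxJobId must sit below the stray high-range jobs still in the queue.
LAST_SUBMITTED=327384      # last JobId before the restart (from the log)
LOWEST_OLD_JOB=67080368    # lowest stray high-range JobId still around
FIRST_JOB_ID=1000001       # proposed FirstJobId
MAX_JOB_ID=60000000        # proposed MaxJobId

ok=1
[ "$FIRST_JOB_ID" -gt "$LAST_SUBMITTED" ] || ok=0   # new ids start above old ones
[ "$MAX_JOB_ID" -lt "$LOWEST_OLD_JOB" ] || ok=0     # ceiling stays below stray jobs
[ "$FIRST_JOB_ID" -lt "$MAX_JOB_ID" ] || ok=0       # usable range is non-empty
echo "workaround values ok: $ok"    # prints "workaround values ok: 1"
```

With these checks passing, a restart can no longer jump the counter back into the stray 67xxxxxx range, which matches the clean JobId=1000001 onward seen in the log.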
Thanks for sharing. Good idea.
I will be on vacation until Friday, May 11. Please contact helpdesk@decode.is if the matter is urgent.
*** Ticket 5116 has been marked as a duplicate of this ticket. ***