Created attachment 5773 [details]
slurmctld log file
Created attachment 5774 [details]
sacct output file
Hey Kolbeinn,
I'm able to reproduce the situation that you are seeing.
Basically, it happens after a restart when the large job IDs are still in the system. On restart, Slurm sets the next job ID it will hand out to the highest job ID currently in the system plus one. So after your restart the counter started near the top of the range again and then rolled over again.
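The restart behavior described above can be sketched roughly as follows. This is an illustrative shell function, not Slurm source; the function name is hypothetical, and the exact boundary handling in Slurm may differ, but the arithmetic matches the transcript below (MaxJobId=67043328, wrap after 67043327):

```shell
#!/bin/sh
# Illustrative sketch of the pre-patch behavior: on restart, the next
# job ID resumes from the highest job ID still in the recovered state
# plus one, wrapping to FirstJobId once it reaches MaxJobId.

next_job_id_after_restart() {
    # $1 = highest job ID still present after recovery
    # $2 = FirstJobId, $3 = MaxJobId (as in slurm.conf)
    first_job_id=$2
    max_job_id=$3
    next=$(( $1 + 1 ))
    if [ "$next" -ge "$max_job_id" ]; then
        next=$first_job_id
    fi
    echo "$next"
}

# With job 67043320 still queued, submission continues from 67043321:
next_job_id_after_restart 67043320 1 67043328   # -> 67043321
# With job 67043327 still queued, the counter immediately wraps:
next_job_id_after_restart 67043327 1 67043328   # -> 1
```

This is why old high-numbered jobs surviving a restart pull the counter back up to the top of the range, after which it quickly rolls over.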
I'm looking into a possible solution to prevent it from happening and will let you know what we find.
Just as a side note: jobs in the database are unique even though the job IDs may roll over. They can be distinguished by their submission time. By default sacct will only show the most recent job, but the duplicates can be displayed with the --duplicates/-D option.
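For automation that needs to pick one record out of the duplicates, a sketch like the following could work. The sacct invocation in the comment is only illustrative; the filtering itself is plain shell and relies on ISO-8601 timestamps sorting correctly as strings:

```shell
#!/bin/sh
# Sketch: disambiguate a rolled job ID by keeping the record with the
# most recent Submit time from parsable sacct output, e.g.:
#
#   sacct -j 1 -D -n -P -o JobID,Submit | latest_submission
#
# sort -t'|' -k2 orders the 'JobID|Submit' lines chronologically
# (ISO-8601 sorts lexicographically) and tail -n1 keeps the newest.

latest_submission() {
    sort -t'|' -k2 | tail -n 1
}

# Example with the duplicate records of JobId=1 from this ticket:
printf '%s\n' \
    '1|2017-09-13T09:03:32' \
    '1|2017-12-16T01:57:53' \
    '1|2017-12-06T02:14:55' | latest_submission   # -> 1|2017-12-16T01:57:53
```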
Thanks,
Brian
e.g.
brian@lappy:~/slurm/17.02/lappy$ scontrol show config | grep MaxJobId
MaxJobId = 67043328
brian@lappy:~/slurm/17.02/lappy$ squeue
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043321
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043322
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043323
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043324
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043325
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043326
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043327
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 1
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 2
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 3
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 4
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 5
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 6
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
1 debug wrap brian R 0:05 2 lappy2
2 debug wrap brian R 0:05 2 lappy2
3 debug wrap brian R 0:05 2 lappy3
4 debug wrap brian R 0:05 2 lappy3
5 debug wrap brian R 0:05 2 lappy3
6 debug wrap brian R 0:02 2 lappy4
67043321 debug wrap brian R 0:11 2 lappy1
67043322 debug wrap brian R 0:08 2 lappy1
67043323 debug wrap brian R 0:08 2 lappy1
67043324 debug wrap brian R 0:08 2 lappy1
67043325 debug wrap brian R 0:08 2 lappy2
67043326 debug wrap brian R 0:08 2 lappy2
67043327 debug wrap brian R 0:05 2 lappy3
brian@lappy:~/slurm/17.02/lappy$ for i in `seq 2 7`; do scancel 6704332$i; done;
brian@lappy:~/slurm/17.02/lappy$ for i in `seq 1 3`; do scancel $i; done;
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
4 debug wrap brian R 0:22 2 lappy3
5 debug wrap brian R 0:22 2 lappy3
6 debug wrap brian R 0:19 2 lappy4
67043321 debug wrap brian R 0:28 2 lappy1
brian@lappy:~/slurm/17.02/lappy$ scontrol show jobs | grep JobId
JobId=67043321 JobName=wrap
JobId=4 JobName=wrap
JobId=5 JobName=wrap
JobId=6 JobName=wrap
brian@lappy:~/slurm/17.02/lappy$ echo restart
restart
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043323
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043324
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043325
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043326
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043327
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 1
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 2
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 3
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 7
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 8
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
1 debug wrap brian R 0:04 2 lappy1
2 debug wrap brian R 0:04 2 lappy2
3 debug wrap brian R 0:04 2 lappy2
4 debug wrap brian R 1:25 2 lappy3
5 debug wrap brian R 1:25 2 lappy3
6 debug wrap brian R 1:22 2 lappy4
7 debug wrap brian R 0:04 2 lappy2
8 debug wrap brian R 0:04 2 lappy2
67043321 debug wrap brian R 1:31 2 lappy1
67043323 debug wrap brian R 0:10 2 lappy3
67043324 debug wrap brian R 0:07 2 lappy3
67043325 debug wrap brian R 0:07 2 lappy1
67043326 debug wrap brian R 0:07 2 lappy1
67043327 debug wrap brian R 0:04 2 lappy4
The following patch has been added to 17.11.1: https://github.com/SchedMD/slurm/commit/7d83f77d64fcdf4ea2fb0670ffb8fdbba7f461a6

With the patch, the next job ID is no longer set to the highest job ID in the system; instead it continues where it left off before the restart. If you would like, you can patch this into 17.02. Let us know if you have any questions.

Thanks,
Brian

Hi Brian,

Many thanks for your quick response and findings. As you have confirmed that this bug only affects us when we restart slurmctld, we can stay calm :) We are discussing whether to patch the current version, upgrade to 17.11.1, or possibly just cancel all jobs with the high IDs and upgrade later.

Cheers,
Kolbeinn

FYI, we used the following workaround:
Set FirstJobId higher than the last submitted job
Set MaxJobId lower than the old jobs still in the queue
Restarted the slurmctld service

In our case:
FirstJobId=1000001
MaxJobId=60000000

Results:
[2017-12-19T14:15:41.443] _slurm_rpc_submit_batch_job JobId=327382 usec=4635
[2017-12-19T14:15:41.444] _slurm_rpc_submit_batch_job JobId=327383 usec=193
[2017-12-19T14:15:42.018] _slurm_rpc_submit_batch_job JobId=327384 usec=164
[2017-12-19T14:15:42.509] Terminate signal (SIGINT or SIGTERM) received
### SLURMCTLD RESTART ###
[2017-12-19T14:15:55.340] slurmctld version 17.02.7 started on cluster lhpc
[2017-12-19T14:15:57.776] _slurm_rpc_submit_batch_job JobId=1000001 usec=299
[2017-12-19T14:15:58.788] _slurm_rpc_submit_batch_job JobId=1000002 usec=271
[2017-12-19T14:15:59.364] _slurm_rpc_submit_batch_job JobId=1000003 usec=783
[2017-12-19T14:16:00.266] _slurm_rpc_submit_batch_job JobId=1000004 usec=271

We will upgrade to the patched version at our earliest convenience and change the JobId config back to normal.

Thanks for sharing. Good idea.

I'm on vacation until May 11. Please contact helpdesk@decode.is if urgent.
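The workaround above amounts to the following slurm.conf fragment (values taken from this ticket; the ranges must be adjusted to your own cluster's job ID state):

```
# Hand out new IDs above the last normally-submitted job (~327384 here)...
FirstJobId=1000001
# ...and keep MaxJobId below the old 67xxxxxx jobs still in the queue,
# so the counter cannot resume from them after a slurmctld restart.
MaxJobId=60000000
```

Note that this only sidesteps the restart behavior until the patched version is deployed; once the old high-numbered jobs have drained, the values can be reverted.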
*** Ticket 5116 has been marked as a duplicate of this ticket. *** |
Created attachment 5772 [details]
Slurm config file

We have a severe issue with JobIds after the MaxJobId was reached some weeks ago. This is causing our pipeline automation to fail, as it depends on the status of older jobs.

In slurm.conf we have MaxJobId=67108863.

It seems the JobId rolls over again and again at intervals of a few days:

sacct -j 1 -D -o JobID,Submit,Start,End
JobID        Submit              Start               End
------------ ------------------- ------------------- -------------------
1            2017-09-13T09:03:32 2017-09-13T09:03:32 2017-09-14T09:03:34
1            2017-12-06T02:14:55 2017-12-06T02:14:55 2017-12-06T02:14:57
1.batch      2017-12-06T02:14:55 2017-12-06T02:14:55 2017-12-06T02:14:57
1            2017-12-11T19:14:38 2017-12-11T19:14:48 2017-12-11T20:18:53
1.batch      2017-12-11T19:14:48 2017-12-11T19:14:48 2017-12-11T20:18:53
1            2017-12-16T01:57:53 2017-12-16T07:23:29 2017-12-16T07:23:30
1.batch      2017-12-16T07:23:29 2017-12-16T07:23:29 2017-12-16T07:23:30

Here we can see the JobId jumps from 290051 to 67080368 for some unknown reason, which seems like a bug to us:

JobID        Submit
------------ -------------------
290040       2017-12-15T15:29:04
290041       2017-12-15T15:29:04
290042       2017-12-15T15:29:21
290043       2017-12-15T15:29:21
290044       2017-12-15T15:29:21
290045       2017-12-15T15:29:21
290046       2017-12-15T15:29:21
290047       2017-12-15T15:29:21
290048       2017-12-15T15:29:21
290049       2017-12-15T15:29:21
290050       2017-12-15T15:29:21
290051       2017-12-15T15:29:21
67080368     2017-12-15T15:33:35
67080369     2017-12-15T15:34:05
67080370     2017-12-15T15:34:05
67080371     2017-12-15T15:34:06
67080372     2017-12-15T15:34:06
67080373     2017-12-15T15:34:06
67080374     2017-12-15T15:34:06
67080375     2017-12-15T15:34:06
67080376     2017-12-15T15:34:06
67080377     2017-12-15T15:34:06
67080378     2017-12-15T15:34:06

And later it exceeds the MaxJobId and starts at JobId 1 as expected:

JobID        Submit
------------ -------------------
67108853     2017-12-16T01:57:52
67108854     2017-12-16T01:57:52
67108855     2017-12-16T01:57:52
67108856     2017-12-16T01:57:52
67108857     2017-12-16T01:57:52
67108858     2017-12-16T01:57:52
67108859     2017-12-16T01:57:52
67108860     2017-12-16T01:57:52
67108861     2017-12-16T01:57:53
67108862     2017-12-16T01:57:53
1            2017-12-16T01:57:53
2            2017-12-16T01:57:53
3            2017-12-16T01:57:54
4            2017-12-16T01:57:54
5            2017-12-16T01:57:54
6            2017-12-16T01:57:54
7            2017-12-16T01:58:16
8            2017-12-16T01:58:16
9            2017-12-16T01:58:16
10           2017-12-16T01:58:16

I will upload the sacct output (sacct -S 12/15-00:00 -o Submit,JobID > slurm-sacct.txt) and slurmctld.log covering the timespan where we can see this issue, including slurm.conf.

A quick assumption is that this seems to happen each time slurmctld is restarted; we restarted the service at 2017-12-15T15:33:27 as part of adding new nodes to Slurm. So the JobIds seem to jump back to the 67xxxxxx range (probably related to older jobs still running in the 67xxxxxx JobId range).