Created attachment 5773 [details]
slurmctld log file
Created attachment 5774 [details]
sacct output file
Hey Kolbeinn,
I'm able to reproduce the situation that you are seeing.
Basically, it happens after a restart when the large job IDs are still in the system. On restart, Slurm sets the next job ID it will hand out to the highest job ID currently in the system plus one. So after your restart the counter started near the top of the range again and then rolled over again.
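The restart behavior described above can be sketched roughly as follows. This is an illustrative shell function, not Slurm source; the function name is hypothetical, and the exact boundary handling in Slurm may differ, but the arithmetic matches the transcript below (MaxJobId=67043328, wrap after 67043327):

```shell
#!/bin/sh
# Illustrative sketch of the pre-patch behavior: on restart, the next
# job ID resumes from the highest job ID still in the recovered state
# plus one, wrapping to FirstJobId once it reaches MaxJobId.

next_job_id_after_restart() {
    # $1 = highest job ID still present after recovery
    # $2 = FirstJobId, $3 = MaxJobId (as in slurm.conf)
    first_job_id=$2
    max_job_id=$3
    next=$(( $1 + 1 ))
    if [ "$next" -ge "$max_job_id" ]; then
        next=$first_job_id
    fi
    echo "$next"
}

# With job 67043320 still queued, submission continues from 67043321:
next_job_id_after_restart 67043320 1 67043328   # -> 67043321
# With job 67043327 still queued, the counter immediately wraps:
next_job_id_after_restart 67043327 1 67043328   # -> 1
```

This is why old high-numbered jobs surviving a restart pull the counter back up to the top of the range, after which it quickly rolls over.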
I'm looking into a possible solution to prevent it from happening and will let you know what we find.
Just as a side note: jobs in the database are unique even though the job IDs may roll over. They can be distinguished by their submission time. By default sacct will only show the most recent job, but the duplicates can be displayed with the --duplicates/-D option.
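For automation that needs to pick one record out of the duplicates, a sketch like the following could work. The sacct invocation in the comment is only illustrative; the filtering itself is plain shell and relies on ISO-8601 timestamps sorting correctly as strings:

```shell
#!/bin/sh
# Sketch: disambiguate a rolled job ID by keeping the record with the
# most recent Submit time from parsable sacct output, e.g.:
#
#   sacct -j 1 -D -n -P -o JobID,Submit | latest_submission
#
# sort -t'|' -k2 orders the 'JobID|Submit' lines chronologically
# (ISO-8601 sorts lexicographically) and tail -n1 keeps the newest.

latest_submission() {
    sort -t'|' -k2 | tail -n 1
}

# Example with the duplicate records of JobId=1 from this ticket:
printf '%s\n' \
    '1|2017-09-13T09:03:32' \
    '1|2017-12-16T01:57:53' \
    '1|2017-12-06T02:14:55' | latest_submission   # -> 1|2017-12-16T01:57:53
```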
Thanks,
Brian
e.g.
brian@lappy:~/slurm/17.02/lappy$ scontrol show config | grep MaxJobId
MaxJobId = 67043328
brian@lappy:~/slurm/17.02/lappy$ squeue
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043321
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043322
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043323
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043324
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043325
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043326
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043327
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 1
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 2
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 3
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 4
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 5
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 6
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
1 debug wrap brian R 0:05 2 lappy2
2 debug wrap brian R 0:05 2 lappy2
3 debug wrap brian R 0:05 2 lappy3
4 debug wrap brian R 0:05 2 lappy3
5 debug wrap brian R 0:05 2 lappy3
6 debug wrap brian R 0:02 2 lappy4
67043321 debug wrap brian R 0:11 2 lappy1
67043322 debug wrap brian R 0:08 2 lappy1
67043323 debug wrap brian R 0:08 2 lappy1
67043324 debug wrap brian R 0:08 2 lappy1
67043325 debug wrap brian R 0:08 2 lappy2
67043326 debug wrap brian R 0:08 2 lappy2
67043327 debug wrap brian R 0:05 2 lappy3
brian@lappy:~/slurm/17.02/lappy$ for i in `seq 2 7`; do scancel 6704332$i; done;
brian@lappy:~/slurm/17.02/lappy$ for i in `seq 1 3`; do scancel $i; done;
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
4 debug wrap brian R 0:22 2 lappy3
5 debug wrap brian R 0:22 2 lappy3
6 debug wrap brian R 0:19 2 lappy4
67043321 debug wrap brian R 0:28 2 lappy1
brian@lappy:~/slurm/17.02/lappy$ scontrol show jobs | grep JobId
JobId=67043321 JobName=wrap
JobId=4 JobName=wrap
JobId=5 JobName=wrap
JobId=6 JobName=wrap
brian@lappy:~/slurm/17.02/lappy$ echo restart
restart
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043323
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043324
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043325
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043326
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 67043327
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 1
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 2
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 3
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 7
brian@lappy:~/slurm/17.02/lappy$ sbatch -n1 --wrap="sleep 9999999999999"
Submitted batch job 8
brian@lappy:~/slurm/17.02/lappy$ squeue -Si
JOBID PARTITION NAME USER ST TIME CPUS NODELIST(REASON)
1 debug wrap brian R 0:04 2 lappy1
2 debug wrap brian R 0:04 2 lappy2
3 debug wrap brian R 0:04 2 lappy2
4 debug wrap brian R 1:25 2 lappy3
5 debug wrap brian R 1:25 2 lappy3
6 debug wrap brian R 1:22 2 lappy4
7 debug wrap brian R 0:04 2 lappy2
8 debug wrap brian R 0:04 2 lappy2
67043321 debug wrap brian R 1:31 2 lappy1
67043323 debug wrap brian R 0:10 2 lappy3
67043324 debug wrap brian R 0:07 2 lappy3
67043325 debug wrap brian R 0:07 2 lappy1
67043326 debug wrap brian R 0:07 2 lappy1
67043327 debug wrap brian R 0:04 2 lappy4
The following patch has been added to 17.11.1: https://github.com/SchedMD/slurm/commit/7d83f77d64fcdf4ea2fb0670ffb8fdbba7f461a6

With the patch, the next job ID is no longer set to the highest job ID in the system; instead it continues where it left off before the restart. If you would like, you can patch this into 17.02. Let us know if you have any questions.

Thanks,
Brian

Hi Brian,

Many thanks for your quick response and findings. As you have confirmed that this bug only affects us when we restart slurmctld, we can stay calm :) We are discussing whether to patch the current version, upgrade to 17.11.1, or possibly just cancel all jobs with the high IDs and upgrade later.

Cheers,
Kolbeinn

FYI, we used the following workaround:
Set FirstJobId higher than the last submitted job
Set MaxJobId lower than the old jobs still in the queue
Restarted the slurmctld service

In our case:
FirstJobId=1000001
MaxJobId=60000000

Results:
[2017-12-19T14:15:41.443] _slurm_rpc_submit_batch_job JobId=327382 usec=4635
[2017-12-19T14:15:41.444] _slurm_rpc_submit_batch_job JobId=327383 usec=193
[2017-12-19T14:15:42.018] _slurm_rpc_submit_batch_job JobId=327384 usec=164
[2017-12-19T14:15:42.509] Terminate signal (SIGINT or SIGTERM) received
### SLURMCTLD RESTART ###
[2017-12-19T14:15:55.340] slurmctld version 17.02.7 started on cluster lhpc
[2017-12-19T14:15:57.776] _slurm_rpc_submit_batch_job JobId=1000001 usec=299
[2017-12-19T14:15:58.788] _slurm_rpc_submit_batch_job JobId=1000002 usec=271
[2017-12-19T14:15:59.364] _slurm_rpc_submit_batch_job JobId=1000003 usec=783
[2017-12-19T14:16:00.266] _slurm_rpc_submit_batch_job JobId=1000004 usec=271

We will upgrade to the patched version at our earliest convenience and change the JobId config back to normal.

Thanks for sharing. Good idea.

I'm on vacation until May 11. Please contact helpdesk@decode.is if urgent.
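The workaround above amounts to the following slurm.conf fragment (values taken from this ticket; the ranges must be adjusted to your own cluster's job ID state):

```
# Hand out new IDs above the last normally-submitted job (~327384 here)...
FirstJobId=1000001
# ...and keep MaxJobId below the old 67xxxxxx jobs still in the queue,
# so the counter cannot resume from them after a slurmctld restart.
MaxJobId=60000000
```

Note that this only sidesteps the restart behavior until the patched version is deployed; once the old high-numbered jobs have drained, the values can be reverted.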
*** Ticket 5116 has been marked as a duplicate of this ticket. *** |
Created attachment 5772 [details]
Slurm config file

We have a severe issue with JobIds after the MaxJobId was reached some weeks ago. This is causing our pipeline automation to fail, as it depends on the status of older jobs.

In slurm.conf we have MaxJobId=67108863.

It seems the JobId rolls over again and again at intervals of a few days:

sacct -j 1 -D -o JobID,Submit,Start,End
JobID        Submit              Start               End
------------ ------------------- ------------------- -------------------
1            2017-09-13T09:03:32 2017-09-13T09:03:32 2017-09-14T09:03:34
1            2017-12-06T02:14:55 2017-12-06T02:14:55 2017-12-06T02:14:57
1.batch      2017-12-06T02:14:55 2017-12-06T02:14:55 2017-12-06T02:14:57
1            2017-12-11T19:14:38 2017-12-11T19:14:48 2017-12-11T20:18:53
1.batch      2017-12-11T19:14:48 2017-12-11T19:14:48 2017-12-11T20:18:53
1            2017-12-16T01:57:53 2017-12-16T07:23:29 2017-12-16T07:23:30
1.batch      2017-12-16T07:23:29 2017-12-16T07:23:29 2017-12-16T07:23:30

Here we can see the JobId jumps from 290051 to 67080368 for some unknown reason, which seems like a bug to us:

JobID        Submit
------------ -------------------
290040       2017-12-15T15:29:04
290041       2017-12-15T15:29:04
290042       2017-12-15T15:29:21
290043       2017-12-15T15:29:21
290044       2017-12-15T15:29:21
290045       2017-12-15T15:29:21
290046       2017-12-15T15:29:21
290047       2017-12-15T15:29:21
290048       2017-12-15T15:29:21
290049       2017-12-15T15:29:21
290050       2017-12-15T15:29:21
290051       2017-12-15T15:29:21
67080368     2017-12-15T15:33:35
67080369     2017-12-15T15:34:05
67080370     2017-12-15T15:34:05
67080371     2017-12-15T15:34:06
67080372     2017-12-15T15:34:06
67080373     2017-12-15T15:34:06
67080374     2017-12-15T15:34:06
67080375     2017-12-15T15:34:06
67080376     2017-12-15T15:34:06
67080377     2017-12-15T15:34:06
67080378     2017-12-15T15:34:06

And later it exceeds the MaxJobId and starts at JobId 1 as expected:

JobID        Submit
------------ -------------------
67108853     2017-12-16T01:57:52
67108854     2017-12-16T01:57:52
67108855     2017-12-16T01:57:52
67108856     2017-12-16T01:57:52
67108857     2017-12-16T01:57:52
67108858     2017-12-16T01:57:52
67108859     2017-12-16T01:57:52
67108860     2017-12-16T01:57:52
67108861     2017-12-16T01:57:53
67108862     2017-12-16T01:57:53
1            2017-12-16T01:57:53
2            2017-12-16T01:57:53
3            2017-12-16T01:57:54
4            2017-12-16T01:57:54
5            2017-12-16T01:57:54
6            2017-12-16T01:57:54
7            2017-12-16T01:58:16
8            2017-12-16T01:58:16
9            2017-12-16T01:58:16
10           2017-12-16T01:58:16

I will upload the sacct output (sacct -S 12/15-00:00 -o Submit,JobID > slurm-sacct.txt) and slurmctld.log covering the timespan where we can see this issue, including slurm.conf.

A quick assumption is that this seems to happen each time slurmctld is restarted; we restarted the service at 2017-12-15T15:33:27 as part of adding new nodes to Slurm. So the JobIds seem to jump back to the 67xxxxxx range (probably related to older jobs still running in the 67xxxxxx JobId range).