Ticket 6107

Summary: array jobs refuse to die
Product: Slurm Reporter: Charles Zaruba <cfzaruba>
Component: slurmctldAssignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.3   
Hardware: Linux   
OS: Linux   
Site: Iowa State Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Charles Zaruba 2018-11-28 08:48:12 MST
We would like some advice on dealing with some stubborn array jobs. We restarted slurmctld and the jobs persist. Don't want to poke too much more and risk job loss, corruption of state information, or database issues without some advice.

The offending jobs

          455382_1       mem meryl_gs ken.frie RH       0:00      1 (JobHoldMaxRequeue)
          455382_2       mem meryl_gs ken.frie RH       0:00      1 (JobHoldMaxRequeue)
          455382_3       mem meryl_gs ken.frie RH       0:00      1 (JobHoldMaxRequeue)
          455382_4       mem meryl_gs ken.frie RH       0:00      1 (JobHoldMaxRequeue)
          455382_5       mem meryl_gs ken.frie RH       0:00      1 (JobHoldMaxRequeue)



The array jobs show invalid job_id even though they are still in queue


$scontrol show job 455382_1

slurm_load_jobs error: Invalid job id specified

slurmctld.log of 455382_1

$ cat /var/log/slurm/slurmctld.log | grep 455382_1
[2018-11-19T17:25:05.132] backfill: Started JobID=455382_1(455389) in mem on sn-mem-2
[2018-11-19T17:25:05.164] backfill: Started JobID=455382_10(455398) in mem on sn-mem-1
[2018-11-19T17:25:05.168] backfill: Started JobID=455382_11(455399) in mem on sn-mem-1
[2018-11-19T17:25:05.170] backfill: Started JobID=455382_12(455382) in mem on sn-mem-1
[2018-11-19T17:34:57.856] _job_complete: JobID=455382_12(455382) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:34:57.856] _job_complete: JobID=455382_12(455382) State=0x8003 NodeCnt=1 done
[2018-11-19T17:35:30.826] _job_complete: JobID=455382_11(455399) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:35:30.826] _job_complete: JobID=455382_11(455399) State=0x8003 NodeCnt=1 done
[2018-11-19T17:36:12.531] _job_complete: JobID=455382_10(455398) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:36:12.531] _job_complete: JobID=455382_10(455398) State=0x8003 NodeCnt=1 done
[2018-11-19T17:41:11.860] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:43:14.889] backfill: Started JobID=455382_1(455389) in mem on sn-mem-0
[2018-11-19T17:43:14.890] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:43:14.890] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:43:14.890] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:43:15.347] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:45:17.072] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-1 #CPUs=8 Partition=mem
[2018-11-19T17:45:17.072] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:45:17.073] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:45:17.073] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:45:17.132] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:47:24.910] backfill: Started JobID=455382_1(455389) in mem on sn-mem-4
[2018-11-19T17:47:24.911] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:47:24.912] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:47:24.912] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:47:25.234] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:49:26.791] backfill: Started JobID=455382_1(455389) in mem on sn-mem-1
[2018-11-19T17:49:26.792] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:49:26.792] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:49:26.792] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:49:27.084] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:51:28.720] backfill: Started JobID=455382_1(455389) in mem on sn-mem-0
[2018-11-19T17:51:28.721] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:51:28.721] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:51:28.721] _job_complete: JobID=455382_1(455389) State=0x8800 NodeCnt=1 done
[2018-11-19T17:51:29.008] Requeuing JobID=455382_1(455389) State=0x800 NodeCnt=0
[2018-11-19T20:15:30.303] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:15:30.303] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:15:30.303] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:16:00.909] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:16:00.909] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:16:00.909] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:17:28.834] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:17:28.834] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:17:28.834] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:18:36.515] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:18:36.515] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:18:36.516] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:20:39.302] _slurm_rpc_update_job: complete JobId=455382_1 uid=1727000230 usec=733
[2018-11-21T16:03:01.787] _slurm_rpc_update_job: complete JobId=455382_1 uid=1727000142 usec=765
[2018-11-21T16:03:02.572] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-3 #CPUs=8 Partition=mem
[2018-11-21T16:03:02.573] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:03:02.573] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:03:02.573] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:03:02.631] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:05:18.137] backfill: Started JobID=455382_1(455389) in mem on sn-mem-3
[2018-11-21T16:05:18.139] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:05:18.139] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:05:18.139] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:05:18.233] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:07:30.904] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-3 #CPUs=8 Partition=mem
[2018-11-21T16:07:30.907] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:07:30.907] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:07:30.907] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:07:30.935] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:09:49.459] backfill: Started JobID=455382_1(455389) in mem on sn-mem-3
[2018-11-21T16:09:49.460] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:09:49.460] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:09:49.460] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:09:49.554] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:11:50.177] backfill: Started JobID=455382_1(455389) in mem on sn-mem-3
[2018-11-21T16:11:50.178] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:11:50.178] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:11:50.178] _job_complete: JobID=455382_1(455389) State=0x8800 NodeCnt=1 done
[2018-11-21T16:11:50.386] Requeuing JobID=455382_1(455389) State=0x800 NodeCnt=0
[2018-11-27T12:33:23.036] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T12:33:23.036] job_str_signal(5): invalid job id 455382_1
[2018-11-27T12:33:23.036] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T12:44:22.302] _slurm_rpc_update_job: complete JobId=455382_1 uid=0 usec=409
[2018-11-27T12:44:22.992] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-4 #CPUs=8 Partition=mem
[2018-11-27T12:44:22.993] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:44:22.993] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:44:22.993] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:44:23.016] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:44:30.965] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T12:44:30.965] job_str_signal(5): invalid job id 455382_1
[2018-11-27T12:44:30.965] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T12:46:42.732] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-4 #CPUs=8 Partition=mem
[2018-11-27T12:46:42.734] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:46:42.734] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:46:42.734] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:46:42.766] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:48:45.909] backfill: Started JobID=455382_1(455389) in mem on sn-mem-4
[2018-11-27T12:48:45.911] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:48:45.911] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:48:45.911] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:48:46.003] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:51:16.263] backfill: Started JobID=455382_1(455389) in mem on sn-mem-4
[2018-11-27T12:51:16.265] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:51:16.265] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:51:16.265] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:51:16.357] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:53:42.769] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-4 #CPUs=8 Partition=mem
[2018-11-27T12:53:42.771] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:53:42.771] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:53:42.771] _job_complete: JobID=455382_1(455389) State=0x8800 NodeCnt=1 done
[2018-11-27T12:53:42.794] Requeuing JobID=455382_1(455389) State=0x800 NodeCnt=0
[2018-11-27T12:56:50.367] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T12:56:50.368] job_str_signal(5): invalid job id 455382_1
[2018-11-27T12:56:50.368] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T14:48:46.379] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T14:48:46.380] job_str_signal(5): invalid job id 455382_1
[2018-11-27T14:48:46.380] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified


But the unique job id assigned to the array job does show the correct job details. These array jobs don't have any dependencies. 

$scontrol show job 455389
JobId=455389 ArrayJobId=455382 ArrayTaskId=1 JobName=meryl_gsi_ceres_canu18
   UserId=ken.friesen(1727000230) GroupId=ken.friesen(1727000230) MCS_label=N/A
   Priority=0 Nice=0 Account=scinet QOS=memlimit
   JobState=REQUEUE_HOLD Reason=JobHoldMaxRequeue Dependency=(null)
   Requeue=1 Restarts=16 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2018-11-27T12:53:42 EligibleTime=2018-11-27T12:55:43
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-11-27T12:53:42
   Partition=mem AllocNode:Sid=sn-mem-0:21210
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=sn-mem-4
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=17G,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=2176M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./meryl-count.sh 0
   WorkDir=/project/cricket_gen/gsi_banded/asmb/gsi_ceres_canu18/trimming/0-mercounts
   StdErr=/project/cricket_gen/gsi_banded/asmb/gsi_ceres_canu18/trimming/0-mercounts/meryl-count.455382_1.out
   StdIn=/dev/null
   StdOut=/project/cricket_gen/gsi_banded/asmb/gsi_ceres_canu18/trimming/0-mercounts/meryl-count.455382_1.out
   Power=


scancel for job 455389 and 455382_1 fail with 




[2018-11-27T14:48:46.379] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T14:48:46.380] job_str_signal(5): invalid job id 455382_1
[2018-11-27T14:48:46.380] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified

[2018-11-27T14:43:38.928] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455389 uid 0
[2018-11-27T14:43:38.928] job_str_signal(3): invalid job id 455389
[2018-11-27T14:43:38.928] _slurm_rpc_kill_job: job_str_signal() job 455389 sig 9 returned Invalid job id specified
Comment 2 Jason Booth 2018-11-28 09:06:52 MST
Hi Charles,

This issue looks like we have already addressed it 17.11.4 and commented about it in the issue: Bug 4833 Comment 3

You should plan to upgrade soon as there has been a number of fixes since 17.11.3 to 17.11.12 including one for this issue.

Quoted below

>The jobs almost certainly got into this state due a bug which was fixed in commit (in 17.11.4):

>https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45709b86989112442487f856a

>The job array hashes are getting corrupt and the job can't be found to be killed. Also the meta job of the array is being cleared from the slurmctld's memory -- it should stay in memory until all of the tasks are gone -- so the job can't be requeued (because it holds the job script for the array). If you can apply this patch sooner than 17.11.4 that would help prevent future incidents from happening.

>As for resolving the issue now, if you restart the slurmctld the jobs should go away. Since the meta job of the array should be out of slurmctld's memory, on restart the slurmctld will see that the job's job script is missing and will mark them as failed and will clean them up.

>You'll see a messages like:
>error: Script for job 15152 lost, state set to FAILED

>Can you try restarting the controller?
Comment 3 Jason Booth 2018-11-29 14:14:55 MST
Hi Charles,
 
 I am closing this issue out for now as info given comment #2. Please feel free to re-open if you run into more issues around this after upgrading and restarting the controller.

-Jason