| Summary: | array jobs refuse to die | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Charles Zaruba <cfzaruba> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 17.11.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Iowa State | | |
Hi Charles,

It looks like we have already addressed this issue in 17.11.4 and commented on it in Bug 4833, comment 3. You should plan to upgrade soon, as there have been a number of fixes between 17.11.3 and 17.11.12, including one for this issue. Quoted below:

> The jobs almost certainly got into this state due to a bug which was fixed in commit (in 17.11.4):
> https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45709b86989112442487f856a
> The job array hashes are getting corrupted, so the job can't be found to be killed. The meta job of the array is also being cleared from the slurmctld's memory -- it should stay in memory until all of the tasks are gone -- so the job can't be requeued (because it holds the job script for the array). If you can apply this patch sooner than 17.11.4, that would help prevent future incidents.
>
> As for resolving the issue now: if you restart the slurmctld, the jobs should go away. Since the meta job of the array should be out of slurmctld's memory, on restart the slurmctld will see that each job's script is missing, mark the jobs as FAILED, and clean them up. You'll see messages like:
>
> error: Script for job 15152 lost, state set to FAILED
>
> Can you try restarting the controller?

Hi Charles,

I am closing this issue out for now as info given (see comment #2). Please feel free to re-open if you run into more issues around this after upgrading and restarting the controller.

-Jason
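The recovery steps above amount to a controller restart plus a log check; a minimal sketch of that procedure follows (the systemd unit name and log path are assumptions for a typical install -- adjust for your site):

```shell
# Restart the controller; on startup it will notice the stuck array tasks
# have no batch script left, mark them FAILED, and purge them from the queue.
sudo systemctl restart slurmctld

# Confirm the cleanup happened -- one line is logged per purged job.
grep "lost, state set to FAILED" /var/log/slurm/slurmctld.log

# Verify the array is gone from the queue.
squeue -j 455382
```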
We would like some advice on dealing with some stubborn array jobs. We restarted slurmctld and the jobs persist. We don't want to poke much more and risk job loss, corruption of state information, or database issues without some guidance.

The offending jobs:

455382_1 mem meryl_gs ken.frie RH 0:00 1 (JobHoldMaxRequeue)
455382_2 mem meryl_gs ken.frie RH 0:00 1 (JobHoldMaxRequeue)
455382_3 mem meryl_gs ken.frie RH 0:00 1 (JobHoldMaxRequeue)
455382_4 mem meryl_gs ken.frie RH 0:00 1 (JobHoldMaxRequeue)
455382_5 mem meryl_gs ken.frie RH 0:00 1 (JobHoldMaxRequeue)

The array tasks report an invalid job id even though they are still in the queue:

$ scontrol show job 455382_1
slurm_load_jobs error: Invalid job id specified

slurmctld.log entries for 455382_1:

$ cat /var/log/slurm/slurmctld.log | grep 455382_1
[2018-11-19T17:25:05.132] backfill: Started JobID=455382_1(455389) in mem on sn-mem-2
[2018-11-19T17:25:05.164] backfill: Started JobID=455382_10(455398) in mem on sn-mem-1
[2018-11-19T17:25:05.168] backfill: Started JobID=455382_11(455399) in mem on sn-mem-1
[2018-11-19T17:25:05.170] backfill: Started JobID=455382_12(455382) in mem on sn-mem-1
[2018-11-19T17:34:57.856] _job_complete: JobID=455382_12(455382) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:34:57.856] _job_complete: JobID=455382_12(455382) State=0x8003 NodeCnt=1 done
[2018-11-19T17:35:30.826] _job_complete: JobID=455382_11(455399) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:35:30.826] _job_complete: JobID=455382_11(455399) State=0x8003 NodeCnt=1 done
[2018-11-19T17:36:12.531] _job_complete: JobID=455382_10(455398) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:36:12.531] _job_complete: JobID=455382_10(455398) State=0x8003 NodeCnt=1 done
[2018-11-19T17:41:11.860] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:43:14.889] backfill: Started JobID=455382_1(455389) in mem on sn-mem-0
[2018-11-19T17:43:14.890] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:43:14.890] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:43:14.890] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:43:15.347] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:45:17.072] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-1 #CPUs=8 Partition=mem
[2018-11-19T17:45:17.072] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:45:17.073] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:45:17.073] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:45:17.132] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:47:24.910] backfill: Started JobID=455382_1(455389) in mem on sn-mem-4
[2018-11-19T17:47:24.911] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:47:24.912] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:47:24.912] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:47:25.234] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:49:26.791] backfill: Started JobID=455382_1(455389) in mem on sn-mem-1
[2018-11-19T17:49:26.792] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:49:26.792] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:49:26.792] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-19T17:49:27.084] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-19T17:51:28.720] backfill: Started JobID=455382_1(455389) in mem on sn-mem-0
[2018-11-19T17:51:28.721] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-19T17:51:28.721] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-19T17:51:28.721] _job_complete: JobID=455382_1(455389) State=0x8800 NodeCnt=1 done
[2018-11-19T17:51:29.008] Requeuing JobID=455382_1(455389) State=0x800 NodeCnt=0
[2018-11-19T20:15:30.303] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:15:30.303] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:15:30.303] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:16:00.909] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:16:00.909] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:16:00.909] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:17:28.834] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:17:28.834] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:17:28.834] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:18:36.515] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 1727000230
[2018-11-19T20:18:36.515] job_str_signal(5): invalid job id 455382_1
[2018-11-19T20:18:36.516] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-19T20:20:39.302] _slurm_rpc_update_job: complete JobId=455382_1 uid=1727000230 usec=733
[2018-11-21T16:03:01.787] _slurm_rpc_update_job: complete JobId=455382_1 uid=1727000142 usec=765
[2018-11-21T16:03:02.572] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-3 #CPUs=8 Partition=mem
[2018-11-21T16:03:02.573] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:03:02.573] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:03:02.573] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:03:02.631] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:05:18.137] backfill: Started JobID=455382_1(455389) in mem on sn-mem-3
[2018-11-21T16:05:18.139] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:05:18.139] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:05:18.139] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:05:18.233] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:07:30.904] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-3 #CPUs=8 Partition=mem
[2018-11-21T16:07:30.907] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:07:30.907] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:07:30.907] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:07:30.935] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:09:49.459] backfill: Started JobID=455382_1(455389) in mem on sn-mem-3
[2018-11-21T16:09:49.460] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:09:49.460] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:09:49.460] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-21T16:09:49.554] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-21T16:11:50.177] backfill: Started JobID=455382_1(455389) in mem on sn-mem-3
[2018-11-21T16:11:50.178] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-21T16:11:50.178] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-21T16:11:50.178] _job_complete: JobID=455382_1(455389) State=0x8800 NodeCnt=1 done
[2018-11-21T16:11:50.386] Requeuing JobID=455382_1(455389) State=0x800 NodeCnt=0
[2018-11-27T12:33:23.036] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T12:33:23.036] job_str_signal(5): invalid job id 455382_1
[2018-11-27T12:33:23.036] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T12:44:22.302] _slurm_rpc_update_job: complete JobId=455382_1 uid=0 usec=409
[2018-11-27T12:44:22.992] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-4 #CPUs=8 Partition=mem
[2018-11-27T12:44:22.993] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:44:22.993] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:44:22.993] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:44:23.016] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:44:30.965] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T12:44:30.965] job_str_signal(5): invalid job id 455382_1
[2018-11-27T12:44:30.965] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T12:46:42.732] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-4 #CPUs=8 Partition=mem
[2018-11-27T12:46:42.734] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:46:42.734] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:46:42.734] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:46:42.766] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:48:45.909] backfill: Started JobID=455382_1(455389) in mem on sn-mem-4
[2018-11-27T12:48:45.911] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:48:45.911] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:48:45.911] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:48:46.003] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:51:16.263] backfill: Started JobID=455382_1(455389) in mem on sn-mem-4
[2018-11-27T12:51:16.265] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:51:16.265] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:51:16.265] _job_complete: JobID=455382_1(455389) State=0x8000 NodeCnt=1 done
[2018-11-27T12:51:16.357] Requeuing JobID=455382_1(455389) State=0x0 NodeCnt=0
[2018-11-27T12:53:42.769] sched: Allocate JobID=455382_1(455389) NodeList=sn-mem-4 #CPUs=8 Partition=mem
[2018-11-27T12:53:42.771] _job_complete: JobID=455382_1(455389) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-11-27T12:53:42.771] _job_complete: requeue JobID=455382_1(455389) State=0x8000 NodeCnt=1 per user/system request
[2018-11-27T12:53:42.771] _job_complete: JobID=455382_1(455389) State=0x8800 NodeCnt=1 done
[2018-11-27T12:53:42.794] Requeuing JobID=455382_1(455389) State=0x800 NodeCnt=0
[2018-11-27T12:56:50.367] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T12:56:50.368] job_str_signal(5): invalid job id 455382_1
[2018-11-27T12:56:50.368] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T14:48:46.379] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T14:48:46.380] job_str_signal(5): invalid job id 455382_1
[2018-11-27T14:48:46.380] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified

However, the unique job id assigned to the array task does show the correct job details. These array jobs don't have any dependencies.
$ scontrol show job 455389
JobId=455389 ArrayJobId=455382 ArrayTaskId=1 JobName=meryl_gsi_ceres_canu18
   UserId=ken.friesen(1727000230) GroupId=ken.friesen(1727000230) MCS_label=N/A
   Priority=0 Nice=0 Account=scinet QOS=memlimit
   JobState=REQUEUE_HOLD Reason=JobHoldMaxRequeue Dependency=(null)
   Requeue=1 Restarts=16 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2018-11-27T12:53:42 EligibleTime=2018-11-27T12:55:43
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-11-27T12:53:42
   Partition=mem AllocNode:Sid=sn-mem-0:21210
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) BatchHost=sn-mem-4
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=17G,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=2176M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./meryl-count.sh 0
   WorkDir=/project/cricket_gen/gsi_banded/asmb/gsi_ceres_canu18/trimming/0-mercounts
   StdErr=/project/cricket_gen/gsi_banded/asmb/gsi_ceres_canu18/trimming/0-mercounts/meryl-count.455382_1.out
   StdIn=/dev/null
   StdOut=/project/cricket_gen/gsi_banded/asmb/gsi_ceres_canu18/trimming/0-mercounts/meryl-count.455382_1.out
   Power=

scancel for jobs 455389 and 455382_1 fails with:

[2018-11-27T14:48:46.379] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455382_1 uid 0
[2018-11-27T14:48:46.380] job_str_signal(5): invalid job id 455382_1
[2018-11-27T14:48:46.380] _slurm_rpc_kill_job: job_str_signal() job 455382_1 sig 9 returned Invalid job id specified
[2018-11-27T14:43:38.928] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 455389 uid 0
[2018-11-27T14:43:38.928] job_str_signal(3): invalid job id 455389
[2018-11-27T14:43:38.928] _slurm_rpc_kill_job: job_str_signal() job 455389 sig 9 returned Invalid job id specified
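For anyone hitting the same symptom, the mismatch above is between the two ways an array task can be addressed; a quick probe of both forms (job ids taken from this report) looks like:

```shell
# Array-task form <array_id>_<task_id>: fails here because the controller's
# job array hash is corrupt, so the task cannot be looked up by this name.
scontrol show job 455382_1

# Raw JobId form: the underlying job record still resolves.
scontrol show job 455389

# scancel accepts either form; with the corrupt hash, both were rejected
# here with "Invalid job id specified", which is why the patched upgrade
# plus a controller restart is the actual fix rather than more signals.
scancel 455382_1
scancel 455389
```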