Ticket 2455

Summary: removing a completing MPI job from squeue
Product: Slurm    Reporter: Hadrian <hxd58>
Component: Scheduling    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 14.11.5   
Hardware: Linux   
OS: Linux   
Site: Case
Attachments: slurm.conf
comp001 log

Description Hadrian 2016-02-18 01:46:16 MST
An MPI job has been stuck in the CG (completing) state for a while, and none of its processes are running on the compute nodes. Is there a way to forcefully remove the job from the queue?

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES CPU NODELIST(REASON)
            186018     batch   mpirun   sam238 CG       0:20      7  12 comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t

$ scontrol show job 186018
JobId=186018 JobName=mpirun
   UserId=sam238(715780) GroupId=hpcadmin(10076)
   Priority=5557 Nice=0 Account=arc_staff QOS=normal
   JobState=COMPLETING Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=139:0
   RunTime=00:00:20 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2016-02-03T14:08:05 EligibleTime=2016-02-03T14:08:05
   StartTime=2016-02-03T14:08:05 EndTime=2016-02-03T14:08:25
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=hpctest:43576
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t
   BatchHost=comp001t
   NumNodes=7 NumCPUs=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1900M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=mpirun
   WorkDir=/home/sam238/test/mpi
   Power= SICP=0

slurm log:
[2016-02-11T10:33:45.147] Resending TERMINATE_JOB request JobId=186018 Nodelist=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t
[2016-02-11T10:34:45.250] Resending TERMINATE_JOB request JobId=186018 Nodelist=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t
[2016-02-11T10:35:56.077] error: job_update_cpu_cnt: cpu_cnt underflow on job_id 186018
[2016-02-15T12:33:40.314] _sync_nodes_to_comp_job: Job 186018 in completing state
[2016-02-15T12:57:02.041] Resending TERMINATE_JOB request JobId=186018 Nodelist=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t
[2016-02-15T13:44:12.004] _sync_nodes_to_comp_job: Job 186018 in completing state
[2016-02-16T16:15:41.990] error: Security violation, JOB_CANCEL RPC for jobID 186018 from uid 661646
[2016-02-16T16:15:41.990] error: _slurm_rpc_kill_job2: job_str_signal() job 186018 sig 9 returned Access/permission denied
Comment 1 Tim Wickberg 2016-02-18 01:57:20 MST
Can you grab the slurmd.log from the affected compute nodes? Those should at least hint at why the job is still completing.

The last two lines there indicate that a non-admin, non-job-owner attempted to 'scancel' the job. I'm guessing someone noticed it was stuck and tried to cancel it?
Comment 2 Hadrian 2016-02-18 02:06:58 MST
The failed job removal was my feeble attempt to remove the job as a regular user.

slurm.log from the compute nodes:

xdsh comp002,comp010,comp012-comp016,comp125 grep 186018 /var/log/slurm/slurm.log                  
comp013: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp013: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 58843)
comp013: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp013: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x020000
comp013: [2016-02-03T14:08:05.677] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 25297)
comp013: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 32442)
comp013: [2016-02-03T14:08:05.798] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.809] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 50641)
comp013: [2016-02-03T14:08:05.828] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.839] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 11666)
comp013: [2016-02-03T14:08:05.858] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.870] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 23503)
comp013: [2016-02-03T14:08:05.890] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.900] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 43157)
comp013: [2016-02-03T14:08:05.918] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.929] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 53687)
comp013: [2016-02-03T14:08:05.946] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.957] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 9179)
comp013: [2016-02-03T14:08:05.978] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.987] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 23749)
comp013: [2016-02-03T14:08:06.009] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.016] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 40656)
comp013: [2016-02-03T14:08:06.034] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.049] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 35770)
comp013: [2016-02-03T14:08:06.068] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.097] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.626] [186018.1] done with job
comp013: [2016-02-03T14:08:06.752] [186018.2] done with job
comp013: [2016-02-03T14:08:06.852] [186018.5] done with job
comp013: [2016-02-03T14:08:06.880] [186018.3] done with job
comp013: [2016-02-03T14:08:25.088] [186018.12] done with job
comp013: [2016-02-03T14:08:25.092] [186018.7] done with job
comp013: [2016-02-03T14:08:25.093] [186018.8] done with job
comp013: [2016-02-03T14:08:25.098] [186018.6] done with job
comp013: [2016-02-03T14:08:25.099] [186018.4] done with job
comp013: [2016-02-03T14:08:25.100] [186018.10] done with job
comp013: [2016-02-03T14:08:25.101] [186018.11] done with job
comp013: [2016-02-05T15:17:15.608] Warning: revoke on job 186018 has no expiration
comp013: [2016-02-05T15:18:24.744] error: Error reading step 186018.0 memory limits
comp013: [2016-02-05T15:26:34.767] Warning: revoke on job 186018 has no expiration
comp013: [2016-02-05T15:26:36.990] error: Error reading step 186018.0 memory limits
comp014: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp014: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 13445)
comp014: [2016-02-03T14:08:05.632] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp014: [2016-02-03T14:08:05.632] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp014: [2016-02-03T14:08:05.676] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.752] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 11731)
comp014: [2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 59071)
comp014: [2016-02-03T14:08:05.796] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.809] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 25275)
comp014: [2016-02-03T14:08:05.824] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.838] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 19369)
comp014: [2016-02-03T14:08:05.854] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 62101)
comp014: [2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.896] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 37815)
comp014: [2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.924] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 53677)
comp014: [2016-02-03T14:08:05.939] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.953] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 54192)
comp014: [2016-02-03T14:08:05.968] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.981] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 7575)
comp014: [2016-02-03T14:08:05.997] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.010] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 15053)
comp014: [2016-02-03T14:08:06.027] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.039] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 62399)
comp014: [2016-02-03T14:08:06.054] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.082] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.664] [186018.1] done with job
comp014: [2016-02-03T14:08:06.756] [186018.2] done with job
comp014: [2016-02-03T14:08:06.852] [186018.5] done with job
comp014: [2016-02-03T14:08:06.886] [186018.3] done with job
comp014: [2016-02-03T14:08:25.075] [186018.4] done with job
comp014: [2016-02-03T14:08:25.087] [186018.11] done with job
comp014: [2016-02-03T14:08:25.090] [186018.7] done with job
comp014: [2016-02-03T14:08:25.096] [186018.9] done with job
comp014: [2016-02-03T14:08:25.096] [186018.10] done with job
comp014: [2016-02-03T14:08:25.097] [186018.6] done with job
comp014: [2016-02-03T14:08:25.103] [186018.12] done with job
comp014: [2016-02-05T15:17:16.752] Warning: revoke on job 186018 has no expiration
comp014: [2016-02-05T15:18:24.745] error: Error reading step 186018.0 memory limits
comp014: [2016-02-05T15:26:34.758] Warning: revoke on job 186018 has no expiration
comp014: [2016-02-05T15:26:36.991] error: Error reading step 186018.0 memory limits
comp012: [2016-02-03T14:08:05.547] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp012: [2016-02-03T14:08:05.632] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 2447)
comp012: [2016-02-03T14:08:05.632] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp012: [2016-02-03T14:08:05.632] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp012: [2016-02-03T14:08:05.676] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.752] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 43945)
comp012: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 42971)
comp012: [2016-02-03T14:08:05.796] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.808] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 21473)
comp012: [2016-02-03T14:08:05.823] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.838] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 57541)
comp012: [2016-02-03T14:08:05.853] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.867] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 7389)
comp012: [2016-02-03T14:08:05.882] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.896] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 23271)
comp012: [2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.926] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 45961)
comp012: [2016-02-03T14:08:05.941] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.955] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 12685)
comp012: [2016-02-03T14:08:05.970] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.983] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 5355)
comp012: [2016-02-03T14:08:05.999] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.013] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 64902)
comp012: [2016-02-03T14:08:06.028] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.044] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 46043)
comp012: [2016-02-03T14:08:06.060] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.090] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.665] [186018.1] done with job
comp012: [2016-02-03T14:08:06.757] [186018.2] done with job
comp012: [2016-02-03T14:08:06.847] [186018.5] done with job
comp012: [2016-02-03T14:08:06.887] [186018.3] done with job
comp012: [2016-02-03T14:08:25.077] [186018.12] done with job
comp012: [2016-02-03T14:08:25.087] [186018.8] done with job
comp012: [2016-02-03T14:08:25.087] [186018.9] done with job
comp012: [2016-02-03T14:08:25.095] [186018.11] done with job
comp012: [2016-02-03T14:08:25.100] [186018.10] done with job
comp012: [2016-02-03T14:08:25.102] [186018.4] done with job
comp012: [2016-02-03T14:08:25.105] [186018.6] done with job
comp012: [2016-02-05T15:17:15.506] Warning: revoke on job 186018 has no expiration
comp012: [2016-02-05T15:18:24.746] error: Error reading step 186018.0 memory limits
comp012: [2016-02-05T15:26:34.789] Warning: revoke on job 186018 has no expiration
comp012: [2016-02-05T15:26:36.991] error: Error reading step 186018.0 memory limits
comp002: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp002: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 47503)
comp002: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp002: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp002: [2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 12744)
comp002: [2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 30179)
comp002: [2016-02-03T14:08:05.797] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.810] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 14300)
comp002: [2016-02-03T14:08:05.825] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.839] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 5269)
comp002: [2016-02-03T14:08:05.854] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 18304)
comp002: [2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.896] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 29586)
comp002: [2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.925] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 41637)
comp002: [2016-02-03T14:08:05.941] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.954] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 6087)
comp002: [2016-02-03T14:08:05.970] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.983] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 16523)
comp002: [2016-02-03T14:08:05.999] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.013] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 24241)
comp002: [2016-02-03T14:08:06.029] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.043] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 33251)
comp002: [2016-02-03T14:08:06.060] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.089] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.666] [186018.1] done with job
comp002: [2016-02-03T14:08:06.758] [186018.2] done with job
comp002: [2016-02-03T14:08:06.854] [186018.5] done with job
comp002: [2016-02-03T14:08:06.950] [186018.3] done with job
comp002: [2016-02-03T14:08:25.104] [186018.9] done with job
comp002: [2016-02-03T14:08:25.104] [186018.10] done with job
comp002: [2016-02-03T14:08:25.105] [186018.6] done with job
comp002: [2016-02-03T14:08:25.107] [186018.12] done with job
comp002: [2016-02-03T14:08:25.107] [186018.4] done with job
comp002: [2016-02-03T14:08:25.109] [186018.8] done with job
comp002: [2016-02-03T14:08:25.109] [186018.7] done with job
comp002: [2016-02-05T15:17:15.267] Warning: revoke on job 186018 has no expiration
comp002: [2016-02-05T15:18:24.726] error: Error reading step 186018.0 memory limits
comp002: [2016-02-05T15:26:34.994] Warning: revoke on job 186018 has no expiration
comp002: [2016-02-05T15:26:36.980] error: Error reading step 186018.0 memory limits
comp010: [2016-02-03T14:08:05.547] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp010: [2016-02-03T14:08:05.632] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 41910)
comp010: [2016-02-03T14:08:05.632] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp010: [2016-02-03T14:08:05.632] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp010: [2016-02-03T14:08:05.677] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.752] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 44971)
comp010: [2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 32934)
comp010: [2016-02-03T14:08:05.795] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.810] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 61321)
comp010: [2016-02-03T14:08:05.824] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.838] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 183)
comp010: [2016-02-03T14:08:05.853] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 63953)
comp010: [2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.896] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 17856)
comp010: [2016-02-03T14:08:05.911] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.926] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 48040)
comp010: [2016-02-03T14:08:05.941] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.955] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 34504)
comp010: [2016-02-03T14:08:05.970] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.985] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 55465)
comp010: [2016-02-03T14:08:06.000] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.015] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 48343)
comp010: [2016-02-03T14:08:06.030] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.050] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 36006)
comp010: [2016-02-03T14:08:06.066] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.095] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.664] [186018.1] done with job
comp010: [2016-02-03T14:08:06.757] [186018.2] done with job
comp010: [2016-02-03T14:08:06.853] [186018.5] done with job
comp010: [2016-02-03T14:08:06.885] [186018.3] done with job
comp010: [2016-02-03T14:08:25.088] [186018.9] done with job
comp010: [2016-02-03T14:08:25.097] [186018.12] done with job
comp010: [2016-02-03T14:08:25.099] [186018.11] done with job
comp010: [2016-02-03T14:08:25.102] [186018.10] done with job
comp010: [2016-02-03T14:08:25.102] [186018.7] done with job
comp010: [2016-02-03T14:08:25.104] [186018.4] done with job
comp010: [2016-02-03T14:08:25.104] [186018.8] done with job
comp010: [2016-02-05T15:17:15.774] Warning: revoke on job 186018 has no expiration
comp010: [2016-02-05T15:18:24.734] error: Error reading step 186018.0 memory limits
comp010: [2016-02-05T15:26:34.760] Warning: revoke on job 186018 has no expiration
comp010: [2016-02-05T15:26:36.980] error: Error reading step 186018.0 memory limits
comp015: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp015: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 6331)
comp015: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp015: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp015: [2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 65460)
comp015: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 26766)
comp015: [2016-02-03T14:08:05.795] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.808] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 56782)
comp015: [2016-02-03T14:08:05.823] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.836] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 25281)
comp015: [2016-02-03T14:08:05.851] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.865] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 42180)
comp015: [2016-02-03T14:08:05.880] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.894] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 38786)
comp015: [2016-02-03T14:08:05.909] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.923] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 57803)
comp015: [2016-02-03T14:08:05.938] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.951] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 56241)
comp015: [2016-02-03T14:08:05.967] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.980] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 64433)
comp015: [2016-02-03T14:08:05.996] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.010] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 1950)
comp015: [2016-02-03T14:08:06.026] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.038] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 30094)
comp015: [2016-02-03T14:08:06.054] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.085] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.623] [186018.1] done with job
comp015: [2016-02-03T14:08:06.751] [186018.2] done with job
comp015: [2016-02-03T14:08:06.840] [186018.3] done with job
comp015: [2016-02-03T14:08:06.851] [186018.5] done with job
comp015: [2016-02-03T14:08:25.070] [186018.9] done with job
comp015: [2016-02-03T14:08:25.076] [186018.6] done with job
comp015: [2016-02-03T14:08:25.100] [186018.10] done with job
comp015: [2016-02-03T14:08:25.102] [186018.12] done with job
comp015: [2016-02-03T14:08:25.106] [186018.8] done with job
comp015: [2016-02-03T14:08:25.107] [186018.11] done with job
comp015: [2016-02-03T14:08:25.107] [186018.7] done with job
comp015: [2016-02-05T15:17:16.310] Warning: revoke on job 186018 has no expiration
comp015: [2016-02-05T15:18:24.720] error: Error reading step 186018.0 memory limits
comp015: [2016-02-05T15:26:34.770] Warning: revoke on job 186018 has no expiration
comp015: [2016-02-05T15:26:36.970] error: Error reading step 186018.0 memory limits
comp016: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp016: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 13277)
comp016: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp016: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp016: [2016-02-03T14:08:05.674] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 59354)
comp016: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 18389)
comp016: [2016-02-03T14:08:05.795] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.808] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 64650)
comp016: [2016-02-03T14:08:05.824] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.837] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 13759)
comp016: [2016-02-03T14:08:05.851] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.865] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 40646)
comp016: [2016-02-03T14:08:05.880] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.894] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 45255)
comp016: [2016-02-03T14:08:05.909] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.922] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 52171)
comp016: [2016-02-03T14:08:05.936] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.951] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 56971)
comp016: [2016-02-03T14:08:05.966] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.980] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 60060)
comp016: [2016-02-03T14:08:05.995] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.008] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 43670)
comp016: [2016-02-03T14:08:06.024] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.041] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 21717)
comp016: [2016-02-03T14:08:06.056] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.085] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.664] [186018.1] done with job
comp016: [2016-02-03T14:08:06.760] [186018.2] done with job
comp016: [2016-02-03T14:08:06.846] [186018.5] done with job
comp016: [2016-02-03T14:08:06.885] [186018.3] done with job
comp016: [2016-02-03T14:08:25.073] [186018.7] done with job
comp016: [2016-02-03T14:08:25.074] [186018.8] done with job
comp016: [2016-02-03T14:08:25.086] [186018.4] done with job
comp016: [2016-02-03T14:08:25.088] [186018.6] done with job
comp016: [2016-02-03T14:08:25.089] [186018.11] done with job
comp016: [2016-02-03T14:08:25.091] [186018.12] done with job
comp016: [2016-02-03T14:08:25.096] [186018.9] done with job
comp016: [2016-02-05T15:17:15.634] Warning: revoke on job 186018 has no expiration
comp016: [2016-02-05T15:18:24.721] error: Error reading step 186018.0 memory limits
comp016: [2016-02-05T15:26:34.771] Warning: revoke on job 186018 has no expiration
comp016: [2016-02-05T15:26:36.991] error: Error reading step 186018.0 memory limits
comp125: [2016-02-03T14:08:05.545] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp125: [2016-02-03T14:08:05.630] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 24192)
comp125: [2016-02-03T14:08:05.630] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp125: [2016-02-03T14:08:05.630] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x200,0x001
comp125: [2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.676] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 14236)
comp125: [2016-02-03T14:08:05.780] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 27342)
comp125: [2016-02-03T14:08:05.798] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.811] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 28321)
comp125: [2016-02-03T14:08:05.826] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.840] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 28828)
comp125: [2016-02-03T14:08:05.855] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.869] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 31143)
comp125: [2016-02-03T14:08:05.884] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.898] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 19585)
comp125: [2016-02-03T14:08:05.913] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.927] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 41391)
comp125: [2016-02-03T14:08:05.942] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.956] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 62366)
comp125: [2016-02-03T14:08:05.971] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.977] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 28586)
comp125: [2016-02-03T14:08:05.995] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.999] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 31163)
comp125: [2016-02-03T14:08:06.014] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:06.038] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:06.625] [186018.1] done with job
comp125: [2016-02-03T14:08:06.852] [186018.5] done with job
comp125: [2016-02-03T14:08:06.944] [186018.3] done with job
comp125: [2016-02-03T14:08:25.082] [186018.11] done with job
comp125: [2016-02-03T14:08:25.085] [186018.6] done with job
comp125: [2016-02-03T14:08:25.098] [186018.9] done with job
comp125: [2016-02-03T14:08:25.101] [186018.10] done with job
comp125: [2016-02-03T14:08:25.108] [186018.4] done with job
comp125: [2016-02-03T14:08:25.108] [186018.7] done with job
comp125: [2016-02-03T14:08:25.118] [186018.8] done with job
comp125: [2016-02-05T15:17:15.931] Warning: revoke on job 186018 has no expiration
comp125: [2016-02-05T15:18:24.744] error: Error reading step 186018.0 memory limits
comp125: [2016-02-05T15:26:34.804] Warning: revoke on job 186018 has no expiration
comp125: [2016-02-05T15:26:36.990] error: Error reading step 186018.0 memory limits
comp125: [2016-02-11T10:35:47.088] [186018.0] Failed to send MESSAGE_TASK_EXIT: Connection refused
comp125: [2016-02-11T10:35:47.095] [186018.0] done with job
Comment 3 Tim Wickberg 2016-02-18 02:20:23 MST
That last line there is suspicious:

comp125: [2016-02-11T10:35:47.088] [186018.0] Failed to send MESSAGE_TASK_EXIT: Connection refused

Do you happen to have the slurm-186018.out file from the user? It may have some debug output in it from the srun that ran into problems. It'd also be nice to have a chunk of slurmctld.log from around that time - just searching on the job number doesn't always pick up everything that may have been related.
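
For instance, something along these lines would pull the relevant window (this assumes a typical log location and uses the job's start time as the window; adjust the path and timestamps to your site):

grep '2016-02-03T14:0[89]' /var/log/slurm/slurmctld.log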

The repeated warnings about "error: Error reading step 186018.0 memory limits" are also curious - do you see them for other jobs, or are they isolated to this specific one?

Can you also attach a copy of your slurm.conf file for reference?
Comment 4 Hadrian 2016-02-18 02:38:41 MST
Created attachment 2747 [details]
slurm.conf
Comment 5 Hadrian 2016-02-18 02:42:00 MST
comp125, the node that reports the error, is not in the list of nodes where the job ran. The user had /dev/null as the output file. Also, I see the following errors on one of the compute nodes:

[2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
[2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 47503)
[2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
[2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
[2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 12744)
[2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 30179)
[2016-02-03T14:08:05.797] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.810] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 14300)
[2016-02-03T14:08:05.825] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.839] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 5269)
[2016-02-03T14:08:05.854] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 18304)
[2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.896] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 29586)
[2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.925] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 41637)
[2016-02-03T14:08:05.941] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.954] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 6087)
[2016-02-03T14:08:05.970] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:05.983] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 16523)
[2016-02-03T14:08:05.999] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:06.013] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 24241)
[2016-02-03T14:08:06.029] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:06.043] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 33251)
[2016-02-03T14:08:06.060] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:06.089] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
[2016-02-03T14:08:06.666] [186018.1] done with job
[2016-02-03T14:08:06.758] [186018.2] done with job
[2016-02-03T14:08:06.854] [186018.5] done with job
[2016-02-03T14:08:06.950] [186018.3] done with job
[2016-02-03T14:08:25.104] [186018.9] done with job
[2016-02-03T14:08:25.104] [186018.10] done with job
[2016-02-03T14:08:25.105] [186018.6] done with job
[2016-02-03T14:08:25.107] [186018.12] done with job
[2016-02-03T14:08:25.107] [186018.4] done with job
[2016-02-03T14:08:25.109] [186018.8] done with job
[2016-02-03T14:08:25.109] [186018.7] done with job
[2016-02-05T15:17:15.267] Warning: revoke on job 186018 has no expiration
[2016-02-05T15:18:24.726] error: Error reading step 186018.0 memory limits
[2016-02-05T15:26:34.994] Warning: revoke on job 186018 has no expiration
[2016-02-05T15:26:36.980] error: Error reading step 186018.0 memory limits



Is there a way to forcefully remove the job from the queue?
Comment 6 Tim Wickberg 2016-02-18 02:57:42 MST
Restarting slurmctld should clear it up, and should be safe to do while the system is running.
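
For example, depending on how the controller daemon is managed (these are the usual systemd/SysV forms, not site-specific; the init script name may vary):

systemctl restart slurmctld
# or, with older SysV init scripts:
/etc/init.d/slurm restart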

I'm concerned about how comp125 wound up running that job - have you made any changes to the network or Slurm configuration lately?
Comment 7 Hadrian 2016-02-18 03:51:04 MST
We tried restarting slurmctld, but the job is still there. The only change made to slurm.conf is that the TaskProlog setting was uncommented. The node names/parameters were not changed.
Comment 8 Hadrian 2016-02-18 04:08:07 MST
The other observation is that the job ID actually shows up in the logs of all the compute nodes, even though the user only requested 7 nodes.
Comment 9 Tim Wickberg 2016-02-18 05:19:55 MST
We've been talking over this internally, and there are a few odd things happening here that we're trying to sort through.

It sounds like some step - possibly an errant one - is still out there when it should have been marked as completed. It also looks like comp001 and comp125 may have been mixed up at some point - is it possible there's an address conflict or some other issue between those two nodes?

We're also curious about the comp001t vs comp001 distinction. Can you explain why you have it set up that way?

Can you run 

scontrol show steps 186018

and 

sacct -j 186018

Getting logs from comp001t would also be helpful, and if you can increase the SlurmdDebug level to debug2 that may shed some light on why that job refuses to go away.
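
One way to raise the slurmd log level, assuming slurm.conf is shared across the nodes (adjust to however you manage the config; older releases may want the equivalent numeric level instead of the name):

# in slurm.conf on the compute nodes
SlurmdDebug=debug2
# then push the change out and have the daemons re-read it:
scontrol reconfigure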
Comment 10 Hadrian 2016-02-18 06:14:55 MST
comp001 and comp001t differ in the interface (the xxxt names are on the 1G network, used for provisioning and DRAC; the others are on 10G). The addresses for comp001 and comp025 have been the way they are for a long time. Please find the output of scontrol show steps 186018:

StepId=186018.0 UserId=715780 StartTime=2016-02-03T14:08:05 TimeLimit=UNLIMITED
   State=RUNNING Partition=batch NodeList=comp001t,comp002t,comp009t,comp010t,comp011t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t Gre
   Nodes=11 CPUs=12 Tasks=12 Name=mpirun Network=(null)
   ResvPorts=(null) Checkpoint=0 CheckpointDir=/home/sam238/test/mpi
   CPUFreqReq=Default


and sacct -j 186018:

 JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
186018           mpirun      batch  arc_staff         12     FAILED     11:0 
186018.0         mpirun             arc_staff         12     FAILED      0:0 
186018.1          orted             arc_staff         10  COMPLETED      0:0 
186018.2          orted             arc_staff         10  COMPLETED      0:0 
186018.3          orted             arc_staff         10  COMPLETED      0:0 
186018.4          orted             arc_staff         10     FAILED      1:0 
186018.5          orted             arc_staff         10  COMPLETED      0:0 
186018.6          orted             arc_staff         10  COMPLETED      0:0 
186018.7          orted             arc_staff         10  COMPLETED      0:0 
186018.8          orted             arc_staff         10     FAILED      1:0 
186018.9          orted             arc_staff         10     FAILED      1:0 
186018.10         orted             arc_staff         10     FAILED      1:0 
186018.11         orted             arc_staff         10     FAILED      1:0 
186018.12         orted             arc_staff         10     FAILED      1:0

We have the debug level set at 3. I have also attached comp001's log file below.
Comment 11 Hadrian 2016-02-18 06:16:12 MST
Created attachment 2749 [details]
comp001 log
Comment 12 Tim Wickberg 2016-02-18 06:56:07 MST
Can you try cancelling the orphaned step? ("scancel 186018.0")

I'd also like to see the output for 
sacct --format=jobid,state,exitcode,start,end -j 186018
Comment 13 Hadrian 2016-02-18 07:20:52 MST
I get this when I do scancel 186018.0:
scancel: error: slurm_kill_job2() failed Invalid job id specified

and when I do sacct --format=jobid,state,exitcode,start,end -j 186018, I get the following:
JobID      State ExitCode               Start                 End 
------------ ---------- -------- ------------------- ------------------- 
186018           FAILED     11:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.0         FAILED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.1      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06 
186018.2      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06 
186018.3      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06 
186018.4         FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.5      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06 
186018.6      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.7      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.8         FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.9         FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.10        FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.11        FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25 
186018.12        FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25
Comment 14 Hadrian 2016-02-22 02:29:23 MST
Any updates, Tim?
Comment 15 Tim Wickberg 2016-02-22 04:26:47 MST
It looks like scancel won't cooperate here. It's still unclear how that job got into such a state, and we're trying to sort that out further.

Can you pull some additional logs from each of the nodes for us, and check that no processes / slurmstepd's are still running for this job on any of the originally allocated nodes? (comp001t,comp002t,comp009t,comp010t,comp011t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t)

We believe that at least one stepd must still be running somewhere. If you find one and are able to connect to it with `gdb -p (PID for stepd)`, the output of 'thread apply all bt full' would give us a much better picture of the problem.
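
A quick way to check, reusing the xdsh approach from earlier (the node list and process-name pattern here are assumptions; adjust as needed):

xdsh comp001,comp002,comp009-comp016,comp125 'ps -eo pid,cmd | grep "[s]lurmstepd"'
# then, on any node where a stepd for 186018 shows up:
gdb -p <PID of the slurmstepd>
(gdb) thread apply all bt full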

One thing I noticed from a possibly related error message for comp001 - does /tmp/slurm exist on each node? Is there anything that may be automatically removing files under /tmp on the nodes such as RHEL's tmpwatch?
Comment 16 Hadrian 2016-02-22 04:46:21 MST
You are right, Tim. All the nodes had a zombie slurmstepd process running. I killed them manually, and the job was eventually removed. /tmp/slurm is cleared manually, but /tmp has plenty of space. I shall configure tmpwatch as you suggested. Thank you again.
Comment 17 Tim Wickberg 2016-02-22 04:53:41 MST
(In reply to Hadrian from comment #16)
> You are right, Tim. All the nodes had a zombie slurmstepd process running. I
> killed them manually, and the job was eventually removed.

If you see another job in this state in the future, we'd like to see the backtrace before you kill it. Without the backtrace, I don't have a way to find out what the underlying problem is.

> /tmp/slurm is
> manually cleared but /tmp has a huge space. I shall configure tmpwatch as
> you suggested. Thank you again.

Just to check - I didn't mean to suggest whether or not you should use it; I just wanted to make sure it wasn't causing problems for /tmp/slurm. Automatically removing anything under there could cause some odd problems, although I don't think it was a factor here.
Comment 18 Tim Wickberg 2016-03-03 09:39:31 MST
I'm going to go ahead and mark this as resolved for now. If it recurs, please try to get a backtrace from one of the slurmstepd's before killing them and reopen this bug.

- Tim