| Summary: | Cannot Release Jobs with JobHeldUser | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Will French <will> |
| Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex, davide.vanzo |
| Version: | 15.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Vanderbilt | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurmdbd.conf, cgroup.conf, slurmctld logs | | |
Created attachment 2730 [details]
slurm.conf
Created attachment 2731 [details]
slurmdbd.conf
Created attachment 2732 [details]
cgroup.conf
To provide an update since Friday: these jobs did end up getting scheduled and ran correctly. There are still a few lingering issues:

1. Why were the jobs marked as JobHeldUser in the first place? The user never put a hold on these jobs, and it appears this happened as a result of the job(s) landing on a node that was dying.

2. Why was the JobHeldUser field not updated after admins released those jobs for scheduling? It's also unclear whether releasing the jobs was a necessary step at all. At the time, the user had run up against their GrpCPU limit, so these jobs may simply not have been starting due to resource restrictions.

Hi Will,

(In reply to Will French from comment #4)
> To provide an update since Friday, these jobs did end up getting scheduled
> and ran correctly. There are still a few lingering issues:
>
> 1. Why were the jobs marked as JobHeldUser in the first place? The user
> never put a hold on these jobs and it appears that this happened as a result
> of the job(s) landing on a node that was dying.

On a prolog or job launch failure, the job may end up marked as JobHeldUser. So if the node suffered a hard drive failure, this behavior is expected.

> 2. Why was the JobHeldUser field not updated after admins released those
> jobs for scheduling? It's also unclear if releasing the jobs was a necessary
> step at all. At the time, the user had run up against its GrpCPU limit so
> these jobs may have not been starting simply due to resource restrictions.

Let me investigate why JobHeldUser was not updated after admins released these jobs. I believe releasing the jobs is a necessary step.

Will,
could you please attach your slurmctld.log file?
I want to look for any of these messages:
info("sched: update_job: releasing hold for job_id %u uid %u", job_ptr->job_id, uid);
info("ignore priority reset request on held job %u", job_ptr->job_id);
debug("%s: job %d already release ignoring request", __func__, job_ptr->job_id);
I believe the Reason should have been changed to WAIT_NO_REASON ("None") after the release command, as coded in src/slurmctld/job_mgr.c at line 10639, inside the _update_job() function. Meanwhile, I'll try to reproduce this myself.
Created attachment 2736 [details]
slurmctld logs
This is actually all of /var/log/messages on our primary Slurm controller server for the last week.
So, filtering the slurmctld.log, I see these messages for these two tasks:
alex@pc:~/Downloads$ grep -E "7131402_18|7131402_19" messages-20160214
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: Requeuing JobID=7131402_18(7136844) State=0x0 NodeCnt=0
Feb 12 13:11:57 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp717
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=112888 usec=380
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=0 usec=567
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=0 usec=466
Feb 13 08:19:12 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp478
Feb 13 08:35:09 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp424
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x8003 NodeCnt=1 done
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x8003 NodeCnt=1 done
alex@pc:~/Downloads$
I've tried to reproduce this myself by creating a Prolog script that exits with a non-zero value.
$ scontrol show config | grep -w Prolog
Prolog = /path/to/prolog
$ cat /path/to/prolog
#!/bin/bash
exit 1
Check that sinfo shows the node idle:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 idle compute1
Submit array batch job with 2 tasks:
$ sbatch --array=0-1 --wrap="hostname"
Submitted batch job 20026
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20026_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
$ scontrol show job 20026 | grep -E "JobId|JobState|Reason"
JobId=20026 ArrayJobId=20026 ArrayTaskId=1 JobName=wrap
JobState=PENDING Reason=Resources Dependency=(null)
JobId=20027 ArrayJobId=20026 ArrayTaskId=0 JobName=wrap
JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
We see that task 0 has Reason=launch_failed_requeued_held. The slurmctld log shows:
slurmctld: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
slurmctld: error: Prolog failure on node compute1, draining the node
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ scontrol release 20026_0
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20026_[0-1] part1 wrap alex PD 0:00 1 (Resources)
slurmctld: sched: update_job: releasing hold for job_id 20027 uid 1000
$ scontrol update nodename=compute1 state=resume
slurmctld: update_node: node compute1 state set to IDLE
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20026_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
Trying to reproduce your context, what I see is that 'scontrol release <jobid>' works as expected. Probably what happened is that the jobs were released, but the node was still failing for whatever reason (prolog/node failures), and the job tasks were requeued again.
Would this make sense?
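On a live system, one quick way to check whether a released task got re-held is to look at its state and reason right after the release, e.g. with `squeue -h -j <jobid> -o "%T %r"`. Here is a minimal self-contained sketch that classifies such a line; the sample value is a captured stand-in, not live output:

```shell
# Classify a "STATE REASON" line as squeue -h -j <jobid> -o "%T %r" would
# print it. The sample below is a hard-coded stand-in so the sketch runs
# anywhere; on a real cluster you would capture it from squeue instead.
sample='PENDING launch_failed_requeued_held'

state=${sample%% *}     # first field: job state
reason=${sample#* }     # remainder: reason string

case "$reason" in
  JobHeldUser|launch_failed_requeued_held)
    echo "job is held again: $reason" ;;   # prints: job is held again: launch_failed_requeued_held
  *)
    echo "job not held (state=$state reason=$reason)" ;;
esac
```

Polling this a few times after `scontrol release` would show whether the job immediately flips back to a held reason.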
> Trying to reproduce your context, what I see is that 'scontrol release
> <jobid>' works as expected. Probably what happened is that jobs were
> released but node was still failing for whatever reason (prolog/node
> failures) and job tasks were requeued again.
>
> Would this make sense?
Do you mean that these jobs kept going to the same failing node repeatedly? If so, that shouldn't have happened since we downed the node before releasing these jobs.
(In reply to Will French from comment #9)
> > Trying to reproduce your context, what I see is that 'scontrol release
> > <jobid>' works as expected. Probably what happened is that jobs were
> > released but node was still failing for whatever reason (prolog/node
> > failures) and job tasks were requeued again.
> >
> > Would this make sense?
>
> Do you mean that these jobs kept going to the same failing node repeatedly?
> If so, that shouldn't have happened since we downed the node before
> releasing these jobs.

What I'm suggesting, though I may be wrong, is that the job kept going to the same or another failing node, and was then requeued and marked as JobHeldUser again. If you are sure that the node was down before releasing, maybe the job allocated a different node. In fact, looking at the logs, the job is allocated different nodes:

Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 13 08:19:12 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp478

I've also tried submitting an array of 2 tasks: one of the tasks was marked as JobHeldUser and the node state changed to drain; I then changed the node to down and released the held job, and the Reason was changed properly to Resources.
So I think that, at least in my case, this is correct; maybe in your system context there's something we are missing to reproduce:

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 idle compute1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
$ sbatch --array=0-1 --wrap="hostname"
Submitted batch job 20043
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20043_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ scontrol update nodename=compute1 state=down reason="test"
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20043_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
$ scontrol update nodename=compute1 state=resume
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 idle compute1
$ scontrol update nodename=compute1 state=down reason="test"
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 down compute1
$ scontrol release 20043_0
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20043_[0-1] part1 wrap alex PD 0:00 1 (Resources,BeginTime)

It's also strange that I can't find in the logs a message like this:

Feb 12 17:00:49 slurmsched1 slurmctld[12411]: sched: update_job: releasing hold for job_id 7136808 uid 112888

but for job_id 7131402_18 or 7131402_19:

$ grep -E "7131402_18|7131402_19|hold for job_id 7131402" messages-20160214

Will,

I think we've managed to clearly identify what's going on here. When I grepped your slurmctld.log file in my previous comments, I was only taking into account the array_job_id and the array_task_id, and didn't grep for the job_id itself.
If I add the job_id of both tasks to the grep, I can find the release message:

alex@pc:~/Downloads$ grep -E "7131402_18|7131402_19|7136844|7136862" messages-20160214
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: Requeuing JobID=7131402_18(7136844) State=0x0 NodeCnt=0
Feb 12 13:11:57 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp717
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: sched: update_job: releasing hold for job_id 7136844 uid 112888
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: sched: update_job: releasing hold for job_id 7136862 uid 112888
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=112888 usec=380
Feb 12 17:10:03 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 293 for job_id 7136862
Feb 12 17:10:03 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 296 for job_id 7136844
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 338 for job_id 7136844
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=0 usec=567
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 335 for job_id 7136862
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=0 usec=466
Feb 13 08:19:12 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp478
Feb 13 08:35:09 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp424
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x8003 NodeCnt=1 done
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x8003 NodeCnt=1 done
alex@pc:~/Downloads$

So, as you can see, both tasks started on vmp717, which failed, and both got requeued. You then released them, and finally they completed on vmp478 and vmp424 respectively. So the release worked as expected.

Regarding the Reason, Slurm had not yet updated it when you executed squeue right after the scontrol release command, but it does get updated with time.

Hope this makes things clear to you now.

(In reply to Alejandro Sanchez from comment #11)
> Will,
>
> I think we've managed to clearly identify what's going on here. When I
> grepped your slurmctld.log file in my previous comments, I was just taking
> into account the array_job_id and the array_task_id, but didn't grep for the
> job_id itself. If I add the job_id of both tasks to the grep, I can find the
> release message:
>
> [grep output quoted above trimmed]
>
> So as you can see, both tasks started in vmp717, which failed, and both got
> requeued. Then you released them and finally they completed in vmp478 and
> vmp424 respectively.
>
> So the release worked as expected.

Yes, that's my interpretation as well.

> Regarding the REASON, Slurm had not yet updated the REASON when you executed
> squeue after the scontrol release command. But with time it gets updated.

We have a cron job that runs about every hour, checks for held jobs, and emails admins when it finds any. Based on those email alerts, it appears that after the jobs were released, but while they were still in the PENDING state (~6-8 hours depending on the job), the Reason remained JobHeldUser. If that's normal or expected then so be it; I just want to provide all the details in case this is not the intended behavior.

The Reason should be changed right after the info message:
[...]
info("sched: update_job: releasing hold for job_id %u uid %u", job_ptr->job_id, uid);
job_ptr->state_reason = WAIT_NO_REASON;
job_ptr->job_state &= ~JOB_SPECIAL_EXIT;
[...]
src/slurmctld/job_mgr.c 15279L
case WAIT_NO_REASON:
return "None";
src/common/slurm_protocol_defs.c 4305L
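The identifier subtlety uncovered above (array tasks appearing in slurmctld logs under both the array task ID, e.g. 7131402_18, and the raw JobId, e.g. 7136844) can be shown with a tiny self-contained demo. The log lines here are abbreviated stand-ins for the attached /var/log/messages:

```shell
# Array tasks are logged under two identifiers: the array task ID and the
# raw JobId. A grep on the array ID alone misses messages keyed by the raw
# id, such as the "releasing hold for job_id ..." line.
log='backfill: Started JobId=7131402_18 (7136844) in production on vmp717
sched: update_job: releasing hold for job_id 7136844 uid 112888
_slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413'

# Array task ID only: matches 2 of the 3 lines, missing the release message.
printf '%s\n' "$log" | grep -c -E '7131402_18'            # prints 2

# Array task ID or raw JobId: matches all 3 lines.
printf '%s\n' "$log" | grep -c -E '7131402_18|7136844'    # prints 3
```

When auditing what happened to an array task, grepping for both identifiers avoids the false conclusion that a release never occurred.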
So I'm closing this for now as resolved/infogiven. If you encounter more jobs stuck in JobHeldUser after release, please reopen the ticket and attach the slurmctld.log and the relevant slurmd.log files.
We had a node die (hard drive failure) today, which we identified when we noticed several jobs being put in either a JobHeldUser or "launch failed requeued held" state. After downing the node, we were able to release the jobs with "launch failed requeued held" status, but not the jobs that list JobHeldUser:

[root@vmps11 ~]# squeue | grep -i held | tail -2
7131402_18 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
7131402_19 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
[root@vmps11 ~]# scontrol release 7131402_18
[root@vmps11 ~]# squeue | grep -i held | tail -2
7131402_18 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
7131402_19 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
[root@vmps11 ~]# scontrol release 7131402_19
[root@vmps11 ~]# squeue | grep -i held | tail -2
7131402_18 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
7131402_19 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)

We have about 150 jobs (from two different users, and all job arrays, if that's important) in this stuck state where we cannot release them for scheduling. We have also tried releasing the jobs while logged in as the user. No luck.

The JobHeldUser state is especially interesting since the Slurm docs appear to indicate that it is only listed when a user places a hold on his/her own job. However, both users have confirmed that they did not initiate the hold.
Here are some logs from slurmctld:

root@slurmsched1:~# grep 7131402_18 /var/log/messages
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: Requeuing JobID=7131402_18(7136844) State=0x0 NodeCnt=0
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=0 usec=567
root@slurmsched1:~# grep 7131402_19 /var/log/messages
Feb 12 13:11:57 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp717
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=112888 usec=380
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=0 usec=466

Here is info about one of these jobs:

root@slurmsched1:~# scontrol show job 7131402_19 -dd
JobId=7136862 ArrayJobId=7131402 ArrayTaskId=19 JobName=BCell
UserId= GroupId=
Priority=336 Nice=0 Account=chgr QOS=normal
JobState=PENDING Reason=JobHeldUser Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
SubmitTime=2016-02-12T13:11:59 EligibleTime=2016-02-12T13:14:00
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=production AllocNode:Sid=vmps09:23221
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) BatchHost=vmp717
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=19200,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=vmp717 CPU_IDs=4-5 Mem=19200
MinCPUsNode=1 MinMemoryNode=19200M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
StdIn=/dev/null
BatchScript=
#!/bin/bash
##set job-name to match directory name containing split.raw files
##update wall time as needed
##
#SBATCH --job-name=BCell
#SBATCH --mail-type=ALL
#SBATCH --time=0-10:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=19200
#SBATCH --array=0-24
#SBATCH --account=chgr
setpkgs -a R_3.2.0
cd /home/rinkerd/scripts/R/
cp runPheWAS_MASTER_split runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}
sed -i "s/TISSUE/${SLURM_JOB_NAME}/g" runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}
sed -i "s/XX/${SLURM_ARRAY_TASK_ID}/g" runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}
echo `date`
time R --vanilla <runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}> runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}.R.out
echo `date`
rm runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}

Thanks,
Will
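For admins hitting the same symptom with many jobs, releasing each job by hand (as in the transcript above) gets tedious. A bulk-release pipeline can be sketched as follows; the `squeue -h -t PD -o "%i %r"` invocation uses standard format options, but here a captured sample stands in for it and the scontrol commands are echoed rather than executed, so the sketch is safe to run anywhere:

```shell
# Bulk-release every pending job whose reason is JobHeldUser.
# On a live cluster the first stage would be:
#   squeue -h -t PD -o "%i %r"
# (-h: no header, -t PD: pending jobs, %i: job id, %r: reason)

held_jobids() {
  # Keep only the job IDs of user-held jobs.
  awk '$2 == "JobHeldUser" { print $1 }'
}

printf '%s\n' \
  '7131402_18 JobHeldUser' \
  '7131402_19 JobHeldUser' \
  '7131500_1 Resources' \
| held_jobids \
| while read -r jobid; do
    echo "scontrol release $jobid"   # drop the echo to actually release
  done
```

Running this prints one `scontrol release` command per held job; piping the filtered IDs to `xargs -r scontrol release` would be an equivalent one-liner.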