Ticket 1479 - prolog job requeue put the job in held state
Summary: prolog job requeue put the job in held state
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-02-23 23:02 MST by mn
Modified: 2015-03-05 07:31 MST
CC: 2 users

See Also:
Site: KAUST
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.11.5
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description mn 2015-02-23 23:02:05 MST
We are trying to requeue the job from the Slurm prolog based on a node health check, and to close (drain) the node. The job requeue works, but the job goes into a held state.

Doing this from the task prolog works as I want, but running this check from the task prolog is not efficient.

Best regards
Comment 1 David Bigagli 2015-02-24 04:25:58 MST
Hi, this is not supposed to happen: the job should simply go back to the pending state, with its scheduling delayed by a few minutes. How do you requeue the job? Do you specify requeue with hold?

Here is an example of prolog that simply requeues the job and set the node
state down.

#!/bin/sh
LOG="/tmp/prolog.log"
echo "prolog running `date`" >> $LOG
env | grep SLURM | sort >> $LOG
/home/david/clusters/1411/linux/bin/scontrol update node=$SLURM_NODELIST state=down reason="aa" >> $LOG
echo "update exit $?" >> $LOG
/home/david/clusters/1411/linux/bin/scontrol requeue $SLURM_JOBID >> $LOG
echo "requeue exit $?" >> $LOG
exit 0

David
Comment 2 Moe Jette 2015-02-24 04:37:13 MST
Generally you would want to set a node state to DRAIN rather than DOWN. Draining a node means jobs already allocated the node can continue running until complete or requeued. Setting a node DOWN will cause all jobs allocated to that node to get killed or requeued immediately.
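As a sketch of the difference (the node name here is illustrative, and scontrol is assumed to be in $PATH):

# DRAIN: no new jobs are scheduled on the node, but jobs already
# allocated to it keep running until they complete or are requeued.
scontrol update NodeName=node001 State=DRAIN Reason="HealthCheck Scratch FS Issues"

# DOWN: all jobs allocated to the node are killed or requeued immediately.
scontrol update NodeName=node001 State=DOWN Reason="hardware failure"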
Comment 3 mn 2015-02-24 04:55:07 MST
Here is the part of the code which checks the scratch filesystem:

if [ "$ScrHcRc" -ge 1 ]
then
    echo "ProLog ScratchFs issue on $hostn,runuser=$runuser" >> $tmp_file
    /usr/local/slurm/bin/scontrol update node=$hostn state=DRAIN reason="HealthCheck Scratch FS Issues" >> $tmp_file 2>&1
    /usr/local/slurm/bin/scontrol requeue $SLURM_JOBID >> $tmp_file 2>&1
    rc=100
fi
exit $rc    # rc is assumed to be initialized earlier in the script

I can try this again on my test using latest slurm version and update the ticket again.

--Naseem
Comment 4 Moe Jette 2015-02-24 04:57:10 MST
(In reply to mn from comment #3)

I think it's the non-zero exit code that's the problem. David, can you check that?
Comment 5 David Bigagli 2015-02-24 05:11:37 MST
No, it is not; I already tried that. Are you using Prolog or PrologSlurmctld?

David
Comment 6 mn 2015-02-24 21:29:34 MST
I am using Prolog. Here is what I have in my slurm.conf:
Prolog=/usr/local/slurm/tools/prolog.sh

thx
--Naseem
Comment 7 Moe Jette 2015-02-25 00:47:51 MST
Please attach a copy of that file.

Comment 8 David Bigagli 2015-02-25 03:22:48 MST
Hi,
  my prolog works the same as yours:

/home/david/clusters/1411/linux/bin/scontrol update node=$SLURM_JOB_NODELIST state=drain reason="aa" >> $LOG
echo "update exit $?" >> $LOG
/home/david/clusters/1411/linux/bin/scontrol requeue $SLURM_JOBID >> $LOG
echo "requeue exit $?" >> $LOG

unless you have some other instructions later on.

David
Comment 9 David Bigagli 2015-03-02 08:20:29 MST
Hi Naseem,
  do you have any update on this issue?

David
Comment 10 David Bigagli 2015-03-03 09:45:19 MST
Hi,
  did you try again to reproduce this issue? I suspect that the launch of the job fails even before the prolog runs, and that is why the job is requeued in the held state.

David
Comment 11 David Bigagli 2015-03-04 04:26:52 MST
Please reopen if necessary.

David
Comment 12 David Bigagli 2015-03-04 11:19:52 MST
I know what is going on here. If the prolog exits with a value different from 0, the host is drained and the job is requeued in the held state (if it is a batch job) or aborted (if it is interactive).

What happened here is a race condition between the scontrol requeue command in the prolog and the exit 100 from the prolog itself. In my test cluster the scontrol was faster than the requeue API from slurmstepd, so the controller discarded the second requeue-and-hold request because the job had already been requeued. In your cluster it was the opposite: the requeue-and-hold arrived first and the scontrol request was discarded. The immediate workaround would be to sleep for 2 seconds in the prolog, allowing the scontrol request to go through, and then exit 100. However, this solution is also not perfect, because in those 2 seconds the scheduler may dispatch the job somewhere else, and it will then be requeued and held by the failed prolog.
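The race described above can be sketched with a toy model (all names here are hypothetical stand-ins, not Slurm code): two requeue requests race to the controller, which honors only the first one for a given job and discards the other.

```python
import threading

class ToyController:
    """Toy stand-in for slurmctld: the first requeue request for a
    job wins; any later request for the same job is discarded."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = None          # final job state, set by the winner
        self.discarded = []        # requests that arrived too late

    def requeue(self, source, hold):
        with self._lock:
            if self.state is None:
                self.state = "REQUEUE_HOLD" if hold else "REQUEUE"
                return True
            self.discarded.append(source)
            return False

ctl = ToyController()
# "scontrol requeue" issued from inside the prolog: plain requeue, no hold.
t1 = threading.Thread(target=ctl.requeue, args=("scontrol", False))
# slurmstepd reacting to the prolog's non-zero exit: requeue and hold.
t2 = threading.Thread(target=ctl.requeue, args=("slurmstepd", True))
t1.start(); t2.start(); t1.join(); t2.join()

# Whichever request arrived first determined the job's fate; the other
# was discarded -- the behavior seen on both clusters in this ticket.
print(ctl.state, ctl.discarded)
```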

The best way is to introduce another scheduler parameter in slurm.conf (SchedulerParameters), say requeue_on_prolog_fail, to not requeue the job in the held state. With this parameter it won't be necessary for you to requeue the job using scontrol; just exit from the prolog with a value != 0.

Another option you may want to consider is to make the host check at the beginning of your job and then exit with a specific value, taking advantage of the RequeueExit and RequeueExitHold features described in the slurm.conf man page.

David
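The RequeueExit/RequeueExitHold approach described above might look roughly like this (the exit codes and the check_scratch_fs helper are illustrative assumptions; see the slurm.conf man page for the exact syntax in your version):

# slurm.conf: map job exit codes to requeue actions
RequeueExit=100         # requeue; job may be scheduled again
RequeueExitHold=101     # requeue and hold the job

#!/bin/sh
# Top of the batch job script: run the health check ourselves and exit
# with a code that slurm.conf maps to a requeue action.
if ! check_scratch_fs; then
    exit 100    # matches RequeueExit above: job is requeued, not held
fi
# ... real job starts here ...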
Comment 13 mn 2015-03-04 16:36:21 MST
Thanks for the root cause analysis, good finding!

I like the idea of requeue_on_prolog_fail.

Is this parameter ready, or does it need a patch?

Best regards
Comment 14 David Bigagli 2015-03-05 07:31:23 MST
A patch is available in 14.11.5, commit 2c83bf4e1e830f0. Until 14.11.5 is released you can backport the patch to your code base if you wish.

Thanks,
       David