We are trying to requeue the job from the Slurm prolog based on node health, and to close the node. The job requeue is working, but the job goes into a held state. From the task prolog it works fine, as I want, but running this check from the task prolog is not efficient.

Best Regards
Hi, this is not supposed to happen: the jobs should simply go back to the pending state, with their scheduling delayed by a few minutes. How do you requeue the job? Do you use requeuehold? Here is an example of a prolog that simply requeues the job and sets the node state to down:

#!/bin/sh
LOG="/tmp/prolog.log"
echo "prolog running `date`" >> $LOG
env | grep SLURM | sort >> $LOG
/home/david/clusters/1411/linux/bin/scontrol update node=$SLURM_NODELIST state=down reason="aa" >> $LOG
echo "update exit $? " >> $LOG
/home/david/clusters/1411/linux/bin/scontrol requeue $SLURM_JOBID >> $LOG
echo "requeue exit $? " >> $LOG
exit 0

David
Generally you would want to set a node state to DRAIN rather than DOWN. Draining a node means jobs already allocated to the node can continue running until they complete or are requeued. Setting a node DOWN causes all jobs allocated to that node to be killed or requeued immediately.
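As a sketch, the difference looks like this (the node name and reason text are placeholders, not from this ticket):

```shell
# DRAIN: jobs already running on the node continue; no new jobs are scheduled
scontrol update NodeName=node01 State=DRAIN Reason="scratch FS check failed"

# DOWN: jobs allocated to the node are killed or requeued immediately
scontrol update NodeName=node01 State=DOWN Reason="scratch FS check failed"

# Return the node to service once it is healthy again
scontrol update NodeName=node01 State=RESUME
```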
Here is the part of the code which checks the scratch filesystem:

if [ $ScrHcRc -ge 1 ]
then
    echo "ProLog ScratchFs issue on $hostn,runuser=$runuser" >> $tmp_file
    /usr/local/slurm/bin/scontrol update node=$hostn state=DRAIN reason="HealthCheck Scratch FS Issues" >> $tmp_file 2>&1
    /usr/local/slurm/bin/scontrol requeue $SLURM_JOBID >> $tmp_file 2>&1
    rc=100
fi
exit $rc

I can try this again on my test cluster using the latest Slurm version and update the ticket again.

--Naseem
(In reply to mn from comment #3)
> Here is the part of code which check Scratch Filesytem
>
> if [ $ScrHcRc -ge 1 ]
> then
>     echo "ProLog ScratchFs issue on $hostn,runuser=$runuser" >> $tmp_file
>     /usr/local/slurm/bin/scontrol update node=$hostn state=DRAIN reason="HealthCheck Scratch FS Issues" >> $tmp_file 2>&1
>     /usr/local/slurm/bin/scontrol requeue $SLURM_JOBID >> $tmp_file 2>&1
>     rc=100
> fi
> exit $rc
>
> I can try this again on my test using latest slurm version and update the
> ticket again.
>
> --Naseem

I think it's the non-zero exit code that's the problem. David, can you check that?
No, it is not; I already tried that. Are you using the Prolog or the PrologSlurmctld?

David
I am using the Prolog. Here is what I have in my slurm.conf:

Prolog=/usr/local/slurm/tools/prolog.sh

thx
--Naseem
Please attach a copy of that file.

On February 25, 2015 3:29:34 AM PST, bugs@schedmd.com wrote:
>I am using prolog
>here is what I have in my slurm.conf
>Prolog=/usr/local/slurm/tools/prolog.sh
>--Naseem
Hi, my prolog works the same as yours:

/home/david/clusters/1411/linux/bin/scontrol update node=$SLURM_JOB_NODELIST state=drain reason="aa" >> $LOG
echo "update exit $? " >> $LOG
/home/david/clusters/1411/linux/bin/scontrol requeue $SLURM_JOBID >> $LOG
echo "requeue exit $? " >> $LOG

unless you have some other instructions later on.

David
Hi Naseem, do you have any update on this issue?

David
Hi, did you try again to reproduce this issue? I suspect that the launch of the job fails even before the prolog runs, and that is why the job is requeued in the held state.

David
Please reopen if necessary. David
I know what is going on here. If the prolog exits with a value different from 0, the host is drained and the job is requeued in a held state if it is a batch job, or aborted if it is interactive.

What happened here is a race condition between the scontrol requeue command in the prolog and the exit 100 from the prolog itself. In my test cluster the scontrol was faster than the requeue API from slurmstepd, so the controller discarded the second requeue|held request because the job was already requeued. In your cluster it was the opposite: the requeue|held arrived first, and the scontrol request was discarded.

The immediate workaround would be to sleep 2 seconds in the prolog, allowing the scontrol to go through, and then exit 100. However, this solution is also not perfect, as in those 2 seconds the scheduler may dispatch the job somewhere else, and then it will be requeued|held by the failed prolog.

The best way is to introduce another scheduler parameter (SchedulerParameters in slurm.conf), say requeue_on_prolog_fail, to not requeue the job in a held state. With this parameter it won't be necessary for you to requeue the job using scontrol; just exit from the prolog with a value != 0.

Another option you may want to consider is to make the host check at the beginning of your job and then exit with a specific value, taking advantage of the RequeueExit and RequeueExitHold features described in the slurm.conf man page.

David
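The RequeueExit/RequeueExitHold approach could look roughly like this. This is only a sketch: the exit codes 141/142, the /scratch mount point, and the df check are illustrative assumptions, not something specified in this ticket.

```shell
# slurm.conf fragment (sketch): map specific job exit codes to requeue
# behavior, as described in the slurm.conf man page.
#   RequeueExit=141       # requeue the job, leave it eligible to run
#   RequeueExitHold=142   # requeue the job and hold it

#!/bin/sh
#SBATCH --job-name=health-checked-job
# Hypothetical health check at the top of the batch script: if the
# scratch filesystem looks unhealthy, exit with the code slurm.conf
# maps to "requeue", so Slurm retries the job instead of failing it.
if ! df /scratch >/dev/null 2>&1; then
    exit 141
fi
# ... real work follows ...
```

This moves the check out of the prolog entirely, avoiding the race between scontrol requeue and the prolog's own exit code.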
Thanks for the root cause analysis; good finding. I like the idea of requeue_on_prolog_fail. Is this parameter ready, or does it need a patch?

Best Regards
Patch available in 14.11.5, commit 2c83bf4e1e830f0. Until 14.11.5 is released you can back-port the patch to your code base if you wish.

Thanks,
David