Ticket 333

Summary: sbatch --signal not working?
Product: Slurm
Reporter: Ryan Cox <ryan_cox>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: da
Version: 2.5.x   
Hardware: Linux   
OS: Linux   
Site: BYU - Brigham Young University

Description Ryan Cox 2013-06-14 09:01:41 MDT
Maybe I'm not understanding it correctly, but the documentation for sbatch seems to indicate that --signal should allow a specified signal to be sent to a job at a certain time before the job's time limit.  I am never able to receive that signal, though I can send it via "scancel -b" and I can also catch TERM and CONT as the job reaches its time limit.

Here's the script I submit:
========
#!/bin/bash

#SBATCH --qos=test -n1 -t3 --signal=USR2@60 --mem-per-cpu=100M

for a in {1..64}; do trap "echo -n SIG: $a @; date" $a; done

while true
do
        date
        sleep 1
done
========

"scancel -bs USR2 $jobid" does work.  It prints:
...
Fri Jun 14 13:35:01 MDT 2013
User defined signal 2
SIG: 12 @Fri Jun 14 13:35:02 MDT 2013
Fri Jun 14 13:35:02 MDT 2013
...


At the end of the job I get:
...
Fri Jun 14 14:11:31 MDT 2013
slurmd[m6-3-8]: *** JOB 1331833 CANCELLED AT 2013-06-14T14:11:32 ***
Terminated
SIG: 15 @Fri Jun 14 14:11:32 MDT 2013
SIG: 18 @Fri Jun 14 14:11:32 MDT 2013
Fri Jun 14 14:11:32 MDT 2013
... (continues for the period KillWait)


So it does catch TERM and CONT, just not the one I request.  It does catch a signal if I specify scancel -b.

Am I not understanding what the behavior is supposed to be like?  Basically, I want to request that a certain signal be sent at a certain time prior to the job hitting its walltime.  I want to trap that signal in the submission script and do something with it.
Comment 1 Danny Auble 2013-06-14 09:28:52 MDT
This option does not signal the batch script itself, only any srun running in the script.  The code specifically skips the batch step, most likely to avoid the script exiting prematurely.  If you would like to change that behaviour, you can remove the code in src/slurmd/slurmd/req.c around line 3225.

Outside of commenting out that code, I don't think there is a way to get this to work the way you want today.  I would also be very cautious about commenting out those lines because of the issue mentioned earlier.

I'll look into this and see if there is something that will make this happen.  Looking at the code it appears it would be easy to add a new option for this, but it would be new code and would probably be one or the other, meaning either signal the batch script or the running sruns.  What do you think?
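
To illustrate the behaviour described in this comment: since --signal reaches the srun steps and not the batch shell, one possible arrangement is to run the work as a step and trap the signal there. This is a minimal sketch under that assumption, not something tested against 2.5.x. The payload body and the dry-run branch (used when SLURM_JOB_ID is unset) are hypothetical additions for illustration.

```shell
#!/bin/bash
#SBATCH -n1 -t3 --signal=USR2@60

# Hypothetical payload: it traps USR2 itself, on the assumption that the
# signal is delivered to the processes of the srun step, not to this shell.
payload='
trap "echo step caught USR2; exit 0" USR2
while true; do
    date
    sleep 1 &
    wait $!      # wait returns early when a trapped signal arrives
done
'

if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Inside a job: run the payload as a step so --signal can reach it.
    srun bash -c "$payload"
    step_status=$?
else
    # Dry run outside Slurm: start the payload and send USR2 by hand.
    bash -c "$payload" &
    step_pid=$!
    sleep 2
    kill -USR2 "$step_pid"
    wait "$step_pid"
    step_status=$?
fi
echo "payload exit status: $step_status"
```

The batch shell only launches the step and waits for it, so nothing in the batch script itself needs to see the signal.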
Comment 2 Ryan Cox 2013-06-14 09:34:29 MDT
We would only want it to signal the batch script.  A new option "--signal-batch" or something like that would work perfectly.
Comment 3 Danny Auble 2013-06-14 09:36:13 MDT
I think something like that is doable.  It will not make it in 2.6 though.  I'll keep this open until it is added.
Comment 4 Danny Auble 2013-06-21 08:34:35 MDT
Ryan, this is being looked at.  The current thought is to use --signal=bSIGNAL, where the b means send the signal to the batch script instead of the running steps.  That way you don't need a new option on the command line.  Let me know if you are against this idea.
Comment 5 Ryan Cox 2013-06-21 08:35:52 MDT
Sounds good to me.  Thanks.
Comment 6 Moe Jette 2013-08-06 09:49:55 MDT
This will be addressed in Slurm version 13.12 with the commit shown below. I would recommend not applying this patch to version 2.6 since it changes the job state save file and RPCs. I'm also going to close this ticket for tracking purposes.

https://github.com/SchedMD/slurm/commit/20395610a5b77aac3423231fc28bc705718e1e17
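
For reference, a sketch of how the reporter's script might look with the batch-shell form. The ticket only records the proposed spelling --signal=bSIGNAL. The sbatch man page of later Slurm releases writes it with a "B:" prefix, which is what is assumed here, so check the man page of the installed version. The handler name, the main_loop function and the SLURM_JOB_ID guard are hypothetical choices for illustration.

```shell
#!/bin/bash
#SBATCH -n1 -t3 --signal=B:USR2@60

keep_running=1

# Handler for the signal requested above; cleanup or checkpoint code
# (hypothetical, not part of the ticket) would go here.
on_usr2() {
    echo "caught USR2 at $(date)"
    keep_running=0
}
trap on_usr2 USR2

main_loop() {
    while [ "$keep_running" -eq 1 ]; do
        date
        # bash defers a trap handler until the foreground command returns,
        # so sleep in the background and wait on it; wait is interrupted
        # as soon as the trapped signal arrives (and then returns >128).
        sleep 1 &
        wait $! || true
    done
}

# Run the loop only inside a job, so the file can also be sourced in an
# ordinary shell to try out the handler.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main_loop
fi
```

With this directive the job would be sent USR2 roughly 60 seconds before its 3 minute limit, and the loop ends once the handler has run instead of continuing until TERM arrives.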