Maybe I'm not understanding it correctly, but the documentation for sbatch seems to indicate that --signal should allow a specified signal to be sent to a job at a certain time before the job's time limit. I am never able to receive that signal, though I can send it via "scancel -b" and I can also catch TERM and CONT as the job reaches its time limit.

Here's the script I submit:

========
#!/bin/bash
#SBATCH --qos=test -n1 -t3 --signal=USR2@60 --mem-per-cpu=100M

for a in {1..64}; do trap "echo -n SIG: $a @; date" $a; done

while true
do
    date
    sleep 1
done
========

"scancel -bs USR2 $jobid" does work. It prints:

...
Fri Jun 14 13:35:01 MDT 2013
User defined signal 2
SIG: 12 @Fri Jun 14 13:35:02 MDT 2013
Fri Jun 14 13:35:02 MDT 2013
...

At the end of the job I get:

...
Fri Jun 14 14:11:31 MDT 2013
slurmd[m6-3-8]: *** JOB 1331833 CANCELLED AT 2013-06-14T14:11:32 ***
Terminated
SIG: 15 @Fri Jun 14 14:11:32 MDT 2013
SIG: 18 @Fri Jun 14 14:11:32 MDT 2013
Fri Jun 14 14:11:32 MDT 2013
...
(continues for the period KillWait)

So it does catch TERM and CONT, just not the one I request. It does catch a signal if I specify scancel -b.

Am I not understanding what the behavior is supposed to be like? Basically, I want to request that a certain signal be sent at a certain time prior to the job hitting its walltime. I want to trap that signal in the submission script and do something with it.
This option does not signal the batch script itself; it only signals any srun running in the script. The code specifically skips the batch step, most likely to avoid the script exiting prematurely. If you would like to change that behaviour, you can remove the code in src/slurmd/slurmd/req.c around line 3225. Outside of commenting out that code, I don't think there is a way to get this to work the way you want today. I would also be very cautious about commenting out those lines because of the premature-exit issue mentioned above.

I'll look into this and see if there is something that will make this happen. Looking at the code, it appears it would be easy to add a new option for this, but it would be new code and would probably be one or the other, meaning it would signal either the batch script or the running sruns. What do you think?
We would only want it to signal the batch script. A new option "--signal-batch" or something like that would work perfectly.
I think something like that is doable. It will not make it in 2.6 though. I'll keep this open until it is added.
Ryan, this is being looked at. The current thought is to use --signal=bSIGNAL, where the b means send the signal to the batch script instead of the running steps. That way you don't need a new option on the command line. Let me know if you are against this idea.
Sounds good to me. Thanks.
This will be addressed in Slurm version 13.12 with the commit shown below. I would recommend not applying this patch to version 2.6 since it changes the job state save file and RPCs. I'm also going to close this ticket for tracking purposes. https://github.com/SchedMD/slurm/commit/20395610a5b77aac3423231fc28bc705718e1e17
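For anyone landing here later, a minimal sketch of a submission script that traps the warning signal in the batch shell, as requested above. Some assumptions to flag: the thread proposes the spelling --signal=bSIGNAL, while the sbatch man pages of released versions that I am aware of document it as --signal=B:USR2@60, so check the man page for your version. The got_signal flag, the on_usr2 handler name and the self-signalling branch for runs outside Slurm (detected by SLURM_JOB_ID being unset) are illustrative additions, not part of Slurm. The -t3 and --mem-per-cpu values are copied from the original script.

```shell
#!/bin/bash
#SBATCH -n1 -t3 --signal=B:USR2@60 --mem-per-cpu=100M
# Assumption: "B:" asks Slurm to signal only the batch shell, 60 seconds
# before the time limit. Verify the spelling against your sbatch man page.

got_signal=0

# Handler (hypothetical name): record that the warning arrived.
on_usr2() {
    echo "SIG: USR2 @ $(date)"
    got_signal=1
}
trap on_usr2 USR2

# Outside Slurm (SLURM_JOB_ID unset), simulate the warning by sending
# USR2 to this shell after 2 seconds so the handler can be tried locally.
if [ -z "${SLURM_JOB_ID:-}" ]; then
    ( sleep 2; kill -USR2 $$ ) &
fi

while [ "$got_signal" -eq 0 ]; do
    date
    # Run sleep in the background and wait on it: bash runs a trap
    # handler as soon as "wait" is interrupted, whereas a foreground
    # command delays the handler until that command finishes.
    sleep 1 &
    wait $!
done

echo "warning received, saving state before the time limit"
```

The last echo stands in for whatever the job needs to do before it is killed (write a checkpoint, resubmit itself, and so on). Keep that work well under the lead time requested with @60.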