Ticket 6985 - sending signals to batch script works only by chance
Summary: sending signals to batch script works only by chance
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 18.08.6
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Jason Booth
Reported: 2019-05-09 05:21 MDT by Tim Ehlers
Modified: 2019-05-14 04:55 MDT
Site: GWDG


Description Tim Ehlers 2019-05-09 05:21:20 MDT
Dear SchedMD,

We have a problem sending signals to the main batch script, either with scancel or by setting a signal with "--signal=B:USR2@600".

What we want to achieve is to send the main job script a signal 10 minutes before the walltime is reached, so that it can copy all data off the local filesystems before it is erased (an epilog script deletes all local files when a job finishes).

So the main idea is to set "--signal=B:USR2@600" and define a trap in the script that copies the data back, for example to the user's home (https://info.gwdg.de/docs/doku.php?id=en:services:application_services:high_performance_computing:orca)

The problem is that the signal mostly does not arrive; I do not even get a signal when using "scancel --signal=USR2 --batch {JOBID}".

I made a simple example:

---

#!/bin/sh
#SBATCH -A cramer
#SBATCH -p sa
#SBATCH --qos=short
#SBATCH -t 15:00
#SBATCH -o out.%J
#SBATCH -n 2
#SBATCH --ntasks-per-node=1
#SBATCH -J TRAP
###SBATCH --signal=B:USR2@600

#trap 'echo "trap"; srun -n ${SLURM_JOB_NUM_NODES} --ntasks-per-node=1 hostname; exit 12' SIGUSR2
trap 'echo "trap"; date' SIGUSR2

srun sleep 1500

echo "End of job reached"

---

After submitting the job, I send a signal with scancel every 60 seconds:

tehlers@gwdu101:..tehlers/trap> while true; do scancel --signal=USR2 --batch 188199; sleep 60; done

Output (when finished) is:

tehlers@gwdu101:..tehlers/trap> cat out.188199 
slurmstepd: error: *** JOB 188199 ON sa029 CANCELLED AT 2019-05-09T12:37:37 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 188199.0 ON sa029 CANCELLED AT 2019-05-09T12:37:37 DUE TO TIME LIMIT ***

There is no trace of my "echo trap", so the trap was never executed. The same seems to be true when the signal is configured with "--signal". I managed to execute the trap once, but the same job did not work when I ran it a second time.

Am I doing something wrong? Or is there a problem with sending signals on Slurm 18.08.6-2?

Thanks

Tim
Comment 2 Jason Booth 2019-05-09 10:00:43 MDT
Hi Tim,

This question has been asked by other sites in the past. This is expected behavior: bash waits for the running foreground command to return before it processes trapped signals.

https://groups.google.com/forum/#!searchin/slurm-users/wait%7Csort:date/slurm-users/MWo-TfqgNww/0YCZAcC1BAAJ

sleep 1000 &   # <----- Your program goes here. It is important to run it in the
               # background, otherwise bash will not process the signal until
               # this command finishes.

wait   # <---- Wait until all background processes have finished. If a signal is
       # received, wait returns early, the trap runs, and the script continues
       # after the wait.
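The pattern above can be tried outside of Slurm with a minimal sketch like the following. It rests on assumptions: `sleep 30` is only a stand-in for the real workload (in the job script from the description this would be `srun sleep 1500 &`), the script sends USR2 to itself to imitate what "--signal=B:USR2@600" or "scancel --signal=USR2 --batch" would deliver, and the names `on_usr2`, `got_signal`, `work_pid` and `wait_status` are made up for illustration.

```shell
#!/bin/bash
# Sketch only: `sleep 30` stands in for the real workload, and the
# self-sent USR2 below stands in for the signal Slurm would deliver.

got_signal=0
on_usr2() {
    got_signal=1
    echo "trap"            # the copy-back commands would go here
}
trap on_usr2 USR2

sleep 30 &                 # workload runs in the background
work_pid=$!

# Demonstration only: deliver USR2 to this script after one second.
( sleep 1; kill -USR2 $$ ) &

# wait returns early with a status greater than 128 when a trapped
# signal arrives; the trap handler then runs right away.
wait_status=0
wait "$work_pid" || wait_status=$?

if [ "$got_signal" -eq 1 ]; then
    kill "$work_pid" 2>/dev/null || true   # stop the stand-in workload
    wait "$work_pid" 2>/dev/null || true   # reap it
fi

echo "signal handled: $got_signal"
```

With the workload in the foreground instead (no `&` and no `wait`), the handler would only run after `sleep 30` had finished, which matches the behavior reported in the description.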

More information can be found here:
http://mywiki.wooledge.org/SignalTrap#When_is_the

and here:
https://stackoverflow.com/questions/27694818/interrupt-sleep-in-bash-with-a-signal-trap
Comment 3 Marcus Boden 2019-05-14 04:55:05 MDT
Hey Jason,

thanks for the info, it works!

Best,
Marcus