Dear SchedMD,

we have a problem with signaling the main batch script, either with scancel or by setting a signal with "--signal=B:USR2@600".

What we want to achieve is to send the main job script a signal 10 minutes before the walltime is reached, so that it can copy all data from the local filesystems before the data is erased (an epilog script deletes all local files when a job finishes). So the main idea is to set "--signal=B:USR2@600" and define a trap in the script that copies the data back, for example to the user's home directory (https://info.gwdg.de/docs/doku.php?id=en:services:application_services:high_performance_computing:orca).

The problem is that the signal mostly does not arrive; I do not even get a signal when using "scancel --signal=USR2 --batch {JOBID}". I made a simple example:

---
#!/bin/sh
#SBATCH -A cramer
#SBATCH -p sa
#SBATCH --qos=short
#SBATCH -t 15:00
#SBATCH -o out.%J
#SBATCH -n 2
#SBATCH --ntasks-per-node=1
#SBATCH -J TRAP
###SBATCH --signal=B:USR2@600

#trap 'echo "trap"; srun -n ${SLURM_JOB_NUM_NODES} --ntasks-per-node=1 hostname; exit 12' SIGUSR2
trap 'echo "trap"; date' SIGUSR2

srun sleep 1500
echo "End of job reached"
---

After submitting the job, I send an scancel every 60 seconds:

tehlers@gwdu101:..tehlers/trap> while true; do scancel --signal=USR2 --batch 188199; sleep 60; done

The output (once the job has finished) is:

tehlers@gwdu101:..tehlers/trap> cat out.188199
slurmstepd: error: *** JOB 188199 ON sa029 CANCELLED AT 2019-05-09T12:37:37 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 188199.0 ON sa029 CANCELLED AT 2019-05-09T12:37:37 DUE TO TIME LIMIT ***

There is no trace of my "echo trap", so the trap was never executed. The same seems to be true for the "--signal" setting. I once managed to get the trap executed, but the same job did not work when I ran it a second time.

Am I doing something wrong? Or is there a problem with sending signals in Slurm 18.08.6-2?

Thanks
Tim
Hi Tim,

This question has been asked by other sites in the past. This is expected behavior: bash does not run a trap while a foreground command is still executing; it waits for that command to return before processing the signal.
https://groups.google.com/forum/#!searchin/slurm-users/wait%7Csort:date/slurm-users/MWo-TfqgNww/0YCZAcC1BAAJ

The workaround is to start the program in the background and then call wait:

sleep 1000 &  # <----- Your program goes here. It is important to run it in the background; otherwise bash will not process the signal until this command finishes.
wait          # <----- Waits until all background processes have finished. If a signal is received, wait returns early, the trap runs, and the script then continues to its end.

More information can be found here:
http://mywiki.wooledge.org/SignalTrap#When_is_the
and here:
https://stackoverflow.com/questions/27694818/interrupt-sleep-in-bash-with-a-signal-trap
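To make the background-plus-wait pattern concrete, here is a minimal sketch that can be tried without Slurm. It is only an illustration under some assumptions: `sleep 5` stands in for the real workload, the script sends USR2 to itself after one second (inside a job, Slurm or scancel would send it), and the names `on_usr2`, `got_signal`, `work_pid` and `status` are made up for this example.

```shell
#!/bin/sh
# Sketch: run the workload in the background so the shell can handle USR2
# while the workload is still running.

got_signal=0

# Hypothetical handler; a real job script would copy data back here.
on_usr2() {
    echo "trap: USR2 received"
    got_signal=1
}
trap on_usr2 USR2

# Stand-in workload. In a job script this would be e.g. "srun sleep 1500 &".
sleep 5 &
work_pid=$!

# Demo only: send USR2 to this script after 1 second.
# In a real job the signal comes from Slurm or scancel instead.
( sleep 1; kill -USR2 $$ ) &

# wait returns early with a status greater than 128 when a trapped signal
# arrives; the trap handler runs right after wait returns.
status=0
wait "$work_pid" || status=$?

if [ "$got_signal" -eq 1 ]; then
    echo "wait was interrupted (status $status); data would be copied here"
    # The workload is still running at this point; stop it and reap it.
    kill "$work_pid" 2>/dev/null || true
    wait "$work_pid" 2>/dev/null || true
fi

echo "End of script reached"
```

In an actual job script, the self-signal line would be dropped, the stand-in `sleep 5 &` replaced by the real `srun ... &` step, and the signal requested with `#SBATCH --signal=B:USR2@600` as in the original post. The sketch spells the signal `USR2` rather than `SIGUSR2` because that is the form POSIX specifies for `trap`; shells such as bash accept both.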
Hey Jason, thanks for the info, it works! Best, Marcus