Hi there,

We've got a query from a staff member who says:

> There are multiple jobs (e.g. 5946950 and 5811696) which were submitted
> through preempt qos on Perlmutter and were cancelled with state FAILED.
> Users are confident it's nothing wrong on their end. My understanding is
> that if a job is preempted, state should be PREEMPTED not FAILED.
> These "FAILED" jobs are also not requeued despite requesting --requeue.

I checked, and from what I see these two jobs were indeed preempted:

[2023-03-06T22:38:20.684] debug: setting 60 sec preemption grace time for JobId=5946950 to reclaim resources for JobId=5960852

and:

[2023-03-01T12:04:02.728] debug: setting 60 sec preemption grace time for JobId=5811696 to reclaim resources for JobId=5801057

Are there cases where this can cause a job to be recorded as FAILED rather than PREEMPTED? If so, is that why they weren't requeued?

I also see for the two of them:

[2023-03-01T12:04:28.190] _job_complete: JobId=5811696 WEXITSTATUS 143

and:

[2023-03-06T22:38:44.442] _job_complete: JobId=5946950 WEXITSTATUS 143

Could that play into this?

All the best,
Chris
Having a quick look with sacct:

csamuel@perlmutter-mgr:~> sacct -j 5811696 -o jobid%15,state,exitcode,derivedexitcode
          JobID      State ExitCode DerivedExitCode
--------------- ---------- -------- ---------------
        5811696     FAILED     15:0             0:0
  5811696.batch     FAILED     15:0
 5811696.extern  COMPLETED      0:0
      5811696.0  CANCELLED     0:15

and:

csamuel@perlmutter-mgr:~> sacct -j 5946950 -o jobid%15,state,exitcode,derivedexitcode
          JobID      State ExitCode DerivedExitCode
--------------- ---------- -------- ---------------
        5946950     FAILED     15:0            0:15
  5946950.batch     FAILED     15:0
 5946950.extern  COMPLETED      0:0
      5946950.0  CANCELLED     0:15

Looking at one of the pair I do see:

debug: _rpc_signal_tasks: sending signal 15 to all steps job 5811696 flag 4

I suspect I'd see identical for the other one.
Hi Chris,

I would need to look further into this. Could you share your slurm.conf as well as your slurmctld logs from 2023-03-06?

Thanks,
Albert
Hi Chris,

> I would need to look further into this.
> Could you share your slurm.conf as well as your slurmctld logs from
> 2023-03-06?

Could you attach them?

Thanks,
Albert
Hi Chris,

Is this still an issue? Could you attach the required info?

Thanks,
Albert
Hi Chris,

If this is OK with you I'm closing this ticket, but please don't hesitate to reopen it if you need further support.

Regards,
Albert
Hey Albert,

I'm so sorry, I didn't see any of the emails for this bug (they'll be lost in my email chaos, sorry).

I replicated this on a test system, with the preemptee being:

csamuel@muller:login01:~> srun -u -q debug_preempt -C cpu bash -c 'hostname; sleep 600'
nid001003
slurmstepd: error: *** STEP 383498.0 ON nid001003 CANCELLED AT 2023-08-11T18:12:52 ***
srun: error: nid001003: task 0: Terminated
srun: Terminating StepId=383498.0
srun: Force Terminated StepId=383498.0
csamuel@muller:login01:~> sacct -j 383498 -XPno elapsed,state
00:05:17|FAILED

and the preemptor being:

csamuel@muller:login01:~> salloc -w nid001003 -q interactive -C cpu
salloc: Pending job allocation 383499
salloc: job 383499 queued and waiting for resources
salloc: job 383499 has been allocated resources
salloc: Granted job allocation 383499
salloc: Waiting for resource configuration
salloc: Nodes nid001003 are ready for job
csamuel@nid001003:~>

What I have spotted is that _job_check_grace_internal() in src/interfaces/preempt.c calls job_signal() with the preempt option (the last argument) set to false, which I'm guessing should be set to true?

	/* send job warn signal always sends SIGCONT first */
	if (preempt_send_user_signal && job_ptr->warn_signal &&
	    !(job_ptr->warn_flags & WARN_SENT))
		send_job_warn_signal(job_ptr, true);
	else {
		job_signal(job_ptr, SIGCONT, 0, 0, 0);
		job_signal(job_ptr, SIGTERM, 0, 0, 0);
	}

and job_signal() uses it here:

	if (preempt)
		job_term_state = JOB_PREEMPTED;
	else
		job_term_state = JOB_CANCELLED;

That in turn calls deallocate_nodes() with the passed-in preempt flag, and that uses it here:

	if (timeout)
		agent_args->msg_type = REQUEST_KILL_TIMELIMIT;
	else if (preempted)
		agent_args->msg_type = REQUEST_KILL_PREEMPTED;
	else
		agent_args->msg_type = REQUEST_TERMINATE_JOB;

I'm not sure if there'll be other cases of this, but from a quick scan of src/interfaces/preempt.c everything else calls it with preempt set to true.

I will try a local patch to see if this helps.
All the best,
Chris
Created attachment 31724 [details]
Draft patch to fix up preempt signaling

Hi there,

This is what I'm going to test here at NERSC. Let me know if you've got any feedback!

All the best,
Chris
Hi there,

After adding my patch I didn't see a change in behaviour, so I added some debug logging so we could see what state was getting set by job_signal(), and that looked OK, so all I can assume is that the job state is getting clobbered later?

[2023-08-11T20:09:00.195] debug: _job_create: job 383605, time_limit: 10, time_min: 0, desc time_limit: 10, desc time_min: 4294967294
[2023-08-11T20:09:00.253] sched: _slurm_rpc_allocate_resources JobId=383605 NodeList=nid001003 usec=63939
[2023-08-11T20:09:01.359] debug: reserved ports 63002-63003 for JobId=383605 StepId=0
[2023-08-11T20:14:14.952] debug: setting 60 sec preemption grace time for JobId=383605 to reclaim resources for JobId=383606
[2023-08-11T20:14:14.952] debug: job_signal: setting job_term_state for JobId=383605 to JOB_PREEMPTED
[2023-08-11T20:14:14.952] job_signal: 18 of running JobId=383605 successful 0x400001
[2023-08-11T20:14:14.952] debug: job_signal: setting job_term_state for JobId=383605 to JOB_PREEMPTED
[2023-08-11T20:14:14.952] job_signal: 15 of running JobId=383605 successful 0x400001
[2023-08-11T20:14:15.962] _job_complete: JobId=383605 WTERMSIG 15
[2023-08-11T20:14:16.106] _job_complete: JobId=383605 done
[2023-08-11T20:14:16.106] debug: freed ports 63002-63003 for JobId=383605 StepId=0

Here's the state after:

csamuel@muller:login01:~> sacct -j 383605
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
383605             bash regular_m+     nstaff        256     FAILED     0:15
383605.exte+     extern                nstaff        256  COMPLETED      0:0
383605.0           bash                nstaff        256  CANCELLED     0:15
More experimentation has provided more clarity. The basic behaviour is:

* Only jobs that exceed their grace time and require Slurm to send SIGKILL get marked as preempted (if they get the warning signals and the job step aborts/exits, then there seems to be no link back to the fact that this was due to a signal sent for preemption).
* A `--requeue` job will only get requeued if marked as preempted, so if it exits early it won't get requeued.
* For batch jobs, only the job steps launched with `srun` seem to be sent the warning signals, not the batch script part.
* If you are using a bash script to handle the signals then you need to run the long-running part in the background with `&` and then `wait`, otherwise your signal handler won't run.

A number of these things are not documented, and I'm not sure how many of them are intentional - for instance, this means that any job that wants to do requeuing with preemption needs to launch what it cares about with srun, and ensure that after it's cleaned up it waits as long as possible so it gets marked as preempted; otherwise it won't requeue.

What really caught us out is that first point - to our thinking, any job that gets preempted should be marked as such. That both makes it explicit in the database that that is what happened, so users don't get confused, and also means that requeuing jobs don't need to waste compute time by waiting around to ensure they get SIGKILL'd in order for requeuing to work.

My patch appears unnecessary - a result of having been confused about when a job gets marked as preempted.
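As an aside, the trap/`&`/`wait` pattern from the fourth point can be sketched like this (a hypothetical script, not from any real job here; the self-`kill` line stands in for Slurm's warning signal so the sketch runs outside a job):

```shell
#!/bin/bash
# Sketch of a signal-aware batch script body. bash defers a trap while a
# foreground command runs, so the workload must be backgrounded with '&'
# and collected with the 'wait' builtin.
handled=0
cleanup() {
    echo "caught SIGTERM, checkpointing"
    handled=1   # a real script would checkpoint/clean up here
}
trap cleanup TERM

# Simulate Slurm's warning signal arriving after 1 second; in a real job
# this comes from slurmstepd, not from the script itself.
( sleep 1; kill -TERM $$ ) &

sleep 600 &                 # stands in for the long-running workload
workload=$!
wait "$workload" || true    # returns >128 when the signal arrives; the trap then runs
kill "$workload" 2>/dev/null || true
```

If the `sleep 600` were run in the foreground instead, the trap would not fire until it finished.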
(In reply to Chris Samuel (NERSC) from comment #9)
> More experimentation has provided more clarity. The basic behaviour is:
>
> * Only jobs that exceed their grace time and require Slurm to send SIGKILL
>   get marked as preempted (if they get the warning signals and the job step
>   aborts/exits then there seems no link back to the fact that was due to a
>   signal sent for preemption).

I agree that this is confusing. I'm checking the patch to see if it fixes this. We also need to look at send_job_warn_signal since that calls job_signal.

> * A `--requeue` job will only get requeued if marked as preempted, so if it
>   exits early it won't get requeued

This is correct.

https://slurm.schedmd.com/sbatch.html#OPT_requeue

```
Specifies that the batch job should be eligible for requeuing. The job may
be requeued explicitly by a system administrator, after node failure, or
upon preemption by a higher priority job.
```

Note that the job exiting is not a condition that causes a requeue.

> * For batch jobs only the job steps launched with `srun` seem to be sent the
>   warning signals, not the batch script part.

This is documented. See the slurm.conf man page:

https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime

```
The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to
provide notification of its imminent termination. This is followed by the
SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time.
This second set of signals is sent to both the tasks and the containing
batch script, if applicable.
```

> * If you are using a bash script to handle the signals then you need to run
>   the long-running part in the background with `&` and then `wait`,
>   otherwise your signal handler won't run

This is not Slurm; this is just how bash interprets traps. From man bash:

```
If bash is waiting for a command to complete and receives a signal for which
a trap has been set, the trap will not be executed until the command
completes. When bash is waiting for an asynchronous command via the wait
builtin, the reception of a signal for which a trap has been set will cause
the wait builtin to return immediately with an exit status greater than 128,
immediately after which the trap is executed.
```

> A number of these things are not documented,

The first point is not documented, but I think it is a bug. The last point is just how the shell behaves, not a Slurm thing. The other points are documented.

> and I'm not sure how many of
> them are intentional - for instance this means that any job that wants to do
> requeuing with preemption needs to launch what it cares about with srun, and
> ensure that after it's cleaned up it waits as long as possible so it gets
> marked as preempted; otherwise it won't requeue.

Assuming the first point is a bug that gets fixed, is this a concern anymore?

> What really caught us out is that first point - to our thinking, any job that
> gets preempted should be marked as such. That both makes it explicit in the
> database that that is what happened, so users don't get confused, and also
> means that requeuing jobs don't need to waste compute time by waiting around
> to ensure they get SIGKILL'd in order for requeuing to work.

I think the first point is likely a bug.

> My patch appears unnecessary, a result of having been confused about when a
> job gets marked as preempted.

What do you mean that your patch is unnecessary?
Hi Marshall!

(In reply to Marshall Garey from comment #11)
> (In reply to Chris Samuel (NERSC) from comment #9)
> > More experimentation has provided more clarity. The basic behaviour is:
> >
> > * Only jobs that exceed their grace time and require Slurm to send SIGKILL
> >   get marked as preempted (if they get the warning signals and the job step
> >   aborts/exits then there seems no link back to the fact that was due to a
> >   signal sent for preemption).
>
> I agree that this is confusing. I'm checking the patch to see if it fixes
> this. We also need to look at send_job_warn_signal since that calls
> job_signal.

I think from what I saw that patch won't immediately help: whilst my patch will cause `job_term_state` to get set correctly, that state only ever gets applied to the job when SIGKILL is sent. I was wondering if it might make sense to apply the JOB_PREEMPTED state to the job when the warning signals get sent (as that's really when it's committed to being preempted), but I am not sure if that would have unexpected consequences.

> > * A `--requeue` job will only get requeued if marked as preempted, so if it
> > exits early it won't get requeued
>
> This is correct.
>
> https://slurm.schedmd.com/sbatch.html#OPT_requeue
>
> ```
> Specifies that the batch job should be eligible for requeuing. The job may
> be requeued explicitly by a system administrator, after node failure, or
> upon preemption by a higher priority job.
> ```
>
> Note that the job exiting is not a condition that causes a requeue.

But that's not what the user sees - the user sees their job killed (because of preemption), but it doesn't get requeued, because they don't know to rig things so it hangs around long enough for SIGKILL to get sent and for the JOB_PREEMPTED state to get applied to their job.
To my mind they're getting penalised for good behaviour in freeing the node up as quickly as possible, whereas if they either ignored the signals altogether, or ensured they'd "sleep 1000000" at the end of their batch script so they'd get that SIGKILL, _then_ they'd get requeued.

My reading of the documentation is that if a requeueable job gets killed by the preemption code (no matter which signal causes it to happen) it should get requeued - does that help explain what I'm getting at?

> > * For batch jobs only the job steps launched with `srun` seem to be sent the
> > warning signals, not the batch script part.
>
> This is documented. See the slurm.conf man page:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime
>
> ```
> The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to
> provide notification of its imminent termination. This is followed by the
> SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time.
> This second set of signals is sent to both the tasks and the containing
> batch script, if applicable.
> ```

Hmm, OK, that wasn't clear to me - I think I translated "tasks" into "processes" in my brain. I think if it said "job steps" then that might be clearer to people.

Is there a way to get the batch script sent the initial warning? There are plenty of people out there who will want those signals but won't be using srun (single- or sub-node jobs, for instance, doing work that doesn't need srun).

[...]

> This is not Slurm.

D'oh, my bad! I was copying and pasting from an internal issue to help our consulting staff understand what's needed, and brought in too much!

We're getting a lot of queries about preemption (I've gone through slurmctld logs for 3 jobs this afternoon to try and answer whether jobs were preempted or not because of user complaints, and 2 of those were preempted and exited before SIGKILL).

> Assuming the first point is a bug that gets fixed, is this a concern anymore?
Nope - I think as long as the behaviour that's intended for preempted jobs (requeuing, being marked as JOB_PREEMPTED in the database) happens when a job exits on those initial signals, then that solves it!

[...]

> > My patch appears unnecessary, a result of having been confused about when a
> > job gets marked as preempted.
>
> What do you mean that your patch is unnecessary?

That's bad wording on my part - I should have said it's not a solution, in that it doesn't change the fact that (from what I can see) the job state only gets set to JOB_PREEMPTED when SIGKILL is sent, i.e. in job_signal() in src/slurmctld/job_mgr.c:

	if (preempt)
		job_term_state = JOB_PREEMPTED;
	else
		job_term_state = JOB_CANCELLED;
	if (IS_JOB_SUSPENDED(job_ptr) && (signal == SIGKILL)) {
		last_job_update = now;
		job_ptr->end_time = job_ptr->suspend_time;
		job_ptr->tot_sus_time += difftime(now, job_ptr->suspend_time);
		job_ptr->job_state = job_term_state | JOB_COMPLETING;

So I think my patch might still be needed, it just won't change how that bit of code works.

Thanks again for the reply, Marshall!

All the best,
Chris
(In reply to Chris Samuel (NERSC) from comment #12)
> My reading of the documentation is that if a requeueable job gets killed by the
> preemption code (no matter which signal causes it to happen) it should get
> requeued, does that help explain what I'm getting at?

I agree - that is also how I read this line in the docs about requeueing:

> upon preemption by a higher priority job.

> Is there a way to get the batch script sent the initial warning? There are
> plenty of people out there who will want those signals but won't be using srun
> (single- or sub-node jobs, for instance, doing work that doesn't need srun).

Use the --signal option in sbatch and set PreemptParameters=send_user_signal in slurm.conf.

https://slurm.schedmd.com/sbatch.html#OPT_signal
https://slurm.schedmd.com/slurm.conf.html#OPT_send_user_signal
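Concretely, that combination might look something like the following (the signal choice and 600-second lead time are illustrative, not a recommendation):

```
# slurm.conf: deliver the job's requested --signal on preemption as well
PreemptParameters=send_user_signal

# In the job script: the B: prefix sends the signal to the batch shell
# itself, 600 seconds before the (possibly preemption-shortened) end time
#SBATCH --signal=B:TERM@600
#SBATCH --requeue
```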
(In reply to Marshall Garey from comment #13)
> I agree - that is also how I read this line in the docs about requeueing:
>
> > upon preemption by a higher priority job.

Great, thank you!

> > Is there a way to get the batch script sent the initial warning? There are
> > plenty of people out there who will want those signals but won't be using srun
> > (single- or sub-node jobs, for instance, doing work that doesn't need srun).
>
> Use the --signal option in sbatch and set PreemptParameters=send_user_signal
> in slurm.conf.
>
> https://slurm.schedmd.com/sbatch.html#OPT_signal
> https://slurm.schedmd.com/slurm.conf.html#OPT_send_user_signal

Thanks! I did experiment with the `send_user_signal` option, but was a bit wary of making that change, as the user then can't tell preemption apart from their normal end time approaching. That, I think, is important: if they've asked for a 10 minute warning and they're getting preempted, they will (on our systems) end up with just 1 minute instead, and they won't be able to tell why without asking Slurm.

All the best,
Chris
RE comment 14: I can see why --signal plus send_user_signal are not always ideal. If you're interested in having an option for the batch script to receive the SIGCONT and SIGTERM signals at the beginning of GraceTime, will you open a new bug for this? It probably would not be too hard to implement, though all the usual caveats of enhancement requests apply.
(In reply to Marshall Garey from comment #15)
> RE comment 14:
>
> I can see why --signal plus send_user_signal are not always ideal. If you're
> interested in having an option for the batch script to receive the SIGCONT
> and SIGTERM signals at the beginning of GraceTime, will you open a new bug
> for this? It probably would not be too hard to implement, though all the
> usual caveats of enhancement requests apply.

Will do, thanks for the suggestion, and I understand. :-)
Chris,

As I've been studying this issue it has become clear to me that the fix is not simple. I think your best bet for now is to have users just make the job wait for the duration of GraceTime if their job gets preempted.
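A rough sketch of that workaround in a batch script (all names are illustrative; the grace period is shortened from a realistic GraceTime like 60s, and the self-`kill` simulates the warning signal so the sketch can run outside Slurm):

```shell
#!/bin/bash
# Workaround sketch: after cleaning up, deliberately idle out the grace
# period so slurmctld has to escalate to SIGKILL - which is the path that
# records the job as PREEMPTED and therefore honours --requeue.
grace=2          # illustrative only; match your site's GraceTime (e.g. 60)
preempt_seen=0
hold_for_sigkill() {
    echo "warning signal received, checkpointing"
    preempt_seen=1
    # Do NOT exit here: sleep out the remaining grace time instead, so the
    # job survives until SIGKILL rather than being recorded FAILED.
    sleep "$grace"
}
trap hold_for_sigkill TERM

( sleep 1; kill -TERM $$ ) &   # stand-in for Slurm's warning signal
sleep 600 &                    # stand-in for the real workload
workload=$!
wait "$workload" || true
kill "$workload" 2>/dev/null || true
```

The cost, as noted above, is wasted compute time between the checkpoint finishing and SIGKILL arriving.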
Chris,

The fix is merged to our public repository (a689104d15). It will be released with Slurm 24.05.

I'll go ahead and close the case. Should you have any questions, please reopen.

Cheers,
Marcin
Hi Marcin,

Thanks so much for this! Are you able to say briefly what was done, please? We're still stuck on 23.02 for the foreseeable future, due to issues around DB size and full-table-scan queries that we're trying to work our way around.

All the best,
Chris
Oh sorry I missed the commit ID reference in the comment! All good. Thanks again!
*** Ticket 21247 has been marked as a duplicate of this ticket. ***