Ticket 16263 - Why do some jobs that get preempted get listed as FAILED in Slurmdbd instead of PREEMPTED?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 23.02.4
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
Duplicates: 21247
Blocks: 19685
Reported: 2023-03-13 18:37 MDT by Chris Samuel (NERSC)
Modified: 2024-10-24 00:27 MDT (History)

Site: NERSC
Version Fixed: 24.05.0rc1


Attachments
Draft patch to fix up preempt signaling (531 bytes, patch)
2023-08-11 12:54 MDT, Chris Samuel (NERSC)

Description Chris Samuel (NERSC) 2023-03-13 18:37:50 MDT
Hi there,

We've got a query from a staff member who says:

> There are multiple jobs (e.g. 5946950 and 5811696) which were submitted
> through preempt qos on Perlmutter and were cancelled with state FAILED.
> Users are confident it's nothing wrong on their end. My understanding is
> that if a job is preempted, state should be PREEMPTED not FAILED.
> These “FAILED” jobs are also not requeued despite requesting --requeue.

I checked and from what I see these two jobs were indeed preempted:

[2023-03-06T22:38:20.684] debug:  setting 60 sec preemption grace time for JobId=5946950 to reclaim resources for JobId=5960852

and:

[2023-03-01T12:04:02.728] debug:  setting 60 sec preemption grace time for JobId=5811696 to reclaim resources for JobId=5801057

Are there cases where this can cause a job to be recorded as failed and not preempted? If so, is that why they weren't requeued?

I also see for the two of them:

[2023-03-01T12:04:28.190] _job_complete: JobId=5811696 WEXITSTATUS 143

and:

[2023-03-06T22:38:44.442] _job_complete: JobId=5946950 WEXITSTATUS 143

Could that play into this?

All the best,
Chris
Comment 1 Chris Samuel (NERSC) 2023-03-13 19:06:03 MDT
Having a quick look with sacct:

csamuel@perlmutter-mgr:~> sacct -j 5811696 -o jobid%15,state,exitcode,derivedexitcode
          JobID      State ExitCode DerivedExitCode 
--------------- ---------- -------- --------------- 
        5811696     FAILED     15:0             0:0 
  5811696.batch     FAILED     15:0                 
 5811696.extern  COMPLETED      0:0                 
      5811696.0  CANCELLED     0:15                 

and:

csamuel@perlmutter-mgr:~> sacct -j 5946950 -o jobid%15,state,exitcode,derivedexitcode
          JobID      State ExitCode DerivedExitCode 
--------------- ---------- -------- --------------- 
        5946950     FAILED     15:0            0:15 
  5946950.batch     FAILED     15:0                 
 5946950.extern  COMPLETED      0:0                 
      5946950.0  CANCELLED     0:15                 


Looking at one of the pair I do see:

debug:  _rpc_signal_tasks: sending signal 15 to all steps job 5811696 flag 4

I suspect I'd see identical for the other one.
Comment 2 Albert Gil 2023-03-20 04:09:42 MDT
Hi Chris,

I would need to look further into this.
Could you share your slurm.conf as well as your logs from slurmctld of 2023-03-06?

Thanks,
Albert
Comment 3 Albert Gil 2023-03-28 09:12:20 MDT
Hi Chris,

> I would need to look further into this.
> Could you share your slurm.conf as well as your logs from slurmctld of
> 2023-03-06?

Could you attach them?

Thanks,
Albert
Comment 4 Albert Gil 2023-05-16 08:15:53 MDT
Hi Chris,

Is this still an issue?
Could you attach the required info?

Thanks,
Albert
Comment 5 Albert Gil 2023-06-05 07:38:10 MDT
Hi Chris,

If this is ok for you I'm closing this ticket, but please don't hesitate to reopen it if you need further support.

Regards,
Albert
Comment 6 Chris Samuel (NERSC) 2023-08-11 12:44:36 MDT
Hey Albert,

I'm so sorry, I didn't see any of the emails for this bug (they'll be lost in my email chaos, sorry).

I replicated this on a test system with the preemptee being:

csamuel@muller:login01:~> srun -u -q debug_preempt -C cpu bash -c 'hostname; sleep 600'
nid001003
slurmstepd: error: *** STEP 383498.0 ON nid001003 CANCELLED AT 2023-08-11T18:12:52 ***
srun: error: nid001003: task 0: Terminated
srun: Terminating StepId=383498.0
srun: Force Terminated StepId=383498.0

csamuel@muller:login01:~> sacct -j 383498 -XPno elapsed,state
00:05:17|FAILED

and the preemptor being:

csamuel@muller:login01:~> salloc -w nid001003 -q interactive -C cpu 
salloc: Pending job allocation 383499
salloc: job 383499 queued and waiting for resources
salloc: job 383499 has been allocated resources
salloc: Granted job allocation 383499
salloc: Waiting for resource configuration
salloc: Nodes nid001003 are ready for job
csamuel@nid001003:~>

What I have spotted is that in _job_check_grace_internal() in src/interfaces/preempt.c it calls job_signal() with the preempt option (the last one) set to false, which I'm guessing should be set to true?

                /* send job warn signal always sends SIGCONT first */
                if (preempt_send_user_signal && job_ptr->warn_signal &&
                    !(job_ptr->warn_flags & WARN_SENT))
                        send_job_warn_signal(job_ptr, true);
                else {
                        job_signal(job_ptr, SIGCONT, 0, 0, 0);
                        job_signal(job_ptr, SIGTERM, 0, 0, 0);
                }

and job_signal() uses it here:

        if (preempt)
                job_term_state = JOB_PREEMPTED;
        else
                job_term_state = JOB_CANCELLED;

That in turn calls deallocate_nodes() with the passed in preempt flag, and that uses it here:

        if (timeout)
                agent_args->msg_type = REQUEST_KILL_TIMELIMIT;
        else if (preempted)
                agent_args->msg_type = REQUEST_KILL_PREEMPTED;
        else
                agent_args->msg_type = REQUEST_TERMINATE_JOB;


I'm not sure if there'll be other cases of this, but from a quick scan of src/interfaces/preempt.c everything else calls it with preempt set to true.

I will try a local patch to see if this helps.

All the best,
Chris
Comment 7 Chris Samuel (NERSC) 2023-08-11 12:54:13 MDT
Created attachment 31724 [details]
Draft patch to fix up preempt signaling

Hi there,

This is what I'm going to test here at NERSC.

Let me know if you've got any feedback!

All the best,
Chris
Comment 8 Chris Samuel (NERSC) 2023-08-11 14:36:08 MDT
Hi there,

After adding my patch I didn't see a change in behaviour, so I added some debug logging to see what state was getting set by job_signal(). That looked OK, so all I can assume is that the job state is getting clobbered later.

[2023-08-11T20:09:00.195] debug:  _job_create: job 383605, time_limit: 10, time_min: 0, desc time_limit: 10, desc time_min: 4294967294
[2023-08-11T20:09:00.253] sched: _slurm_rpc_allocate_resources JobId=383605 NodeList=nid001003 usec=63939
[2023-08-11T20:09:01.359] debug:  reserved ports 63002-63003 for JobId=383605 StepId=0
[2023-08-11T20:14:14.952] debug:  setting 60 sec preemption grace time for JobId=383605 to reclaim resources for JobId=383606
[2023-08-11T20:14:14.952] debug:  job_signal: setting job_term_state for JobId=383605 to JOB_PREEMPTED
[2023-08-11T20:14:14.952] job_signal: 18 of running JobId=383605 successful 0x400001
[2023-08-11T20:14:14.952] debug:  job_signal: setting job_term_state for JobId=383605 to JOB_PREEMPTED
[2023-08-11T20:14:14.952] job_signal: 15 of running JobId=383605 successful 0x400001
[2023-08-11T20:14:15.962] _job_complete: JobId=383605 WTERMSIG 15
[2023-08-11T20:14:16.106] _job_complete: JobId=383605 done
[2023-08-11T20:14:16.106] debug:  freed ports 63002-63003 for JobId=383605 StepId=0

Here's the state after:

csamuel@muller:login01:~> sacct -j 383605
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
383605             bash regular_m+     nstaff        256     FAILED     0:15 
383605.exte+     extern                nstaff        256  COMPLETED      0:0 
383605.0           bash                nstaff        256  CANCELLED     0:15
Comment 9 Chris Samuel (NERSC) 2023-08-12 01:32:26 MDT
More experimentation has provided more clarity; the basic behaviour is:

* Only jobs that exceed their grace time and require Slurm to send SIGKILL get marked as preempted (if they get the warning signals and the job step aborts/exits, there seems to be no link back to the fact that it was due to a signal sent for preemption).

* A `--requeue` job will only get requeued if marked as preempted, so if it exits early it won't get requeued.

* For batch jobs only the job steps launched with `srun` seem to be sent the warning signals, not the batch script part.

* If you are using a bash script to handle the signals then you need to run the long-running part in the background with `&` and then `wait`, otherwise your signal handler won't run.
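That last bash caveat can be demonstrated without Slurm at all. Here is a minimal sketch (the timings are arbitrary, and the inner script simply stands in for a batch script; the `kill -TERM` stands in for the preemption warning signal):

```shell
# Demonstration of the bash trap caveat: a trap set for SIGTERM will not run
# while a foreground command is executing, but fires immediately when bash is
# blocked in the `wait` builtin on a background job.
out=$(bash -c '
  on_term() { echo "trap ran"; exit 0; }
  trap on_term TERM
  sleep 10 > /dev/null 2>&1 &                    # the long-running work, backgrounded
  work=$!
  ( sleep 1; kill -TERM $$ ) > /dev/null 2>&1 &  # stand-in for the preemption warning
  wait "$work"                                   # TERM arrives here; on_term runs at once
  echo "never reached"
')
echo "$out"
```

If the `sleep 10` were run in the foreground instead, the trap would not fire until the sleep completed, long after the grace period had expired.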

A number of these things are not documented, and I'm not sure how many of them are intentional. For instance, this means that any job that wants requeuing with preemption needs to launch the work it cares about with srun and ensure that, after cleaning up, it waits as long as possible so it gets marked as preempted; otherwise it won't requeue.

What really caught us out is that first point: to our thinking, any job that gets preempted should be marked as such. That both makes it explicit in the database what has happened, so users don't get confused, and also means that requeuing jobs don't need to waste compute time hanging around just to be SIGKILL'd.

My patch appears unnecessary, a result of having been confused about when a job gets marked as preempted.
Comment 11 Marshall Garey 2023-08-16 16:39:21 MDT
(In reply to Chris Samuel (NERSC) from comment #9)
> More experimentation has provided more clarity, the basic behaviour is:
> 
> * Only jobs that exceed their grace time and require Slurm to send SIGKILL
> get marked as preempted (if they get the warning signals and the job step
> aborts/exits then there seems no link back to the fact that was due to a
> signal sent for preemption).


I agree that this is confusing. I'm checking the patch to see if it fixes this. We also need to look at send_job_warn_signal since that calls job_signal.


> * A `--requeue` job will only get requeued if marked as preempted, so if it
> exits early it won't get requeued


This is correct.

https://slurm.schedmd.com/sbatch.html#OPT_requeue

```
Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job.
```

Note that the job exiting is not a condition that causes a requeue.


> * For batch jobs only the job steps launched with `srun` seem to be sent the
> warning signals, not the batch script part.


This is documented. See the slurm.conf man page:

https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime

```
The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time. This second set of signals is sent to both the tasks and the containing batch script, if applicable.
```



> * If you are using a bash script to handle the signals then you need to run
> the long running part into the background with `&` and then `wait` after
> otherwise your signal handler won't run

This is not Slurm. This is just how bash interprets traps. From man bash:
```
       If bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes.  When bash is waiting for an asynchronous command via the wait builtin, the reception of a  signal  for
       which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.
```


> A number of these things are not documented,

The first point is not documented, but I think it is a bug.
The last point is just how the shell behaves, not a Slurm thing.
The other points are documented.


> and I'm not sure how many of
> them are intentional - for instance this means that any job that wants to do
> requeuing with preemption needs to launch what they care about with srun and
> ensure that after they've cleaned up they wait as long as possible to ensure
> they get marked as preempted, otherwise they won't requeue.


Assuming the first point is a bug that gets fixed, is this a concern anymore?


> What really caught us out is that first point - to our thinking any job that
> gets preempted should be marked as such. That both makes it explicit in the
> database that is what has happened so users don't get confused, and also
> means that requeuing jobs don't need to waste compute time by waiting around
> to ensure they have to get SIGKILL'd in order to work.


I think the first point is likely a bug.


> My patch appears unnecessary, a result of having been confused about when a
> job gets marked as preempted.

What do you mean that your patch is unnecessary?
Comment 12 Chris Samuel (NERSC) 2023-08-17 18:20:49 MDT
Hi Marshall!

(In reply to Marshall Garey from comment #11)

> (In reply to Chris Samuel (NERSC) from comment #9)

> > More experimentation has provided more clarity, the basic behaviour is:
> > 
> > * Only jobs that exceed their grace time and require Slurm to send SIGKILL
> > get marked as preempted (if they get the warning signals and the job step
> > aborts/exits then there seems no link back to the fact that was due to a
> > signal sent for preemption).
> 
> I agree that this is confusing. I'm checking the patch to see if it fixes
> this. We also need to look at send_job_warn_signal since that calls
> job_signal.

I think from what I saw that patch won't immediately help: whilst my patch will cause `job_term_state` to get set correctly, that state only ever gets applied to the job when SIGKILL is sent.

I was wondering if it might make sense to apply that JOB_PREEMPTED state to the job when the warning signals get sent (as that's really when it's committed to being preempted) but I am not sure if that would have unexpected consequences.

> > * A `--requeue` job will only get requeued if marked as preempted, so if it
> > exits early it won't get requeued
> 
> This is correct.
> 
> https://slurm.schedmd.com/sbatch.html#OPT_requeue
> 
> ```
> Specifies that the batch job should be eligible for requeuing. The job may
> be requeued explicitly by a system administrator, after node failure, or
> upon preemption by a higher priority job.
> ```
> 
> Note that the job exiting is not a condition that causes a requeue.

But that's not what the user sees - the user sees their job is killed (because of preemption), but they don't get requeued because they don't know to rig things so they hang around long enough for SIGKILL to get sent and for the JOB_PREEMPTED state to get applied to their job.

To my mind they're getting penalised for good behaviour and freeing the node up as quickly as possible, whereas if they either ignored the signals altogether or ensured they'd "sleep 1000000" at the end of their batch script so they'd get that SIGKILL _then_ they'll get requeued.

My reading of the documentation is that if a requeueable job gets killed by the preemption code (no matter which signal causes it to happen) it should get requeued, does that help explain what I'm getting at?

> > * For batch jobs only the job steps launched with `srun` seem to be sent the
> > warning signals, not the batch script part.
> 
> 
> This is documented. See the slurm.conf man page:
> 
> https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime
> 
> ```
> The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to
> provide notification of its imminent termination. This is followed by the
> SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time.
> This second set of signals is sent to both the tasks and the containing
> batch script, if applicable.
> ```

Hmm, OK, that wasn't clear to me; I think I translated "tasks" into "processes" in my brain. I think if it said "job steps" it might be clearer to people.

Is there a way to get that batch script sent the initial warning? There's plenty of people out there who will want those signals but won't be using srun (single- or sub-node jobs, for instance, doing work that doesn't need srun).

[...]
> This is not Slurm.

D'oh, my bad! I was copying and pasting from an internal issue to help our consulting staff understand what's needed and brought in too much!  We're getting a lot of queries about preemption (I've gone through slurmctld logs for three jobs this afternoon to answer user complaints about whether jobs were preempted or not, and two of those were preempted and exited before SIGKILL).

> Assuming the first point is a bug that gets fixed, is this a concern anymore?

Nope, I think as long as the behaviour that's intended for preempted jobs happens if a job exits on those initial signals (requeuing, marked as JOB_PREEMPTED in the database) then that solves that!

[...]
> > My patch appears unnecessary, a result of having been confused about when a
> > job gets marked as preempted.
> 
> What do you mean that your patch is unnecessary?

That's bad wording on my part, I should have said it's not a solution in that it doesn't change the fact that (from what I can see) the job state only gets set to being JOB_PREEMPTED when SIGKILL is sent, ie in job_signal() in src/slurmctld/job_mgr.c:

        if (preempt)
                job_term_state = JOB_PREEMPTED;
        else
                job_term_state = JOB_CANCELLED;
        if (IS_JOB_SUSPENDED(job_ptr) && (signal == SIGKILL)) {
                last_job_update         = now;
                job_ptr->end_time       = job_ptr->suspend_time;
                job_ptr->tot_sus_time  += difftime(now, job_ptr->suspend_time);
                job_ptr->job_state      = job_term_state | JOB_COMPLETING;

So I think my patch might still be needed, just won't change how that above bit of code works.

Thanks again for the reply Marshall!

All the best,
Chris
Comment 13 Marshall Garey 2023-08-21 18:12:59 MDT
(In reply to Chris Samuel (NERSC) from comment #12)
> My reading of the documentation is that if a requeueable job gets killed by the
> preemption code (no matter which signal causes it to happen) it should get
> requeued, does that help explain what I'm getting at?

I agree, that is also how I read this line in the docs about requeueing:

> upon preemption by a higher priority job.




> Is there a way to get that batch script sent the initial warning? There's
> plenty of people out there who will want those signals but won't be using srun
> (single or sub node jobs for instance doing work that doesn't need srun).

Use the --signal option in sbatch and set PreemptParameters=send_user_signal in slurm.conf.

https://slurm.schedmd.com/sbatch.html#OPT_signal
https://slurm.schedmd.com/slurm.conf.html#OPT_send_user_signal
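For reference, a sketch of what that combination might look like in a job script (the script body and `./work` are hypothetical illustrations; the two options themselves are the documented `--signal` and `send_user_signal` from the links above):

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@300   # B: = deliver SIGUSR1 to the batch shell, 5 min early
#SBATCH --requeue

# slurm.conf must also contain:
#   PreemptParameters=send_user_signal
# otherwise --signal only fires ahead of the time limit, not on preemption.

trap 'echo "warning signal received, cleaning up"' USR1

srun ./work &   # hypothetical workload
wait
```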
Comment 14 Chris Samuel (NERSC) 2023-08-21 18:23:08 MDT
(In reply to Marshall Garey from comment #13)

> I agree, that is also how I read this line in the docs about requeing:
> 
> > upon preemption by a higher priority job.

Great, thank you!

> > Is there a way to get that batch script sent the initial warning? There's
> plenty of people out there who will want those signals but won't be using srun
> > (single or sub node jobs for instance doing work that doesn't need srun).
> 
> Use the --signal option in sbatch and set PreemptParameters=send_user_signal
> in slurm.conf.
> 
> https://slurm.schedmd.com/sbatch.html#OPT_signal
> https://slurm.schedmd.com/slurm.conf.html#OPT_send_user_signal

Thanks! I did experiment with the `send_user_signal` option but was a bit wary of making that change, as the user then can't tell preemption apart from a normal end-of-time warning. I think that's important: if they've asked for a 10-minute warning and they're getting preempted, they will (on our systems) end up with just 1 minute instead, and they won't be able to tell why without asking Slurm.

All the best,
Chris
Comment 15 Marshall Garey 2023-08-21 18:49:47 MDT
RE comment 14:

I can see why --signal plus send_user_signal are not always ideal. If you're interested in having an option for the batch script to receive the SIGCONT and SIGTERM signals at the beginning of GraceTime, will you open a new bug for this? It probably would not be too hard to implement, though all the usual caveats of enhancement requests apply.
Comment 16 Chris Samuel (NERSC) 2023-08-21 18:54:22 MDT
(In reply to Marshall Garey from comment #15)

> RE comment 14:
> 
> I can see why --signal plus send_user_signal are not always ideal. If you're
> interested in having an option for the batch script to receive the SIGCONT
> and SIGTERM signals at the beginning of GraceTime, will you open a new bug
> for this? It probably would not be too hard to implement, though all the
> usual caveats of enhancement requests apply.

Will do, thanks for the suggestion and I understand. :-)
Comment 20 Marshall Garey 2023-08-24 12:52:57 MDT
Chris, as I've been studying this issue it has become clear to me that the fix is not simple. I think your best bet for now is to have users just make the job wait for the duration of GraceTime if their job gets preempted.
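On the unpatched versions discussed above, that workaround might look something like this sketch (the GraceTime value, QOS name, and `save_checkpoint`/`work` commands are assumptions for illustration; the key point is that the handler outlives the grace period so that the final SIGKILL, rather than an early exit, ends the job):

```shell
#!/bin/bash
#SBATCH --requeue
#SBATCH -q preempt            # hypothetical preemptable QOS

# On the preemption warning SIGTERM: clean up, then deliberately wait out
# GraceTime (assumed 60 s here) so that Slurm's final SIGKILL is what ends
# the job and it is recorded as PREEMPTED (and therefore requeued).
on_warn() {
    ./save_checkpoint         # hypothetical cleanup step
    sleep 120                 # outlive the 60 s grace period
}
trap on_warn TERM

srun ./work &                 # hypothetical workload; backgrounded so the trap can fire
wait
```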
Comment 30 Marcin Stolarek 2024-04-29 23:50:15 MDT
Chris,

The fix is merged into our public repository (commit a689104d15). It will be released with Slurm 24.05.

I'll go ahead and close the case. Should you have any questions please reopen.

Cheers,
Marcin
Comment 31 Chris Samuel (NERSC) 2024-04-30 13:01:49 MDT
Hi Marcin,

Thanks so much for this! Are you able to say briefly what was done please?

We're still stuck on 23.02 for the foreseeable future due to issues around DB size & full table scan queries that we're trying to work our way around.

All the best,
Chris
Comment 32 Chris Samuel (NERSC) 2024-04-30 13:03:15 MDT
Oh sorry I missed the commit ID reference in the comment! All good.

Thanks again!
Comment 33 Joel Criado 2024-10-24 00:27:56 MDT
*** Ticket 21247 has been marked as a duplicate of this ticket. ***