| Summary: | Reservation with flags=ignore_jobs causes jobs to block with (ReqNodeNotAvail, Reserved for maintenance) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | reservations | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | OPEN | QA Contact: | |
| Severity: | 5 - Enhancement | Priority: | --- |
| Version: | 20.11.5 | Hardware: | Linux |
| OS: | Linux | Site: | DTU Physics |
Description
Ole.H.Nielsen@fysik.dtu.dk 2021-04-19 03:06:29 MDT
Comment 1 (Ben Roberts)

Hi Ole,

You've got the right idea for the reservation, but it looks like you're just missing a couple of flags that should get it to do what you want. If you add the 'FLEX' flag, jobs that qualify for the reservation are allowed to start before the reservation begins and continue running after it starts, rather than the default behavior of waiting until they can start during the time of the reservation. Another flag you would want to add is the 'MAGNETIC' flag. This makes it so that any job that qualifies for the reservation is allowed to run in that reservation without having requested it at submit time.

Here's an example of how it would look with these flags added to what you were already doing:

$ scontrol create reservation reservationname=exclude_account starttime=12:10:00 duration=30:00 flags=ignore_jobs,magnetic,flex nodes=ALL accounts=-sub1
Reservation created: exclude_account

$ scontrol show res
ReservationName=exclude_account StartTime=2021-04-19T12:10:00 EndTime=2021-04-19T12:40:00 Duration=00:30:00
   Nodes=kitt,node[01-18] NodeCnt=19 CoreCnt=456 Features=(null) PartitionName=(null)
   Flags=FLEX,IGNORE_JOBS,SPEC_NODES,ALL_NODES,MAGNETIC TRES=cpu=456
   Users=(null) Groups=(null) Accounts=-sub1 Licenses=(null) State=INACTIVE
   BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

I submit one job to the account that is excluded (sub1) and another to an account that is able to run (sub2):

$ sbatch -N1 -t10:00 -Asub1 --wrap='srun sleep 600'
Submitted batch job 26216
$ sbatch -N1 -t10:00 -Asub2 --wrap='srun sleep 600'
Submitted batch job 26217

The job requesting the 'sub2' account is able to start while the 'sub1' job is held:

$ squeue
 JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
 26216     debug  wrap   ben PD  0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
 26217     debug  wrap   ben  R  0:02      1 node01

Let me know if you have any questions about this or if you don't see the same behavior.

Thanks,
Ben
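For reference, a reservation created this way can be inspected, adjusted, and removed with the standard scontrol subcommands; a minimal sketch, reusing the reservation name from the example above:

# Check the reservation; State flips from INACTIVE to ACTIVE at StartTime
$ scontrol show reservation exclude_account

# Adjust the window in place rather than deleting and recreating it
$ scontrol update reservationname=exclude_account duration=60:00

# Remove the reservation once the maintenance work is done
$ scontrol delete reservationname=exclude_account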
Comment 2 (Ole.H.Nielsen)

Hi Ben,

Thanks for the useful suggestions of flags=flex,magnetic. I just now created a new reservation about 1 hour into the future:

$ scontrol create reservation starttime=11:00:00 duration=1:00:00 flags=ignore_jobs,magnetic,flex ReservationName=migrate_ecs nodes=ALL Accounts=-ecsstud
Reservation created: migrate_ecs

$ scontrol show reservation
ReservationName=migrate_ecs StartTime=2021-04-20T11:00:00 EndTime=2021-04-20T12:00:00 Duration=01:00:00
   Nodes=a[001-128],b[001-012],c[001-196],d[001-019,021-033,035-054,056-068],g[001-021,024-066,068-110],h[001-002],i[004-050],s[001-004],x[001-192]
   NodeCnt=753 CoreCnt=21224 Features=(null) PartitionName=(null)
   Flags=FLEX,IGNORE_JOBS,SPEC_NODES,ALL_NODES,MAGNETIC TRES=cpu=21384
   Users=(null) Groups=(null) Accounts=-ecsstud Licenses=(null) State=INACTIVE
   BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

Unfortunately, this still causes queued jobs to get the incorrect state (ReqNodeNotAvail, Reserved for maintenance):

$ squeue | grep Reser | head
3597704 xeon16 normal Ba2HNNH- xxxx ecsvip   PENDING 260389 0:00 2021-04-20 6-06:00:00 8 128 3900M   (ReqNodeNotAvail, Reserved for maintenance)
3569139 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260289 0:00 2021-04-10 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3596587 xeon24 normal job      yyyy catvip   PENDING 260246 0:00 2021-04-20 2-00:00:00 6 144 10000M  (ReqNodeNotAvail, Reserved for maintenance)
3569162 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260181 0:00 2021-04-10 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575434 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260096 0:00 2021-04-12 2-02:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575447 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260068 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575448 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259983 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575466 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259915 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3576381 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259747 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3576437 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259728 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)

Do you have any ideas how to avoid this problem?

Thanks,
Ole
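A quick way to gauge how widespread the unexpected reason is across the pending queue is to tally squeue's reason field; a small sketch using standard squeue format options:

# Count pending jobs per reason (-h drops the header, %r prints the reason)
$ squeue -t PD -h -o "%r" | sort | uniq -c | sort -rn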
Comment 3 (Ben Roberts)

Hi Ole,

I think you may be seeing this Reason for queued jobs that don't have resources to start immediately. When the reservation is in place, it is shown as the reason whenever there aren't resources free, even though the jobs would be able to run in the reservation. Here's an example showing this. I create a reservation on all my nodes, as shown previously:

$ scontrol create reservation reservationname=exclude_account starttime=13:40:00 duration=30:00 flags=ignore_jobs,magnetic,flex nodes=ALL accounts=-sub1
Reservation created: exclude_account

Then I submit a job that uses all the nodes and a second job that requests just one:

$ sbatch -N19 --exclusive -t10:00 -Asub2 --wrap='srun sleep 600'
Submitted batch job 26227
$ sbatch -N1 --exclusive -t10:00 -Asub2 --wrap='srun sleep 10'
Submitted batch job 26228

The system-wide job starts first, so my small job isn't able to start and shows a Reason of "ReqNodeNotAvail, Reserved for maintenance":

$ squeue
 JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
 26228     debug  wrap   ben PD  0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
 26227     debug  wrap   ben  R  0:13     19 kitt,node[01-18]

In the output you sent I only see the jobs that have this as a reason, so I don't know what the availability of system resources looked like at the time you got that output. Did you see some jobs start during this time? Let me know if the example I showed doesn't seem to apply in your case.

Thanks,
Ben
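One way to answer the question Ben raises here, i.e. whether any resources were actually free when the reason was reported, is to snapshot node states alongside the queue; a minimal sketch with standard sinfo options:

# Summarize node counts per partition by state (alloc, idle, mix, drain, ...)
$ sinfo -o "%P %a %D %t"

Nodes reported as idle while jobs show "(ReqNodeNotAvail, Reserved for maintenance)" would suggest the displayed reason, rather than a genuine resource shortage, is the problem.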
Comment 4 (Ole.H.Nielsen)

Hi Ben,

(In reply to Ben Roberts from comment #3)
> I think you may be seeing this Reason for queued jobs that don't have
> resources to start immediately. When the reservation is in place it's
> showing that as a reason when there aren't resources free, even though the
> jobs would be able to run in the reservation. Here's an example showing
> this.

It seems to me that your reservation exclude_account impacts all accounts (such as sub2 and its two jobs), when my expectation was that it would only be noticeable to users in the sub1 account.

> The system-wide job starts first, so my small job isn't able to start and
> shows a Reason of "ReqNodeNotAvail, Reserved for maintenance".
> $ squeue
>  JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
>  26228     debug  wrap   ben PD  0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
>  26227     debug  wrap   ben  R  0:13     19 kitt,node[01-18]

Yes, this is what I don't understand: I would expect job 26228 to be Pending with a state of Resources instead, just as if the reservation exclude_account didn't exist.

> In the output you sent I only see the jobs that have this as a reason, so I
> don't know what the availability of system resources looked like at the time
> that you got that output. Did you see some jobs start during this time?
> Let me know if the example I showed doesn't seem to apply in your case.

I created a new reservation now, just as in Comment 2, and all Pending jobs now have the unexpected Reason (except for jobs with a Dependency):

$ squeue -t PD | head
JOBID   PARTITION QOS    NAME     USER ACCOUNT  STATE   PRIORITY TIME SUBMIT_TIM TIME_LIMIT NODES CPUS MIN_MEM NODELIST(REASON)
3603165 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603164 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603163 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603162 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603161 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603160 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603166 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349737   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3598955 xeon16    normal graphene yyy  camdvip  PENDING 328415   0:00 2021-04-20 2:00:00    2     32   62.50G  (Dependency)
3602892 xeon16    normal FeCN     zzz  ecsvip   PENDING 303439   0:00 2021-04-21 7-00:00:00 6     96   3900M   (ReqNodeNotAvail, Reserved for maintenance)

In the slurmctld logfile I do see that jobs are starting both before and after the reservation's starttime, so that part seems to work correctly. Maybe I don't understand the concept of Reservations deeply enough, but in the present scenario I think the Reason=(ReqNodeNotAvail, Reserved for maintenance) should not be printed for accounts that are unaffected by the reservation.

I guess my observations boil down to a request, referring to your example in Comment 3:

1. Jobs in account sub1 should be blocked with Reason=(ReqNodeNotAvail, Reserved for maintenance).
2. Jobs in all other accounts (such as sub2) should simply be Pending with Reason=Resources, just as if the reservationname=exclude_account didn't exist at all.

Does this make sense to you? If so, could I ask the Slurm developers to consider this request for a future 20.11.x version?

Thanks,
Ole

Comment 5 (Ben Roberts)

Hi Ole,

Thanks for your patience while we looked at this. Changing the behavior so that the Reason for jobs in this situation shows "Resources" has been identified as an enhancement. This enhancement will be worked on by another engineer, but there isn't a target version identified for this work. We'll let you know when there is progress on it.

Thanks,
Ben
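While the enhancement is pending, Ole's observation that jobs keep starting despite the misleading reason can be double-checked from accounting records; a minimal sketch, assuming slurmdbd accounting is enabled, using the migrate_ecs window from Comment 2:

# List job allocations (-X) for all users (-a) that ran around the window
$ sacct -a -X -S 2021-04-20T10:00:00 -E 2021-04-20T13:00:00 -o JobID,Account,Start,End,State

Start times on both sides of 11:00 for accounts other than ecsstud would confirm that scheduling continued normally and only the displayed Reason is off.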