| Summary: | Reservation with flags=ignore_jobs causes jobs to block with (ReqNodeNotAvail, Reserved for maintenance) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | reservations | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | OPEN | QA Contact: | |
| Severity: | 5 - Enhancement | Priority: | --- |
| Version: | 20.11.5 | Hardware: | Linux |
| OS: | Linux | Site: | DTU Physics |
Description
Ole.H.Nielsen@fysik.dtu.dk 2021-04-19 03:06:29 MDT
Comment 1 (Ben Roberts)

Hi Ole,

You've got the right idea for the reservation, but it looks like you're just missing a couple of flags that should get it to do what you want. If you add the 'FLEX' flag, jobs that qualify for the reservation are allowed to start before the reservation begins and continue running after it starts, rather than the default behavior of waiting until they can start during the time of the reservation. Another flag you would want to add is the 'MAGNETIC' flag. This makes it so that any job that qualifies for the reservation is allowed to run in that reservation without having requested it at submit time.

Here's an example of how it would look with these flags added to what you were already doing:

$ scontrol create reservation reservationname=exclude_account starttime=12:10:00 duration=30:00 flags=ignore_jobs,magnetic,flex nodes=ALL accounts=-sub1
Reservation created: exclude_account

$ scontrol show res
ReservationName=exclude_account StartTime=2021-04-19T12:10:00 EndTime=2021-04-19T12:40:00 Duration=00:30:00
   Nodes=kitt,node[01-18] NodeCnt=19 CoreCnt=456 Features=(null) PartitionName=(null)
   Flags=FLEX,IGNORE_JOBS,SPEC_NODES,ALL_NODES,MAGNETIC TRES=cpu=456
   Users=(null) Groups=(null) Accounts=-sub1 Licenses=(null) State=INACTIVE
   BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

I submit one job to the account that is excluded (sub1) and another to an account that is able to run (sub2):

$ sbatch -N1 -t10:00 -Asub1 --wrap='srun sleep 600'
Submitted batch job 26216
$ sbatch -N1 -t10:00 -Asub2 --wrap='srun sleep 600'
Submitted batch job 26217

The job requesting the 'sub2' account is able to start while the 'sub1' job is held:

$ squeue
 JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
 26216     debug  wrap   ben PD  0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
 26217     debug  wrap   ben  R  0:02      1 node01

Let me know if you have any questions about this or if you don't see the same behavior.

Thanks,
Ben
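For reference, a reservation created this way can be inspected, adjusted, and removed with the standard scontrol subcommands; a minimal sketch, reusing the reservation name from the example above:

# Check the reservation; State flips from INACTIVE to ACTIVE at StartTime
$ scontrol show reservation exclude_account

# Adjust the window in place rather than deleting and recreating it
$ scontrol update reservationname=exclude_account duration=60:00

# Remove the reservation once the maintenance work is done
$ scontrol delete reservationname=exclude_account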
Comment 2 (Ole.H.Nielsen)

Hi Ben,

Thanks for the useful suggestions of flags=flex,magnetic. I just now created a new reservation about 1 hour into the future:

$ scontrol create reservation starttime=11:00:00 duration=1:00:00 flags=ignore_jobs,magnetic,flex ReservationName=migrate_ecs nodes=ALL Accounts=-ecsstud
Reservation created: migrate_ecs

$ scontrol show reservation
ReservationName=migrate_ecs StartTime=2021-04-20T11:00:00 EndTime=2021-04-20T12:00:00 Duration=01:00:00
   Nodes=a[001-128],b[001-012],c[001-196],d[001-019,021-033,035-054,056-068],g[001-021,024-066,068-110],h[001-002],i[004-050],s[001-004],x[001-192]
   NodeCnt=753 CoreCnt=21224 Features=(null) PartitionName=(null)
   Flags=FLEX,IGNORE_JOBS,SPEC_NODES,ALL_NODES,MAGNETIC TRES=cpu=21384
   Users=(null) Groups=(null) Accounts=-ecsstud Licenses=(null) State=INACTIVE
   BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

Unfortunately, this still causes queued jobs to get the incorrect state (ReqNodeNotAvail, Reserved for maintenance):

$ squeue | grep Reser | head
3597704 xeon16 normal Ba2HNNH- xxxx ecsvip   PENDING 260389 0:00 2021-04-20 6-06:00:00 8 128 3900M   (ReqNodeNotAvail, Reserved for maintenance)
3569139 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260289 0:00 2021-04-10 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3596587 xeon24 normal job      yyyy catvip   PENDING 260246 0:00 2021-04-20 2-00:00:00 6 144 10000M  (ReqNodeNotAvail, Reserved for maintenance)
3569162 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260181 0:00 2021-04-10 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575434 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260096 0:00 2021-04-12 2-02:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575447 xeon24 normal asr.gs@c zzzz camdvip  PENDING 260068 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575448 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259983 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3575466 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259915 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3576381 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259747 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)
3576437 xeon24 normal asr.gs@c zzzz camdvip  PENDING 259728 0:00 2021-04-12 2-00:00:00 3  72 250000M (ReqNodeNotAvail, Reserved for maintenance)

Do you have any ideas how to avoid this problem?

Thanks,
Ole
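A quick way to gauge how widespread the unexpected reason is across the pending queue is to tally squeue's reason field; a small sketch using standard squeue format options:

# Count pending jobs per reason (-h drops the header, %r prints the reason)
$ squeue -t PD -h -o "%r" | sort | uniq -c | sort -rn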
Comment 3 (Ben Roberts)

Hi Ole,

I think you may be seeing this Reason for queued jobs that don't have resources to start immediately. When the reservation is in place, it is shown as the reason whenever there aren't resources free, even though the jobs would be able to run in the reservation. Here's an example showing this. I create a reservation on all my nodes, as shown previously:

$ scontrol create reservation reservationname=exclude_account starttime=13:40:00 duration=30:00 flags=ignore_jobs,magnetic,flex nodes=ALL accounts=-sub1
Reservation created: exclude_account

Then I submit a job that uses all the nodes and a second job that requests just one:

$ sbatch -N19 --exclusive -t10:00 -Asub2 --wrap='srun sleep 600'
Submitted batch job 26227
$ sbatch -N1 --exclusive -t10:00 -Asub2 --wrap='srun sleep 10'
Submitted batch job 26228

The system-wide job starts first, so my small job isn't able to start and shows a Reason of "ReqNodeNotAvail, Reserved for maintenance":

$ squeue
 JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
 26228     debug  wrap   ben PD  0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
 26227     debug  wrap   ben  R  0:13     19 kitt,node[01-18]

In the output you sent I only see the jobs that have this as a reason, so I don't know what the availability of system resources looked like at the time you got that output. Did you see some jobs start during this time? Let me know if the example I showed doesn't seem to apply in your case.

Thanks,
Ben
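One way to answer the question Ben raises here, i.e. whether any resources were actually free when the reason was reported, is to snapshot node states alongside the queue; a minimal sketch with standard sinfo options:

# Summarize node counts per partition by state (alloc, idle, mix, drain, ...)
$ sinfo -o "%P %a %D %t"

Nodes reported as idle while jobs show "(ReqNodeNotAvail, Reserved for maintenance)" would suggest the displayed reason, rather than a genuine resource shortage, is the problem.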
Comment 4 (Ole.H.Nielsen)

Hi Ben,

(In reply to Ben Roberts from comment #3)
> I think you may be seeing this Reason for queued jobs that don't have
> resources to start immediately. When the reservation is in place it's
> showing that as a reason when there aren't resources free, even though the
> jobs would be able to run in the reservation. Here's an example showing
> this.

It seems to me that your reservation exclude_account impacts all accounts (such as sub2 and its two jobs), when my expectation was that it would only be noticeable to users in the sub1 account.

> The system-wide job starts first, so my small job isn't able to start and
> shows a Reason of "ReqNodeNotAvail, Reserved for maintenance".
> $ squeue
>  JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
>  26228     debug  wrap   ben PD  0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
>  26227     debug  wrap   ben  R  0:13     19 kitt,node[01-18]

Yes, this is what I don't understand: I would expect job 26228 to be Pending with a state of Resources instead, just as if the reservation exclude_account didn't exist.

> In the output you sent I only see the jobs that have this as a reason, so I
> don't know what the availability of system resources looked like at the time
> that you got that output. Did you see some jobs start during this time?
> Let me know if the example I showed doesn't seem to apply in your case.

I created a new reservation now, just as in Comment 2, and all Pending jobs now have the unexpected Reason (except for jobs with a Dependency):

$ squeue -t PD | head
JOBID   PARTITION QOS    NAME     USER ACCOUNT  STATE   PRIORITY TIME SUBMIT_TIM TIME_LIMIT NODES CPUS MIN_MEM NODELIST(REASON)
3603165 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603164 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603163 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603162 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603161 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603160 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349746   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3603166 xeon16    normal FeCoNi_n xxx  camdstud PENDING 349737   0:00 2021-04-21 1-00:00:00 1     16   60000M  (ReqNodeNotAvail, Reserved for maintenance)
3598955 xeon16    normal graphene yyy  camdvip  PENDING 328415   0:00 2021-04-20 2:00:00    2     32   62.50G  (Dependency)
3602892 xeon16    normal FeCN     zzz  ecsvip   PENDING 303439   0:00 2021-04-21 7-00:00:00 6     96   3900M   (ReqNodeNotAvail, Reserved for maintenance)

In the slurmctld logfile I do see that jobs are starting both before and after the reservation's starttime, so that part seems to work correctly. Maybe I don't understand the concept of Reservations deeply enough, but in the present scenario I think the Reason=(ReqNodeNotAvail, Reserved for maintenance) should not be printed for accounts that are unaffected by the reservation.

I guess my observations boil down to a request, referring to your example in Comment 3:

1. Jobs in account sub1 should be blocked with Reason=(ReqNodeNotAvail, Reserved for maintenance).
2. Jobs in all other accounts (such as sub2) should simply be Pending with Reason=Resources, just as if the reservationname=exclude_account didn't exist at all.

Does this make sense to you? If so, could I ask the Slurm developers to consider this request for a future 20.11.x version?

Thanks,
Ole

Comment 5 (Ben Roberts)

Hi Ole,

Thanks for your patience while we looked at this. Changing the behavior so that the Reason for jobs in this situation shows "Resources" has been identified as an enhancement. This enhancement will be worked on by another engineer, but there isn't a target version identified for this work. We'll let you know when there is progress on it.

Thanks,
Ben
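While the enhancement is pending, Ole's observation that jobs keep starting despite the misleading reason can be double-checked from accounting records; a minimal sketch, assuming slurmdbd accounting is enabled, using the migrate_ecs window from Comment 2:

# List job allocations (-X) for all users (-a) that ran around the window
$ sacct -a -X -S 2021-04-20T10:00:00 -E 2021-04-20T13:00:00 -o JobID,Account,Start,End,State

Start times on both sides of 11:00 for accounts other than ecsstud would confirm that scheduling continued normally and only the displayed Reason is off.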