Ticket 15011

Summary: Future MAGNETIC reservations attracting jobs, preventing them from running until later
Product: Slurm
Reporter: Ali Nikkhah <alin4>
Component: reservations
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: alin4, azoshima, bas.vandervlies, ihmesa
Version: 22.05.3
Hardware: Linux
OS: Linux
Site: U WA Health Metrics
Version Fixed: 22.05.6 23.02.0pre1
Attachments: scontrol show res
reservations-2022-09-12
sinfo-2022-09-21
sshare-2022-09-21
squeue-affected-user-2022-09-21
squeue-2022-09-21
slurmctld.log
slurm.conf
slurmctld.log 20221004
operation output 20221004
job requested body json 20221004
squeue output
slurmctld.log 20221005

Description Ali Nikkhah 2022-09-20 19:23:53 MDT
Created attachment 26895 [details]
scontrol show res

When MAGNETIC reservations are scheduled for the future, some jobs submitted now under the account the reservation is assigned to will not run before the reservation starts, with reason "Reservation."

At the time we noticed this, there were five future reservations and no active reservations. A user submitted 300 jobs under the account assigned to all the upcoming reservations. All 300 jobs were pending: 200 had reason "Priority" and 100 had reason "Reservation." 184 nodes in the partition the jobs were submitted to were idle.

All jobs were launched with a request of 10 cores, 50GB memory, and a time limit of 84 minutes, all values the free nodes could accommodate.

We initially suspected priority and reset priorities manually to no avail. We then realized that there were, in fact, upcoming reservations and started poking at them. We then started removing the MAGNETIC flag from the upcoming reservations one at a time.

This happened on the morning of 2022-09-16:

1. Removed MAGNETIC flag from the soonest upcoming reservation (StartTime=2022-09-19T17:00:40 NodeCnt=50)
2. User's jobs all still pending, but now
  - 125 reason Reservation
  - 175 reason Priority
3. Removed MAGNETIC flag from the second soonest upcoming reservation (StartTime=2022-09-26T08:00:03 NodeCnt=113)
4. User's jobs all still pending, but now
  - 167 reason Reservation
  - 133 reason Priority
5. Removed MAGNETIC flag from the third soonest upcoming reservation (StartTime=2022-09-29T08:01:00 NodeCnt=113)
6. User's jobs all still pending, but now
  - 250 reason Reservation
  - 50 reason Priority
7. Removed MAGNETIC flag from the fourth soonest upcoming reservation (StartTime=2022-10-31T08:00:05 NodeCnt=113)
8. All of the user's jobs were scheduled and ran


We were not expecting future reservations to attract jobs so far in advance and prevent them from running; this is not documented behavior (that I can find).
Comment 1 Ali Nikkhah 2022-09-21 18:48:12 MDT
Created attachment 26915 [details]
reservations-2022-09-12

Adding more attachments with info from the most current instance of this.

Starting with scontrol show res.
Comment 2 Ali Nikkhah 2022-09-21 18:49:22 MDT
Created attachment 26916 [details]
sinfo-2022-09-21

sinfo
Comment 3 Ali Nikkhah 2022-09-21 18:51:29 MDT
Created attachment 26917 [details]
sshare-2022-09-21

sshare -a -l
Comment 4 Ali Nikkhah 2022-09-21 18:52:30 MDT
Created attachment 26918 [details]
squeue-affected-user-2022-09-21

squeue for affected user/account
Comment 5 Ali Nikkhah 2022-09-21 18:52:58 MDT
Created attachment 26919 [details]
squeue-2022-09-21

full squeue output
Comment 6 Ben Roberts 2022-09-22 09:26:28 MDT
Hi Ali,

This does sound like strange behavior.  I did some quick tests this morning to see if I could reproduce behavior similar to what you are reporting, but I haven't seen this happen to me yet.  The magnetic flag should have jobs that qualify run in the reservation first (if able) but should allow them to run on other resources if the reservation isn't available for whatever reason.  I appreciate you sending the output from the different commands, that did help.  I have a few more commands I'd like to have you run to see what's happening.  Can I get the 'scontrol show job' output for a few of the jobs from rmbarber that aren't running?  For example, jobs 31474879, 31474930, 31475024, 31475025 and 31475048.  Hopefully those jobs are still around.  If not, then I'd like a selection of jobs that aren't running with a Reason of 'Reservation' and some with a Reason of 'Priority'.

I would also like to see the current state of the nodes with sinfo.  

Is this problem affecting more than this one user?  Do other users qualify for the same reservations?  

Thanks,
Ben
Comment 7 azoshima 2022-09-22 11:26:57 MDT
Hi Ben

We could reproduce this behavior in dev environment as well.

It seems this issue happens when jobs are submitted via the API while future reservations exist.


The jobs are submitted via `/slurm/v0.0.36/job/submit`

body passed to post request:

```
{
  "script": "#!/bin/bash\necho hello",
  "jobs": [
      {
      "account": "infra",
      "argv": [],
      "cpus_per_task": 1,
      "environment": {"PATH": "/bin:/usr/bin/:/usr/local/bin/:/opt/slurm"},
      "memory_per_node": 128,
      "name": "api_test",
      "partition": "all.q"
      }
  ]
}
```


We have those test reservations:

```
ReservationName=infra-1663368965 StartTime=2022-11-03T16:55:41 EndTime=2022-11-05T17:55:41 Duration=2-01:00:00
   Nodes=gen-slurm-sarchive-d02 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=all.q Flags=NO_HOLD_JOBS_AFTER_END,MAGNETIC
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=infra Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=infra-1663866154 StartTime=2022-09-24T11:02:00 EndTime=2022-09-24T12:02:00 Duration=01:00:00
   Nodes=gen-slurm-sarchive-d02 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=all.q Flags=NO_HOLD_JOBS_AFTER_END,MAGNETIC
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=infra Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=infra-1663866171 StartTime=2022-09-25T11:02:00 EndTime=2022-09-25T12:02:00 Duration=01:00:00
   Nodes=gen-slurm-sarchive-d02 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=all.q Flags=NO_HOLD_JOBS_AFTER_END,MAGNETIC
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=infra Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=infra-1663866181 StartTime=2022-09-26T11:02:00 EndTime=2022-09-26T12:02:00 Duration=01:00:00
   Nodes=gen-slurm-sarchive-d02 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=all.q Flags=NO_HOLD_JOBS_AFTER_END,MAGNETIC
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=infra Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=infra-1663866196 StartTime=2022-09-27T11:03:00 EndTime=2022-09-27T12:03:00 Duration=01:00:00
   Nodes=gen-slurm-sarchive-d02 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=all.q Flags=NO_HOLD_JOBS_AFTER_END,MAGNETIC
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=infra Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=infra-1663866207 StartTime=2022-09-28T11:03:00 EndTime=2022-09-28T12:03:00 Duration=01:00:00
   Nodes=gen-slurm-sarchive-d02 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=all.q Flags=NO_HOLD_JOBS_AFTER_END,MAGNETIC
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=infra Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

```


Here is output of some of scontrol show job command:

```
JobId=5892008 HetJobId=5892008 HetJobOffset=0 JobName=api_test
   HetJobIdSet=5892008
   UserId=sadm_azoshima(701017) GroupId=Domain Users(50513) MCS_label=N/A
   Priority=46 Nice=0 Account=infra QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2022-09-22T10:18:27 EligibleTime=2022-09-22T10:18:27
   AccrueTime=2022-09-22T10:18:27
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-09-22T10:20:34 Scheduler=Main
   Partition=all.q AllocNode:Sid=10.158.157.160:1359344
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=128M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=128M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/
   StdErr=//slurm-5892008.out
   StdIn=/dev/null
   StdOut=//slurm-5892008.out
   Power=


JobId=5892007 HetJobId=5892007 HetJobOffset=0 JobName=api_test
   HetJobIdSet=5892007
   UserId=sadm_azoshima(701017) GroupId=Domain Users(50513) MCS_label=N/A
   Priority=46 Nice=0 Account=infra QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2022-09-22T10:18:27 EligibleTime=2022-09-22T10:18:27
   AccrueTime=2022-09-22T10:18:27
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-09-22T10:20:34 Scheduler=Main
   Partition=all.q AllocNode:Sid=10.158.157.141:3871985
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=128M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=128M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/
   StdErr=//slurm-5892007.out
   StdIn=/dev/null
   StdOut=//slurm-5892007.out
   Power=
```
Comment 8 azoshima 2022-09-22 11:31:53 MDT
Note: we also observe the same behavior on other API versions.

/slurm/v0.0.37/job/submit
/slurm/v0.0.38/job/submit
Comment 10 Ben Roberts 2022-09-23 13:19:41 MDT
Hi Ali,

Thanks for the additional information.  I tried submitting some test jobs with slurmrestd to see if I could reproduce the behavior you're seeing that way.  In my testing, all my jobs are able to start on nodes that don't have a future magnetic reservation on them.

Do jobs submitted normally with sbatch exhibit the same behavior on your test environment?  

Since you can reproduce this on your dev environment, can you enable higher level logging while you run some more test jobs?  You can enable debug logs like this:
scontrol setdebug debug3

When you're done you can set the logging back down to 'info' level like this:
scontrol setdebug info

Thanks,
Ben
Comment 11 azoshima 2022-09-23 16:40:33 MDT
Created attachment 26976 [details]
slurmctld.log

slurmctld.log
Comment 12 azoshima 2022-09-23 16:43:42 MDT
Hi Ben

Thanks for your reply.
Please find the attached slurmctld log.
We submitted jobs to reproduce the issue around the 15:33 log timestamp.

> Do jobs submitted normally with sbatch exhibit the same behavior on your test environment?  
Jobs submitted with sbatch seem fine.
Comment 13 azoshima 2022-09-27 11:11:54 MDT
Hi Ben,

do you have any update?
Comment 14 Jason Booth 2022-09-27 11:13:32 MDT
Ben is out of the office this week; however, I have Marshall looking over this issue for you, and he will reply once he has finished reviewing the information you have attached.
Comment 15 azoshima 2022-09-27 11:15:54 MDT
Thank you, Jason!
Comment 16 Marshall Garey 2022-09-27 16:41:25 MDT
I can reproduce this.

I have to make sure the job has a time limit so it will end before the reservation starts. Then backfill can schedule the job. Otherwise it works like a normal reservation - if the job's time limit would make the job overlap with the reservation, the job can't run because the nodes are reserved.
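The condition described above can be sketched as a simple check (an illustration of the reported behavior, not Slurm's actual backfill logic; the function name is mine):

```python
from datetime import datetime, timedelta

def fits_before_reservation(now: datetime, time_limit: timedelta,
                            resv_start: datetime) -> bool:
    """A job can backfill onto nodes held by a future reservation only
    if its time limit guarantees it ends before the reservation begins."""
    return now + time_limit <= resv_start

# Soonest reservation from this ticket, and the 84-minute jobs described:
now = datetime(2022, 9, 16, 9, 0)
resv_start = datetime(2022, 9, 19, 17, 0, 40)
print(fits_before_reservation(now, timedelta(minutes=84), resv_start))  # True
```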

Submitting with the rest API or normal salloc/sbatch/srun doesn't matter.

Do you have a default time limit on a partition? Do you set job time limits for jobs submitted not with the rest API but do not set time limits for jobs submitted with the rest API (and therefore they inherit a partition's MaxTime limit)?
Comment 17 azoshima 2022-09-27 17:03:09 MDT
Glad to hear you could reproduce the issue.

>Do you have a default time limit on a partition?

Yes, we set DefaultTime for each partition.

```
PartitionName=all.q Nodes=gen-slurm-sexec-p[0001-0049],gen-slurm-sarchive-p[0001-0213] OverSubscribe=No DefaultTime=24:00:00 MaxTime=72:00:00 State=UP Default=YES
PartitionName=long.q Nodes=long-slurm-sarchive-p[0001-0102] OverSubscribe=No DefaultTime=24:00:00 MaxTime=384:00:00 State=UP
PartitionName=i.q Nodes=int-slurm-sarchive-p[0001-0010] OverSubscribe=No DefaultTime=24:00:00 MaxTime=168:00:00 State=UP
PartitionName=d.q Nodes=modeldev-slurm-sarchive-p[0001-0020] OverSubscribe=No DefaultTime=24:00:00 MaxTime=168:00:00 AllowAccounts=proj_dq,proj_covid,proj_covid_prod,infra State=UP
``` 


>Do you set job time limits for jobs submitted not with the rest API but do not set time limits for jobs submitted with the rest API (and therefore they inherit a partition's MaxTime limit)?

No, I did not set a time limit for either sbatch or API jobs.


Please see the attached slurm.conf for details.
Comment 18 azoshima 2022-09-27 17:04:42 MDT
Created attachment 27018 [details]
slurm.conf
Comment 19 azoshima 2022-09-27 17:23:43 MDT
I tried specifying time_limit in the API request body to make sure it does not overlap with future reservations; however, I can still reproduce the issue.

```
{
  "script": "#!/bin/bash\necho hello",
  "jobs": [
      {
      "account": "infra",
      "argv": [],
      "cpus_per_task": 1,
      "environment": {"PATH": "/bin:/usr/bin/:/usr/local/bin/:/opt/slurm"},
      "memory_per_node": 128,
      "name": "api_test",
      "partition": "all.q",
      "time_limit": 100
      }
  ]
}
```
Comment 21 azoshima 2022-09-30 14:06:07 MDT
Hi Marshall,

do you have any update?
Comment 22 Marshall Garey 2022-10-04 09:13:31 MDT
Hi,

Thanks for your clarification that setting a time limit still does not fix the issue. I still cannot reproduce this. I looked at the slurmctld log but I don't see enough information to know why the job was not scheduled. Can you run the following test?

(It will be much the same information as you've already done, but this time with more detailed slurmctld logging and for a longer time period.)

New test:

scontrol setdebugflags +backfill,backfillmap
scontrol setdebug debug3
scontrol show reservation
submit one job with time limit
Wait for eight minutes
squeue -a -l
scontrol -d show job <jobid>
sinfo

Restore logging levels:
scontrol setdebugflags -backfill,backfillmap
scontrol setdebug info


Can you upload:
* the job submit json/yaml
* The slurmctld log file (compressed)
* The output of scontrol show reservation, squeue -a -l, scontrol -d show job <jobid>, sinfo
* The approximate timestamp of when you submitted the job
* The job id (should be the same as scontrol show job <jobid>)

Thanks

- Marshall
Comment 23 azoshima 2022-10-04 12:04:46 MDT
Created attachment 27113 [details]
slurmctld.log 20221004
Comment 24 azoshima 2022-10-04 12:05:12 MDT
Created attachment 27114 [details]
operation output 20221004
Comment 25 azoshima 2022-10-04 12:07:24 MDT
Created attachment 27115 [details]
job requested body json 20221004
Comment 26 azoshima 2022-10-04 12:15:30 MDT
Thank you, Marshall.
We followed your instructions and tested.

First, we submitted 1 job with a time limit via the API; we could not reproduce the issue (around 10:52).
Second, we submitted 5 jobs with time limits via the API; we could reproduce it (around 10:54).




>* the job submit json/yaml
Please find the attached. ("job requested body json 20221004")

>* The slurmctld log file (compressed)
Please find the attached. ("slurmctld.log 20221004")

>* The output of scontrol show reservation, squeue -a -l, scontrol -d show job <jobid>, sinfo
Please find the attached operations log. ("operation output 20221004")

>* The approximate timestamp of when you submitted the job
Submitted 1 job at 10:52 (we could not reproduce)
Submitted 5 jobs at 10:54 (we could reproduce)


>* The job id (should be the same as scontrol show job <jobid>)
first 1 job
jobid: 5892247

second 5 jobs
jobid:
5892252
5892251
5892250
5892249
5892248


Thank you,
Comment 27 Marshall Garey 2022-10-05 14:53:54 MDT
Thank you, that was very helpful. I can reproduce this now. The issue is not the rest API. The issue is heterogeneous jobs. The key clue was the output of squeue:

(base) sadm_azoshima@gen-slurm-slogin-d01:~$ date; squeue -a -l
Tue Oct  4 10:54:18 PDT 2022
Tue Oct 04 10:54:18 2022
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
         5892252+0     all.q api_test sadm_azo  PENDING       0:00   1:40:00      1 (None)
         5892251+0     all.q api_test sadm_azo  PENDING       0:00   1:40:00      1 (None)
         5892250+0     all.q api_test sadm_azo  PENDING       0:00   1:40:00      1 (None)
         5892249+0     all.q api_test sadm_azo  PENDING       0:00   1:40:00      1 (None)
         5892248+0     all.q api_test sadm_azo  PENDING       0:00   1:40:00      1 (None)

The plus (+) in the job id indicates that it is a hetjob, although these are strange because they have just one component.

You are using "jobs" rather than "job" to submit the jobs with the rest API, even though you are just submitting a single job.

https://slurm.schedmd.com/rest_api.html#v0.0.38_job_submission

  "jobs": [
      {
... stuff ...
      }
  ]


If you change that to "job" and remove the array (square brackets), then it won't submit the job as a het job.

Then your jobs should be able to run.


I don't know why het jobs aren't running with a magnetic reservation. I confirmed that it happens with a magnetic reservation, but not in a reservation without the magnetic flag.

It actually happens that het jobs won't run in a magnetic reservation at all unless they request the reservation. Also the issue is not just with future magnetic reservations but also with current magnetic reservations.

Can you change your REST request to use the singular "job" (not "jobs") so it will submit a non-het job?
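As a sketch of the difference (built from the request body earlier in this ticket; the helper variable names are illustrative, not part of slurmrestd), the two payload shapes can be compared side by side:

```python
import json

# Fields taken from the request body used earlier in this ticket.
job_spec = {
    "account": "infra",
    "cpus_per_task": 1,
    "environment": {"PATH": "/bin:/usr/bin/:/usr/local/bin/:/opt/slurm"},
    "memory_per_node": 128,
    "name": "api_test",
    "partition": "all.q",
}

# Original payload: "jobs" as an array submits a heterogeneous job,
# even when it contains a single component.
het_payload = {"script": "#!/bin/bash\necho hello", "jobs": [job_spec]}

# Corrected payload: singular "job" with no array submits a normal job.
normal_payload = {"script": "#!/bin/bash\necho hello", "job": job_spec}

print(json.dumps(normal_payload, indent=2))
```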
Comment 29 azoshima 2022-10-05 16:33:32 MDT
Thank you, Marshall.

> Can you change you rest request to use the singular "job" (not "jobs") so it will submit a non-het job?

We just tested with normal (non-het) jobs.
If we submit a small number of jobs, we cannot reproduce the issue, so using non-het jobs seems to mitigate this somewhat.

However, if we submit a large number of jobs, we are still able to reproduce it (we submitted 100 jobs).
Comment 30 Marshall Garey 2022-10-05 16:35:03 MDT
(In reply to azoshima from comment #29)
> Thank you, Marshall.
> 
> > Can you change you rest request to use the singular "job" (not "jobs") so it will submit a non-het job?
> 
> We just tested as normal jobs (non-het job).
> If we just submit a small number of jobs, we could not reproduce the issue.
> So by using non-het job, it seems it is mitigating this issue somewhat.
> 
> However, if we submit a large number of jobs, we still be able to reproduce.
> (we submitted 100 jobs to reproduce)

Can you clarify what happened when submitting a large number of jobs? Did any start running?
Comment 31 azoshima 2022-10-05 16:49:14 MDT
Created attachment 27138 [details]
squeue output
Comment 32 azoshima 2022-10-05 16:49:58 MDT
Created attachment 27139 [details]
slurmctld.log 20221005
Comment 33 azoshima 2022-10-05 16:51:42 MDT
> Can you clarify what happened when submitting a large number of jobs? Did any start running?

Some jobs started running successfully, some are PD with reason Resources, and some are PD with reason Reservation.

Please find the attached slurmctld.log and squeue output for your reference.
Comment 36 Marshall Garey 2022-10-06 09:17:49 MDT
Looking at your latest notes:

In the squeue output, all of these jobs were pending, but in the slurmctld log I see they all ran and completed:

           5892584     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892583     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892582     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892581     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892580     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892579     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892578     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892577     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892576     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892575     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892574     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892573     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892572     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892571     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892570     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892569     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892568     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892567     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892566     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892565     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892564     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892563     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892562     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892561     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892560     all.q api_test sadm_azo PD       0:00      1 (Reservation)
           5892559     all.q api_test sadm_azo PD       0:00      1 (Reservation)


For the remaining jobs in the squeue output, the slurmctld log just shows them pending (not started). However, I only received a snippet of the slurmctld log. Did the remaining jobs start after this log ended?

I have a patch to fix future magnetic reservations preventing het jobs from running now; however, the patch doesn't apply to non-het jobs.
Comment 39 azoshima 2022-10-06 10:23:57 MDT
> Did the remaining jobs start after this log ended?
Yes, the remaining jobs completed after this log.


> I have a patch to fix future magnetic reservations preventing het jobs from running now; however, the patch doesn't apply to non-het jobs.
Thats great! Thank you.



Looking at the log, it looks like job tests are executed against all of the reservations. So if there are 4 reservations, does that mean 5 tests (4 with a reservation, 1 without) will be executed per job? If that's the case, we may be hitting this sched param: bf_max_job_user=250
Comment 40 Marshall Garey 2022-10-06 10:33:18 MDT
(In reply to azoshima from comment #39)
> Looking at the log, it looks job tests are executed with all of
> reservations. so if there are 4 reservations, does it mean 5 tests (4 for
> with reservation. 1 for without reservation) will be executed per job? If
> that's the case, it may be hitting this sched param: bf_max_job_user=250

Yes.

Each reservation the job requests has its own queue record. If the job did not request reservations but there are magnetic reservations, there's a separate queue record for each magnetic reservation. In addition, each partition will have a separate queue record. Also, if you use the --prefer option, that will create another separate queue record.

There are quite a few ways in which separate queue records are created for a single job. This is done to evaluate the job with a bunch of different conditions and hopefully start the job as soon as possible. We also have an option in slurm.conf to create only a single backfill "reservation" or plan (not the same as a reservation created by scontrol) per job, even if there are multiple queue records:

https://slurm.schedmd.com/slurm.conf.html#OPT_bf_one_resv_per_job
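As a rough illustrative model of this multiplication (a simplification based on the description above, not Slurm's actual accounting), the number of queue records per pending job can be sketched as:

```python
def queue_records_per_job(num_partitions: int,
                          num_requested_resvs: int,
                          num_magnetic_resvs: int,
                          uses_prefer: bool = False) -> int:
    """Rough model of queue records created for one pending job.

    Per the explanation above: each explicitly requested reservation gets
    its own record; if none were requested, there is one record per
    magnetic reservation plus one without a reservation. Each partition
    multiplies this, and --prefer creates another separate record.
    """
    if num_requested_resvs:
        resv_records = num_requested_resvs
    else:
        resv_records = num_magnetic_resvs + 1
    records = num_partitions * resv_records
    if uses_prefer:
        records += 1  # --prefer creates another separate queue record
    return records

# The scenario from this ticket: one partition, no requested reservation,
# four future magnetic reservations -> 5 evaluations per job, so 50 such
# jobs can exhaust bf_max_job_user=250.
print(queue_records_per_job(1, 0, 4))  # 5
```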
Comment 41 Marshall Garey 2022-10-06 10:34:34 MDT
(In reply to Marshall Garey from comment #40)
> There are quite a few ways in which separate queue records are created for a
> single job. This is done to evaluate the job with a bunch of different
> conditions and hopefully start the job as soon as possible. We also have an
> option in slurm.conf to create only a single backfill "reservation" or plan
> (not the same as a reservation created by scontrol) per job, even if there
> are multiple queue records:
> 
> https://slurm.schedmd.com/slurm.conf.html#OPT_bf_one_resv_per_job

Clarification: I'm not recommending that you do or don't set this parameter, I just wanted to make you aware of it so you can use if you need it.
Comment 42 azoshima 2022-10-06 11:40:56 MDT
Thanks for the info.

We just tested with the "bf_one_resv_per_job" sched param. However, it did not change the behavior; it still creates test queue records for multiple reservations.
Comment 43 Marshall Garey 2022-10-06 11:42:29 MDT
(In reply to azoshima from comment #42)
> Thanks for the info.
> 
> We just tested with "bf_one_resv_per_job" sched param. However, it did not
> change the behavior. It still create test queue records for multiple
> reservations.

It doesn't change the fact that multiple queue records must be created to handle the variety of conditions I mentioned (multiple reservations, partitions, prefer). However, the backfill scheduler will only plan resources for the first queue record, and the remaining queue records will only be tested to see if they can run.
Comment 46 Marshall Garey 2022-10-07 16:09:02 MDT
We have pushed a fix for future magnetic reservations preventing het jobs from starting now:

commit 480b5db241

It will be included in 22.05.4.


So now I'd just like to confirm that you don't have any issues with non-het jobs. If you increase the bf_max_job_user option, do you see the (non-het) jobs running sooner?
Comment 47 azoshima 2022-10-07 17:10:51 MDT
> So now I'd just like to confirm that you don't have any issues with non-het jobs. If you increase the bf_max_job_user option, do you see the (non-het) jobs running sooner?

In our environment, the main issue is that magnetic reservations are blocking users' jobs because reservations are multiplying the number of queue records.

We set bf_max_job_user to prevent single user overwhelming the cluster. Is there any other workaround for this? And do you have any plan to create a patch to prevent this issue?
Comment 48 Marshall Garey 2022-10-10 10:17:05 MDT
(In reply to azoshima from comment #47)
> > So now I'd just like to confirm that you don't have any issues with non-het jobs. If you increase the bf_max_job_user option, do you see the (non-het) jobs running sooner?
> 
> In our environment, the main issue is that magnetic reservation is blocking
> user's jobs because reservations are multiplying the number of queue records.
> 
> We set bf_max_job_user to prevent single user overwhelming the cluster. Is
> there any other workaround for this?

Accrue limits. These limit the number of jobs that will accrue priority due to age. This was added specifically to address the issue you are concerned about (single users taking over the queue).

Search for "accrue" in the following documents:

https://slurm.schedmd.com/resource_limits.html
https://slurm.schedmd.com/sacctmgr.html


>And do you have any plan to create a patch to prevent this issue?

It's not a bug - it's just how it is designed. We need to create separate queue records for each partition, magnetic reservation, prefer constraints, etc. In addition, each queue record needs to be able to be in the backfill map, and that means it counts towards backfill limits.

I recommend increasing the bf_max_job_user limit in addition to using accrue limits.
Comment 49 Ali Nikkhah 2022-10-10 15:11:21 MDT
Thanks Marshall. Response in-line below.

(In reply to Marshall Garey from comment #48)
> (In reply to azoshima from comment #47)
> > > So now I'd just like to confirm that you don't have any issues with non-het jobs. If you increase the bf_max_job_user option, do you see the (non-het) jobs running sooner?
> > 
> > In our environment, the main issue is that magnetic reservation is blocking
> > user's jobs because reservations are multiplying the number of queue records.
> > 
> > We set bf_max_job_user to prevent single user overwhelming the cluster. Is
> > there any other workaround for this?
> 
> Accrue limits. These limit the number of jobs that will accrue priority due
> to age. This was added specifically to address the issue you are concerned
> about (single users taking over the queue).
> 
> Search for "accrue" in the following documents:
> 
> https://slurm.schedmd.com/resource_limits.html
> https://slurm.schedmd.com/sacctmgr.html
>

We currently do not have any priority accruing due to age; we are relying entirely upon fairshare. We need to seriously consider whether this is doable and whether the effects are acceptable.
 
> 
> >And do you have any plan to create a patch to prevent this issue?
> 
> It's not a bug - it's just how it is designed. We need to create separate
> queue records for each partition, magnetic reservation, prefer constraints,
> etc. In addition, each queue record needs to be able to be in the backfill
> map, and that means it counts towards backfill limits.
> 

Just to clarify: you are saying that it is intended behavior for future MAGNETIC reservations to block jobs from running now, even if the jobs could run now (which is the side effect of this design, as I understand it)? I am having a hard time understanding why that would be desired behavior. Can you elaborate on why this is desired? Is there a way to force the non-reservation queue to take priority over future reservation queues?

> I recommend increasing the bf_max_job_user limit in addition to using accrue
> limits.

We very specifically tuned bf_max_job_user based on issues we previously had with a single user dominating the queue, plus our expected cluster usage. Our concern with increasing this is the effect it will have on the responsiveness of the scheduler. Combined with my comment above regarding age, it seems like this may not work well for us.

What happens if we tune bf_max_job_user to, say, 1,000, but a user submits 10,000 jobs in the advanced MAGNETIC reservation scenario? Wouldn't we run into the same issue, just on a larger scale?
Comment 51 Marshall Garey 2022-10-11 17:25:21 MDT
RE looking into accrue limits:

Regardless of the outcome of this bug, I encourage you to look into this to see if it is viable for your site.

RE queue records for magnetic reservations:

It's certainly not intended to block jobs from running, though a bf_max_job_user limit makes it more likely to happen. Backfill spends time trying to schedule each queue record, so we count that against backfill limits.

> Can you elaborate on why this is desired?

It's certainly not desired to delay jobs, but it is desired to enforce backfill limits and to generate accurate backfill statistics (see sdiag).


> Is there a way to force
> the non-reservation queue to take priority over future reservation queues?

Not currently. Reservation queue records take priority over non-reservation queue records.

However, this causes a problem with magnetic reservations: they are evaluated at a high priority in the queue because they are reservations, and there is a separate queue record for each magnetic reservation for each job. This means that a job will be evaluated against all magnetic reservations first before it is evaluated without a reservation. As you have observed, this can lead to hitting backfill limits.

If a job requests reservations explicitly, then those are also high priority. But, this doesn't typically make as many queue records (only equal to the number of reservations explicitly requested by the job), and it doesn't affect jobs that don't explicitly request reservations.

So, I understand your concerns. I'm looking into ways to improve this.

One way is to not count magnetic reservation queue records against backfill limits (which also affects backfill statistics). The backfill scheduler does not make a backfill reservation for magnetic reservation queue records. With this in mind, even though the backfill scheduler is spending time trying to schedule that queue record, it might be okay to not count the magnetic reservation queue records against backfill limits. (All other queue records not associated with magnetic reservation should still count against backfill limits.)
Comment 53 Ali Nikkhah 2022-10-13 09:55:31 MDT
Thanks for the additional discussion! We have increased bf_max_job_user and will do some testing with it. While we expect it will help, we may end up chasing the problem as our future MAGNETIC reservation count increases and users submit large batches of jobs.

We are also discussing details of testing and implementing accrue limits.
Comment 56 Marshall Garey 2022-10-19 15:23:37 MDT
Hi Ali,

We made a change to not count magnetic reservation queue records against backfill limits. This is in commit 9ddc272e4f and will be part of 22.05.6.

Thanks for reporting this! I'm closing this bug as fixed. Let me know if you have any questions.