Ticket 3140

Summary: Reservations and CoreCnt
Product: Slurm Reporter: Nicholas McCollum <nmccollum>
Component: User Commands    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN    QA Contact: ---
Severity: 5 - Enhancement    
Priority: ---    
Version: 15.08.10   
Hardware: Linux   
OS: Linux   
Site: ASC
Attachments: ASC slurm.conf

Description Nicholas McCollum 2016-10-03 12:19:19 MDT
Trying to add a fixed amount of cores on another machine to an existing reservation:

$ scontrol show res
ReservationName=class StartTime=2016-08-24T13:02:24 EndTime=2017-08-24T13:02:24 Duration=365-00:00:00
   Nodes=dmc[1,66-71,73,75,126,197] NodeCnt=11 CoreCnt=108 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=108
   Users=(null) Accounts=class Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a


$ scontrol update res ReservationName=class Nodes=dmc[1,126,66,67,68,69,70,71,73,75,197],uv1 CoreCnt=20,8,8,8,8,8,8,8,8,8,16,24
Reservation updated.


$ scontrol show res
ReservationName=class StartTime=2016-08-24T13:02:24 EndTime=2017-08-24T13:02:24 Duration=365-00:00:00
   Nodes=dmc[1,66-71,73,75,126,197],uv1 NodeCnt=12 CoreCnt=364 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=364
   Users=(null) Accounts=class Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

Obviously, had this command completed correctly, the CoreCnt would be 132 instead of 364.  This seems to work sometimes and not other times, which is a bit baffling.
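For reference, summing the per-node CoreCnt values from the update command above gives the expected total:

```shell
# Sum of the per-node CoreCnt values from the scontrol update command;
# the reservation total should match this, not the 364 scontrol reported.
expected=$((20+8+8+8+8+8+8+8+8+8+16+24))
echo "$expected"   # 132
```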

I really need to be able to add and remove nodes from the class reservation smoothly.  As a feature, something like: 

scontrol update res ReservationName=class Nodes+=uv1 CoreCnt=24

Would be amazing... Thanks.
Comment 1 Danny Auble 2016-10-03 16:47:58 MDT
Hey Nicholas, it doesn't appear there is code to handle updates to core-based reservations.

From the code...

	/* FIXME: Support more core based reservation updates */

It appears the check that should send you an error is incorrect.  I can look into why it is failing.

Could you send me the command you are using to make the reservation in the first place?

If I had to guess though, I am guessing the reservation really isn't a core-based reservation and you are just belt-and-suspendering the update (requesting nodes as well as the CoreCnt of each node).  You should only have to request the specific nodes you want.  CoreCnt is only needed when reserving less than a whole node, and I don't think that is what is happening here.  Please correct me if I am wrong.
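For illustration only (the reservation names, durations, and option values here are assumptions, not taken from this ticket), the distinction looks like this: a whole-node reservation needs no CoreCnt, while a core-based reservation pairs Nodes= with per-node CoreCnt= values:

```shell
# Hypothetical whole-node reservation: no CoreCnt needed.
scontrol create reservation ReservationName=class Accounts=class \
    StartTime=now Duration=365-00:00:00 Nodes=dmc[1,66-71,73,75,126,197]

# Hypothetical core-based reservation: 24 of uv1's 256 cores.
scontrol create reservation ReservationName=uvslice Accounts=class \
    StartTime=now Duration=7-00:00:00 Nodes=uv1 CoreCnt=24
```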

It should be fairly easy to do the +|-= nodename (in theory anyway) but that isn't there today sadly.  Perhaps we can turn this into a feature request when we are done with the initial problem.

Could you send me your slurm.conf file as well?
Comment 2 Nicholas McCollum 2016-10-04 16:41:21 MDT
The node uv1 has 256 processors and I would like to reserve 24 of them to the class reservation.

I can upload my slurm.conf
Comment 3 Nicholas McCollum 2016-10-04 16:43:40 MDT
Created attachment 3563 [details]
ASC slurm.conf
Comment 6 Nicholas McCollum 2016-10-05 12:40:28 MDT
Just to recap: I was able to get the reservation-specific core count on my UV2000 by deleting the entire class reservation and remaking it with Nodes= and CoreCnt=.
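The delete-and-recreate workaround described above would look something like this (a sketch only; the time values are carried over from the reservation shown in the description, and the per-node core counts mirror the earlier update command):

```shell
# Hypothetical sketch of the delete-and-recreate workaround.
scontrol delete ReservationName=class
scontrol create reservation ReservationName=class Accounts=class \
    StartTime=2016-08-24T13:02:24 EndTime=2017-08-24T13:02:24 \
    Nodes=dmc[1,66-71,73,75,126,197],uv1 \
    CoreCnt=20,8,8,8,8,8,8,8,8,8,16,24 Flags=IGNORE_JOBS
```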

Unfortunately this doesn't seem to be working, as it appears that non-reservation jobs can use the reserved cores on my UV, and reservation jobs run only when utilization on the UV drops.

Right now 252 cores are utilized, but a 1 core reservation job does not run.
Comment 7 Nicholas McCollum 2016-10-05 12:41:01 MDT
Going to raise this to high impact, as this is currently stopping class jobs from running.
Comment 8 Danny Auble 2016-10-05 13:26:13 MDT
Nicholas, what you are doing isn't supported.  At least, there is no code to do what you are trying to do; it just isn't erroring out the way it should.

You should be able to create a reservation the way you specified, but updating isn't currently supported.

As you noted in comment 6, creating a new reservation should make the reservation correctly.  I find it interesting that the reservation isn't reserving cores on the UV, though.  As you noted in comment 0 (the description), it looks like you are creating the reservation with the IGNORE_JOBS flag.  From the scontrol man page:

IGNORE_JOBS
    Ignore currently running jobs when creating the reservation. This can be especially useful when reserving all nodes in the system for maintenance. 

This would most likely explain the behavior.  If this is the case, I would say the reservation is doing exactly what it was asked to do.  I am guessing that if you removed the IGNORE_JOBS flag, you would get a reservation where no other jobs are running.

If these jobs are starting after the reservation has been created then there would appear to be a bug.  Is this what you are seeing?
Comment 9 Nicholas McCollum 2016-10-05 14:00:56 MDT
Correct.  When I re-created the reservation, there were 252/256 cores utilized.  A pre-existing 16 core job completed and it picked up a 2 core job that was in the reservation.  After that job ran it picked up a 16 core non-reservation job to go back to 252 cores utilized.

It is down to 244 cores now.

$ scontrol show node uv1
NodeName=uv1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=244 CPUErr=0 CPUTot=256 CPULoad=233.42 Features=uv,sandy-bridge,avx,nogpu
   Gres=(null)
   NodeAddr=uv1 NodeHostName=uv1 Version=15.08
   OS=Linux RealMemory=4006552 AllocMem=247996 FreeMem=3979216 Sockets=256 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=44810240 Weight=1 Owner=N/A
   BootTime=2016-07-08T12:50:51 SlurmdStartTime=2016-07-08T16:07:53
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ cat fork-test.sh
#!/bin/bash
/usr/bin/srun /home/asnnam/bin/calc-pi.sh
#/usr/bin/srun /home/asnnam/bin/my_pi
0 asnnam dmc [~/bin]
$ sbatch -p uv --qos=class --time=20 --mem=2G -n5 fork-test.sh
Submitted batch job 53225

$ scontrol show job 53225
JobId=53225 JobName=fork-test.sh
   UserId=asnnam(2573) GroupId=analyst(10000)
   Priority=17003 Nice=-1000 Account=class QOS=class
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2016-10-05T14:57:42 EligibleTime=2016-10-05T14:57:42
   StartTime=2016-10-15T04:17:11 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=uv AllocNode:Sid=dmc:25258
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=2048,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=class
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/homeapps/home/asnnam/bin/fork-test.sh
   WorkDir=/mnt/homeapps/home/asnnam/bin
   StdErr=/mnt/homeapps/home/asnnam/bin/slurm-53225.out
   StdIn=/dev/null
   StdOut=/mnt/homeapps/home/asnnam/bin/slurm-53225.out
   Power= SICP=0
0 asnnam dmc [~/bin]
$ sbatch -p uv --qos=medium --time=20 --mem=2G -n5 fork-test.sh
Submitted batch job 53226
0 asnnam dmc [~/bin]
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
53225        fork-test+         uv      class          5    PENDING      0:0
53226        fork-test+         uv      users          5    PENDING      0:0
0 asnnam dmc [~/bin]
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
53225        fork-test+         uv      class          5    PENDING      0:0
53226        fork-test+         uv      users          5    RUNNING      0:0

0 asnnam dmc [~/bin]
$ scontrol show job 53226
JobId=53226 JobName=fork-test.sh
   UserId=asnnam(2573) GroupId=analyst(10000)
   Priority=2121 Nice=-1000 Account=users QOS=medium
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:09 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2016-10-05T15:00:14 EligibleTime=2016-10-05T15:00:14
   StartTime=2016-10-05T15:00:17 EndTime=2016-10-05T15:20:17
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=uv AllocNode:Sid=dmc:25258
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=uv1
   BatchHost=uv1
   NumNodes=1 NumCPUs=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=2048,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/homeapps/home/asnnam/bin/fork-test.sh
   WorkDir=/mnt/homeapps/home/asnnam/bin
   StdErr=/mnt/homeapps/home/asnnam/bin/slurm-53226.out
   StdIn=/dev/null
   StdOut=/mnt/homeapps/home/asnnam/bin/slurm-53226.out
   Power= SICP=0
Comment 10 Danny Auble 2016-10-05 15:21:10 MDT
Hum, that is interesting.  So non-reservation jobs are starting after the reservation was put into place.

I will look closer at the code, but at first glance it appears we are reserving specific cores on each node and so those specific cores would have to be freed for the jobs to run in the reservation.  I don't believe it is just a count of cores but the actual cores.  This would explain what you are seeing.  I don't think there is an easy way around this outside of not using the IGNORE_JOBS flag.  Not using that flag will guarantee correct behavior.
Comment 11 Nicholas McCollum 2016-10-05 15:32:51 MDT
This is what I figured as well.

Is there a handy way to determine which cores are reserved by a reservation on a node?
Comment 12 Danny Auble 2016-10-05 15:43:35 MDT
It doesn't appear there is.  But since you are ignoring jobs it appears it will just pick the first cores on the node.  I am not totally positive that is what happens, but it makes sense this would be the case.
Comment 13 Nicholas McCollum 2016-10-06 08:31:39 MDT
I feel that the IGNORE_JOBS flag should be an informational warning and not a constraint.  If a reservation is to be made, the user specifies exactly what they want; it's implied that no other jobs should be scheduled on those cores anyway, regardless of their current status.

0 asnnam dmc [~/bin]
$ scontrol update ReservationName=class Flags-=IGNORE_JOBS
Error updating the reservation: Requested nodes are busy
slurm_update error: Requested nodes are busy

The node is back up to 252 cores in use.  A single-core job will run, but jobs of 2 or more cores hang in a pending state.
Comment 14 Danny Auble 2016-10-06 08:50:27 MDT
The IGNORE_JOBS flag is used to make a reservation without considering jobs already running in the space.  It lets current jobs keep running there, but no future jobs outside the reservation should be scheduled on those resources.  Obviously, if the jobs run for the entire time the reservation exists, the reservation doesn't mean much.

If you want the reservation to truly be a reservation don't use the flag.  This will create a reservation in the future just as you would like.
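A sketch of that approach, assuming the same node list and per-node core counts as the existing reservation (the start time and name here are illustrative):

```shell
# Hypothetical: create the reservation without IGNORE_JOBS, so Slurm only
# reserves cores that are free of running jobs at creation time.
scontrol create reservation ReservationName=class Accounts=class \
    StartTime=now Duration=365-00:00:00 \
    Nodes=dmc[1,66-71,73,75,126,197],uv1 \
    CoreCnt=20,8,8,8,8,8,8,8,8,8,16,24
```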

The code written for CoreCnt (instead of whole-node) reservations does not allow updates to those reservations.  As discussed earlier in this ticket, there appear to be bugs in the update logic that prevent an error message from being sent back to you.

In any case, removing the IGNORE_JOBS flag from a reservation that has already been created isn't possible, since the decision about resources has already been made, and removing the flag might require preempting jobs to honor the reservation.

I wish I had better news for you on this, but it sounds like the only way to get what you want currently is to make the reservation without the IGNORE_JOBS flag.
Comment 15 Nicholas McCollum 2016-10-06 08:57:26 MDT
I think the conclusion to this should possibly be a change to how Slurm acquires cores for a reservation.  It appears that, with the IGNORE_JOBS flag, Slurm currently assigns cores to a reservation regardless of whether they are in use.  Possibly a flag such as FIRST_AVAILABLE could be created, so that cores are assigned to the reservation as they become idle, until the reservation's core count is reached.

Could we also turn this into a feature request for the following things:

A command that can show specifically what cores are allocated to a reservation.

The ability to dynamically add and remove cores on specific nodes on a reservation.
Comment 16 Danny Auble 2016-10-06 09:04:41 MDT
I think this is a good idea, Nicholas, though I am not sure of the effort involved.  I agree that the ability to view the reserved cores would be nice as well; since this is already done for jobs, perhaps similar logic could be reused here.  Currently none of this information is available through the user interface, so at minimum it would require changes to the RPCs.  The last request would likely be the most difficult, as there is currently no code at all to allow updates to non-whole-node reservations.  I expect all three are doable, but I would need to study the code more before I can give you a good estimate.

The earliest possible release for any of these would be 17.02.  Please let us know if you would like to sponsor this and we can get you a SOW set up.
Comment 17 Nicholas McCollum 2016-10-06 09:13:56 MDT
> The earliest possible release for any of these would be 17.02.  Please let
> us know if you would like to sponsor this and we can get you a SOW set up.

Not sure what you mean by sponsor or SOW.
Comment 18 Danny Auble 2016-10-06 09:18:44 MDT
The best way to get a feature request into our development queue is to sponsor it, meaning support or pay for the development.  A Statement of Work (SOW) is something we provide to you that lays out the project's requirements and deliverables.  We may be able to get to this after sponsored projects are done, but there is no guarantee.  Does this make sense?
Comment 20 Moe Jette 2017-02-27 14:39:54 MST
I've added logic to print the cores in a core-based advanced reservation in the following commits:
https://github.com/SchedMD/slurm/commit/2331b6cc1c404bb7122fbcc3349f0c295da6b5a7
https://github.com/SchedMD/slurm/commit/afb44301189898c1199c0f2433ceae6f1ed186c7
https://github.com/SchedMD/slurm/commit/ac639ddb04b5b07f2eaf35fe482cd95094acfc46

These changes will be in Slurm version 17.11 (available in November). There are RPC changes, so it can't be back-ported to earlier versions of Slurm.

The information reported now looks something like this:
$ scontrol show res
ReservationName=jette_51 StartTime=2017-02-27T14:02:38 EndTime=2017-02-27T14:12:38 Duration=00:10:00
   Nodes=tux[4-5] NodeCnt=2 CoreCnt=5 Features=(null) PartitionName=debug Flags=SPEC_NODES
     NodeName=tux4 CoreIDs=0-2
     NodeName=tux5 CoreIDs=0-1
   TRES=cpu=5
   Users=jette Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

With respect to increasing the cores in an existing reservation or using the cores that will become available soonest, that would require substantial changes to the logic. Almost none of the scheduling logic in Slurm uses sequential algorithms (e.g. evaluating the nodes and/or cores in a sequential fashion). Almost all of it makes use of bitmap operations (there are bitmaps representing allocated cores, idle cores, idle nodes, nodes in a given partition, etc.), so selecting resources for an advanced reservation largely involves a few bitmap operations. The time for those operations is largely independent of the system size, which is important for systems with tens of thousands of nodes. I don't know if or when that might happen.
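As a toy illustration of the bitmap approach (Slurm implements this in C; this shell sketch only shows the idea, and the bit values are made up), core selection reduces to a few bitwise operations on per-node core masks:

```shell
# Toy core bitmap for one 8-core node: bit i set = core i allocated.
alloc=$((2#10110000))        # cores 4, 5, and 7 are allocated
all=$((2#11111111))          # mask covering all 8 cores
idle=$(( all & ~alloc ))     # idle-core bitmap in a single bitwise op
echo "$idle"                 # 79 (binary 01001111: cores 0-3 and 6 idle)
low=$(( idle & -idle ))      # isolate the lowest-numbered idle core
echo "$low"                  # 1 (core 0)
```

The cost of these operations depends on the mask width, not on how many nodes or jobs are evaluated sequentially, which is the scaling property the comment describes.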