Ticket 1092

Summary: Requested node configuration is not available with --ntasks-per-node and --gres
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: da
Version: 14.03.7
Hardware: Linux
OS: Linux
Site: Stanford
Version Fixed: 14.03.8, 14.11.0-pre5
Attachments: make gres.conf CPUs advisory

Description Kilian Cavalotti 2014-09-09 07:11:41 MDT
Hi,

I have a weird issue when submitting jobs with --gres and --ntasks-per-node, but I'm not sure if that's a configuration issue or something I'm missing.

I have GPU nodes, with the following configuration:

$ scontrol show partition gpu
PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=gpu,system
   AllocNodes=ALL Default=NO
   DefaultTime=02:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=gpu-9-[1-5]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=80 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerCPU=16000 MaxMemPerCPU=16384

$ scontrol show node gpu-9-1
NodeName=gpu-9-1 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=N/A Features=k20x
   Gres=gpu:8
   NodeAddr=gpu-9-1 NodeHostName=gpu-9-1 Version=(null)
   OS=Linux RealMemory=258000 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1000
   BootTime=2014-09-02T09:18:29 SlurmdStartTime=2014-09-02T13:50:02
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Their gres.conf file is as follows:

Name=gpu File=/dev/nvidia[0-3] CPUs=[0-7]
Name=gpu File=/dev/nvidia[4-7] CPUs=[8-15]

So they have 16 CPU cores and 8 GPUs each. Yet I can't seem to run more than 8 tasks per node when requesting a gres. Without a gres, I can request up to 16 tasks, which is expected.

$ salloc --ntasks-per-node=16 -p gpu --qos=gpu
salloc: Granted job allocation 322052
$ exit
salloc: Relinquishing job allocation 322052

$ salloc --ntasks-per-node=9 --gres=gpu:1 -p gpu --qos=gpu
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 322053 has been revoked.

$ salloc --ntasks-per-node=8 --gres=gpu:1 -p gpu --qos=gpu
salloc: Granted job allocation 322054
$ exit
salloc: Relinquishing job allocation 322054


Could you please help me figure out what I'm doing wrong?

Thanks!
Comment 1 David Bigagli 2014-09-09 07:27:41 MDT
Hi,
   if you request 2 GPUs and --ntasks-per-node=9, does it work?
I am going to try your case, but we think the code is enforcing the
GPU allocation in blocks of 8 CPUs, as configured.

David
Comment 2 Kilian Cavalotti 2014-09-09 07:29:16 MDT
Hi David,

(In reply to David Bigagli from comment #1)
>    if you request 2 gpus and --ntasks-per-node=9 does it work?

Nope:
$  salloc --ntasks-per-node=9 --gres=gpu:2 -p gpu --qos=gpu
salloc: error: Job submit/allocate failed: Requested node configuration is not available
Comment 3 David Bigagli 2014-09-09 08:01:55 MDT
Hi Kilian,
         I can reproduce this problem as well. Someone from our scheduling team will get back to you on this.

David
Comment 4 Kilian Cavalotti 2014-09-09 09:16:21 MDT
(In reply to David Bigagli from comment #3)
> Hi Kilian,
>          I can reproduce this problem as well. Someone from our scheduling
> team will get back to you on this.

Thanks!
Comment 5 Moe Jette 2014-09-09 09:37:44 MDT
I can give you a workaround for right now. If you specify the option --gres=gpu:5, the allocation logic seems to spill over into the second batch of CPUs, so that all CPUs are accessible.
Comment 6 Kilian Cavalotti 2014-09-09 09:58:58 MDT
Hi Moe, 

> I can give you a work-around for right now. If you specify the option
> --gres=gpu:5 then it the allocation logic seems to spill over into the
> second batch of CPUs so that all CPUs are accessible.

Right, that works, thanks.
I was using --exclusive before, but just realized that it still sets SLURM_TASKS_PER_NODE to 8.
Comment 7 Moe Jette 2014-09-10 06:51:35 MDT
This will be fixed in version 14.03.8 when released. The commit (in case you want to work with a patch) is here:
https://github.com/SchedMD/slurm/commit/0ec4d6b76b568ce7703c1c42c2cb51d1bddde7f8

Since you want more than 8 CPUs (which cannot all be associated with any single GPU), you will have to specify at least two GPUs in the job request, i.e. "--gres=gpu:2" (which is better than the 5 you need today).

Here is a sample of what I see for CUDA_VISIBLE_DEVICES with your configuration:
$ srun  -n16 --gres=gpu:2 tmp2
CUDA_VISIBLE_DEVICES=0,4
...
CUDA_VISIBLE_DEVICES=0,4

Note that until version 14.11 you must spell out each file name. While this form will work in v14.11:
Name=gpu File=/dev/nvidia[0-3] CPUs=[0-7]
Name=gpu File=/dev/nvidia[4-7] CPUs=[8-15]

For now you should change gres.conf to this:
Name=gpu File=/dev/nvidia0 CPUs=[0-7]
Name=gpu File=/dev/nvidia1 CPUs=[0-7]
Name=gpu File=/dev/nvidia2 CPUs=[0-7]
Name=gpu File=/dev/nvidia3 CPUs=[0-7]
Name=gpu File=/dev/nvidia4 CPUs=[8-15]
Comment 8 Moe Jette 2014-09-10 07:07:10 MDT
Created attachment 1221 [details]
make gres.conf CPUs advisory

If you want the GRES CPU specifications to be advisory, so that you can allocate one GPU with more than 8 CPUs, this patch (for your local use, on top of the commit already made) will do that. As indicated previously, with the code that will be in the general release you can specify 2 GPUs instead.
Comment 9 Moe Jette 2014-09-10 07:07:35 MDT
Fixed in v14.03.8
Comment 10 Kilian Cavalotti 2014-09-10 07:51:47 MDT
Hi Moe, 

(In reply to Moe Jette from comment #7)
> This will be fixed in version 14.03.8 when released. The commit (in case you
> want to work with a patch) is here:
> https://github.com/SchedMD/slurm/commit/
> 0ec4d6b76b568ce7703c1c42c2cb51d1bddde7f8

Thanks for the fix!

> Since you want more than 8 CPUs (which are not associated with any single
> GPU), you will have specify at least two GPUs in the job request, i.e.
> "--gres=gpu:2" (which is better than the 5 you need today).

Ok. Could you please give me more details about why requesting 2 GPUs is needed to allocate all 16 CPUs?
I've seen your next patch, but I'm not sure I understand the logic behind the default behavior (needing 2 GPUs for more than 8 CPUs). Wouldn't it make sense to allow requests for independent numbers of CPUs and GPUs, and to use the device/CPU mapping as a 'hint' rather than a hard requirement? Like the topology plugin does: optimize job placement, but if an optimal placement is not achievable, run the job anyway.

Maybe I'm not making sense, I have a very limited understanding of the things at stake here.

> Note that until version 14.11, you must spell out each file name, so while
> this will work in v14.11:
> Name=gpu File=/dev/nvidia[0-3] CPUs=[0-7]
> Name=gpu File=/dev/nvidia[4-7] CPUs=[8-15]
> 
> For now you should change gres.conf to this:
> Name=gpu File=/dev/nvidia0 CPUs=[0-7]
> Name=gpu File=/dev/nvidia1 CPUs=[0-7]
> Name=gpu File=/dev/nvidia2 CPUs=[0-7]
> Name=gpu File=/dev/nvidia3 CPUs=[0-7]
> Name=gpu File=/dev/nvidia4 CPUs=[8-15]


Oh ok, I thought this was fixed in 14.03.5, as per #905.
We've been running that way for a while, and it seemed fine. With DebugFlags=gres, I have the following in slurmd.log:

[2014-09-10T12:22:13.014] Gres Name=gpu Count=4 ID=7696487 File=/dev/nvidia[0-3] CPUs=[0-7] CpuCnt=16
[2014-09-10T12:22:13.014] Gres Name=gpu Count=4 ID=7696487 File=/dev/nvidia[4-7] CPUs=[8-15] CpuCnt=16

The "Count=4" part makes me think it got it right; am I wrong?

Thanks!
Comment 11 Moe Jette 2014-09-10 08:02:16 MDT
(In reply to Kilian Cavalotti from comment #10)
> > Since you want more than 8 CPUs (which are not associated with any single
> > GPU), you will have specify at least two GPUs in the job request, i.e.
> > "--gres=gpu:2" (which is better than the 5 you need today).
> 
> Ok. Could you please give me more details about the need to request 2 GPUs
> to be able to allocate all the 16 CPUs? 
> I've seen your next patch, but I'm not sure I understand the logic behind
> the default behavior (needing 2 GPUs for more than 8 CPUs). Wouldn't it make
> some sense to allow requests of independent numbers of CPUs and GPUs, and
> just use the device/CPU mapping as a 'hint' rather than a strong
> requirement?  Like it's done for the topology plugin, to optimize job
> placement, but in case an optimal setting is not achievable, run the job
> anyway.

Right now the CPUs specification in gres.conf is not advisory. A line like this:
Name=gpu File=/dev/nvidia0 CPUs=[0-7]
means that GPU 0 can only be used with CPUs 0-7. The patch that I attached makes it advisory rather than mandatory. I definitely don't want to change this in version 14.03, and I have mixed feelings about changing it in v14.11.
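
The effect of that mandatory binding can be sketched with a small model (a hypothetical illustration only, not Slurm source): a job's usable CPUs are the union of the CPU sets tied to its allocated GPUs, which is why a single GPU caps a job at 8 CPUs on these nodes.

```python
# Hypothetical model of the mandatory gres.conf CPU binding (not Slurm code).
# On these nodes, gres.conf ties GPUs 0-3 to CPUs 0-7 and GPUs 4-7 to CPUs 8-15.
GPU_CPUS = {g: set(range(0, 8)) for g in range(0, 4)}
GPU_CPUS.update({g: set(range(8, 16)) for g in range(4, 8)})

def usable_cpus(allocated_gpus):
    """CPUs a job may use: the union of the CPU sets of its allocated GPUs."""
    cpus = set()
    for g in allocated_gpus:
        cpus |= GPU_CPUS[g]
    return cpus

print(len(usable_cpus([0])))     # 8  -> gpu:1 with --ntasks-per-node=9 fails
print(len(usable_cpus([0, 4])))  # 16 -> gpu:2, one per block, reaches all CPUs
```

Under this model, any number of GPUs drawn from a single block still yields only 8 usable CPUs; only an allocation spanning both blocks reaches all 16.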


> > Note that until version 14.11, you must spell out each file name, so while
> > this will work in v14.11:
> > Name=gpu File=/dev/nvidia[0-3] CPUs=[0-7]
> > Name=gpu File=/dev/nvidia[4-7] CPUs=[8-15]
> > 
> > For now you should change gres.conf to this:
> > Name=gpu File=/dev/nvidia0 CPUs=[0-7]
> > Name=gpu File=/dev/nvidia1 CPUs=[0-7]
> > Name=gpu File=/dev/nvidia2 CPUs=[0-7]
> > Name=gpu File=/dev/nvidia3 CPUs=[0-7]
> > Name=gpu File=/dev/nvidia4 CPUs=[8-15]
> 
> Oh ok, I thought this was fixed in 14.03.5 as per #905.
> Because we've been running that way for a while, and it seemed fine. With
> DebugFlags=gres, I have the following in slurmd.log:
> 
> [2014-09-10T12:22:13.014] Gres Name=gpu Count=4 ID=7696487
> File=/dev/nvidia[0-3] CPUs=[0-7] CpuCnt=16
> [2014-09-10T12:22:13.014] Gres Name=gpu Count=4 ID=7696487
> File=/dev/nvidia[4-7] CPUs=[8-15] CpuCnt=16
> 
> The "Count=4" part makes me feel it got it right, am I wrong?

I see some problems in the logic that maps GPUs to CPUs when the files are not on separate lines. Again, that will be fixed in v14.11. For now, I would recommend splitting them into separate lines.
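
As a convenience for doing that split, a throwaway helper along these lines could expand a bracketed File= range into one line per device (this is not part of Slurm; it assumes a single numeric [a-b] range in the File= field):

```python
import re

def expand_gres_line(line):
    """Expand a gres.conf line with File=/dev/nvidia[a-b] into one line
    per device. Throwaway admin helper, not part of Slurm; assumes a
    single numeric [a-b] range in the File= field."""
    m = re.search(r'File=(\S*?)\[(\d+)-(\d+)\]', line)
    if not m:
        return [line]  # no range found: pass the line through unchanged
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [re.sub(r'File=\S+', 'File=%s%d' % (prefix, i), line)
            for i in range(lo, hi + 1)]

for out in expand_gres_line("Name=gpu File=/dev/nvidia[0-3] CPUs=[0-7]"):
    print(out)
# Name=gpu File=/dev/nvidia0 CPUs=[0-7]
# Name=gpu File=/dev/nvidia1 CPUs=[0-7]
# Name=gpu File=/dev/nvidia2 CPUs=[0-7]
# Name=gpu File=/dev/nvidia3 CPUs=[0-7]
```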
Comment 12 Kilian Cavalotti 2014-09-10 08:07:09 MDT
(In reply to Moe Jette from comment #11)
> Right now the CPUs specification in gres.conf is not advisory. A line like
> this:
> Name=gpu File=/dev/nvidia0 CPUs=[0-7]
> means that GPU 0 can only be used with CPUs 0-7. The patch that I attached
> make it advisory rather than mandatory. I definitely don't want to change
> this in version 14.03 and I have mixed feelings about changing it in v14.11.

Ok, I see, it makes sense.

> I see some problems in the logic managing the mapping of GPUs to CPUs
> without the files on separate lines. Again, that will be fixed in v14.11.
> For now, I would recommend splitting it into separate lines.

Thanks for the clarification, I'll go split that file up.

Thanks again for the explanation!
Comment 13 Kilian Cavalotti 2014-09-11 09:45:27 MDT
(In reply to Kilian Cavalotti from comment #12)
> (In reply to Moe Jette from comment #11)
> > Right now the CPUs specification in gres.conf is not advisory. A line like
> > this:
> > Name=gpu File=/dev/nvidia0 CPUs=[0-7]
> > means that GPU 0 can only be used with CPUs 0-7. The patch that I attached
> > make it advisory rather than mandatory. I definitely don't want to change
> > this in version 14.03 and I have mixed feelings about changing it in v14.11.
> 
> Ok, I see, it makes sense.

Oh and I forgot to ask: would it be worth mentioning this in the gres.conf documentation? 

Thanks!
Comment 14 Moe Jette 2014-09-11 10:05:51 MDT
(In reply to Kilian Cavalotti from comment #13)
> (In reply to Kilian Cavalotti from comment #12)
> > (In reply to Moe Jette from comment #11)
> > > Right now the CPUs specification in gres.conf is not advisory. A line like
> > > this:
> > > Name=gpu File=/dev/nvidia0 CPUs=[0-7]
> > > means that GPU 0 can only be used with CPUs 0-7. The patch that I attached
> > > make it advisory rather than mandatory. I definitely don't want to change
> > > this in version 14.03 and I have mixed feelings about changing it in v14.11.
> > 
> > Ok, I see, it makes sense.
> 
> Oh and I forgot to ask: would it be worth mentioning this in the gres.conf
> documentation? 
> 
> Thanks!


I just added info to the gres web page and gres.conf man page.
Comment 15 Kilian Cavalotti 2014-09-11 10:17:13 MDT
> I just added info to the gres web page and gres.conf man page.

Excellent, thank you Moe!