Ticket 7880

Summary: throw an error message when the value provided for --gpu-bind=map_gpu: exceeds the number of available devices
Product: Slurm Reporter: hpc-admin
Component: GPU Assignee: Director of Support <support>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8229
https://bugs.schedmd.com/show_bug.cgi?id=7917
Site: Ghent Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 20.11.0 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description hpc-admin 2019-10-05 08:59:59 MDT
This is a follow-up to https://bugs.schedmd.com/show_bug.cgi?id=7726#c51
Comment 2 Michael Hinton 2019-10-22 10:29:18 MDT
Hello,

I intend to look at this once we fix a few other GRES and GPU-related issues.

Thanks,
Michael
Comment 3 Michael Hinton 2019-12-11 13:47:56 MST
Hello Stijn,

Assuming I have a node with four GPUs and I try to bind tasks to devices out-of-bounds, this is what happens:

$ srun -l -n4 --gres=gpu:4 --gpu-bind=map_gpu:3,2,1,4,3 printenv CUDA_VISIBLE_DEVICES
2: 1
3: 0
0: 3
1: 2
 
There are two things going on here: The first is that task 3 is bound to GPU 0, since GPU 4 doesn’t exist, instead of exiting with an error. The second is that there is no fifth task to bind to, so there should instead be an error. 

What you are asking for is that both of these things should emit an error and cause the job submission to fail. Does that sum up the issue, or am I missing something?

Thanks,
Michael
Comment 4 hpc-admin 2019-12-11 13:56:44 MST
hi michael,

yes, i believe an srun failure for both cases is better than silently ignoring. at the very least some warning or error message.

stijn
Comment 5 Michael Hinton 2019-12-11 14:03:05 MST
(In reply to hpc-admin from comment #4)
> yes, i believe an srun failure for both cases is better than silently
> ignoring. at the very least some warning or error message.
Ok, great. I think we should be able to add some client-side validation before the job is submitted, at least for the obvious cases. If we can't catch all corner cases, then I agree that printing an error to the user, e.g. "GPU 4 does not exist in the allocation; binding to GPU 0 instead", would be good.
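
For illustration only (this is not the Slurm patch): a minimal Python sketch of the kind of client-side check discussed above, applied to the map_gpu list from comment 3. The function name `validate_map_gpu` and its parameters are hypothetical, and the sketch assumes GPU IDs on the node run from 0 to num_gpus - 1.

```python
def validate_map_gpu(gpu_ids, num_gpus, num_tasks):
    """Return a list of problems found in a map_gpu list (empty if none).

    Hypothetical helper; assumes valid GPU IDs are 0 .. num_gpus - 1.
    """
    problems = []

    # Case 1: an entry names a GPU that does not exist on the node.
    for position, gpu in enumerate(gpu_ids):
        if not 0 <= gpu < num_gpus:
            problems.append(
                f"entry {position}: GPU {gpu} does not exist "
                f"(valid IDs are 0-{num_gpus - 1})"
            )

    # Case 2: more entries than tasks, so the surplus entries are never used.
    if len(gpu_ids) > num_tasks:
        problems.append(
            f"{len(gpu_ids)} entries given for only {num_tasks} tasks"
        )

    return problems


if __name__ == "__main__":
    # The list from comment 3: --gpu-bind=map_gpu:3,2,1,4,3 with -n4 on 4 GPUs.
    for problem in validate_map_gpu([3, 2, 1, 4, 3], num_gpus=4, num_tasks=4):
        print("error:", problem)
```

With the values from comment 3, the sketch reports two problems: the entry for GPU 4, and the surplus fifth entry. A list that stays within bounds, such as 3,2,1,0, yields no problems.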
Comment 10 Michael Hinton 2019-12-13 15:46:29 MST
Hi Stijn,

I've got a patch pending internal review that prints an error when the fallback binding occurs. Here's an example (assuming I have 4 GPUs on a node):

$ srun -l -n4 --gres=gpu:4 --gpu-bind=mask_gpu:2,0xFF,0x1c0,0xB0,3 printenv CUDA_VISIBLE_DEVICES
2: slurmstepd-test1: error: Bind request 6-8 (0x1C0) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
3: slurmstepd-test1: error: Bind request 4-5,7 (0xB0) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
1: 0,1,2,3
2: 0
3: 0
0: 1

You can see that tasks 2 and 3 were not initially bound to any valid GPUs, so the fallback went to GPU 0 and errors were printed.

However, I'm not sure I really care that:

a) the fifth bind mask (0x3) does nothing, since there are only four tasks, and so that gets discarded; and

b) the second bind mask for task 1 (0xFF) exceeds the number of GPUs on the node, but since it still overlaps allocated GPUs, there is no fallback.

Issues a) and b) seem like a lot of effort for marginal gain. It's still possible, but we'd need to prioritize it accordingly. I think printing an error whenever there is a binding fallback is the most important thing here, since that was silent and can be surprising. 
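
To make the mask arithmetic in the example above easier to follow, here is a rough Python sketch of the behaviour described in this comment. It is not the slurmstepd code. The helper `bind_mask` is hypothetical, and it assumes the allocation consists of GPUs 0 to num_gpus - 1, so that "the first device in the allocation" is GPU 0.

```python
def bind_mask(mask, num_gpus):
    """Decode a GPU bind mask against an allocation of GPUs 0 .. num_gpus - 1.

    Returns (devices, fell_back). Hypothetical helper for illustration.
    """
    # Every set bit in the mask is a requested device index.
    requested = [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

    # Keep only the requested devices that exist in the allocation.
    usable = [gpu for gpu in requested if gpu < num_gpus]
    if usable:
        return usable, False

    # No overlap with the allocation: fall back to the first device.
    return [0], True


if __name__ == "__main__":
    # Masks from --gpu-bind=mask_gpu:2,0xFF,0x1c0,0xB0,3 with 4 tasks, 4 GPUs.
    masks = [0x2, 0xFF, 0x1C0, 0xB0, 0x3]
    num_tasks = 4
    for task in range(num_tasks):  # the fifth mask is never consulted
        devices, fell_back = bind_mask(masks[task], num_gpus=4)
        note = " (fallback)" if fell_back else ""
        print(f"{task}: {','.join(map(str, devices))}{note}")
```

Under these assumptions, 0x1C0 (bits 6-8) and 0xB0 (bits 4, 5 and 7) select nothing inside the allocation, so tasks 2 and 3 fall back to GPU 0. 0xFF is clipped to GPUs 0-3 without a fallback, which is case b). The fifth mask is never read because there are only four tasks, which is case a).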

If you want, I can make the patch available to you to try out while we wait for the review process.
Comment 12 hpc-admin 2019-12-14 13:29:36 MST
hi michael, 

i agree that printing an error is the minimum action to take (i'd prefer failure, because if fallback is acceptable, gpu-bind looks like a best effort rather than forced control)

i don't need the patch (nor will i have much time to do anything with it in the coming weeks)

stijn
Comment 19 Michael Hinton 2020-04-03 14:38:47 MDT
Hi Stijn,

This has been fixed and will appear in 20.11. See commit https://github.com/SchedMD/slurm/commit/eced64c9743f6c3df1e355f43c0914b915b699f1.

Thanks!
-Michael
Comment 20 Marcin Stolarek 2020-04-06 06:48:27 MDT
*** Ticket 7917 has been marked as a duplicate of this ticket. ***