| Summary: | throw an error message when the value provided for --gpu-bind=map_gpu: exceeds the number of available devices | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | hpc-admin |
| Component: | GPU | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cinek |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8229, https://bugs.schedmd.com/show_bug.cgi?id=7917 | | |
| Site: | Ghent | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 20.11.0 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
hpc-admin
2019-10-05 08:59:59 MDT
Hello, I intend to look at this once we fix a few other GRES and GPU-related issues.

Thanks,
Michael

---

Hello Stijn,

Assuming I have a node with four GPUs and I try to bind tasks to devices out of bounds, this is what happens:

```
$ srun -l -n4 --gres=gpu:4 --gpu-bind=map_gpu:3,2,1,4,3 printenv CUDA_VISIBLE_DEVICES
2: 1
3: 0
0: 3
1: 2
```

There are two things going on here. First, task 3 is bound to GPU 0, since GPU 4 doesn't exist, instead of exiting with an error. Second, there is no fifth task to bind, so the extra map entry should also produce an error.

What you are asking for is that both of these cases should emit an error and cause the job submission to fail. Does that sum up the issue, or am I missing something?

Thanks,
Michael

---

hi michael,

yes, i believe an srun failure for both cases is better than silently ignoring. at the very least some warning or error message.

stijn

---

(In reply to hpc-admin from comment #4)
> yes, i believe an srun failure for both cases is better than silently
> ignoring. at the very least some warning or error message.

Ok, great. I think we should be able to add some client-side validation before the job is submitted, at least for the obvious cases. If we can't catch all corner cases, then I agree that some kind of error printed to the user, e.g. "GPU 4 does not exist in the allocation; binding to GPU 0 instead", would be good.

---

Hi Stijn,

I've got a patch pending internal review that prints an error when the fallback binding occurs. Here's an example (assuming I have 4 GPUs on a node):

```
$ srun -l -n4 --gres=gpu:4 --gpu-bind=mask_gpu:2,0xFF,0x1c0,0xB0,3 printenv CUDA_VISIBLE_DEVICES
2: slurmstepd-test1: error: Bind request 6-8 (0x1C0) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
3: slurmstepd-test1: error: Bind request 4-5,7 (0xB0) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
1: 0,1,2,3
2: 0
3: 0
0: 1
```

You can see that tasks 2 and 3 were not initially bound to any valid GPUs, so the fallback went to GPU 0 and errors were printed. However, I'm not sure I really care that:

a) the fifth bind mask (0x3) does nothing, since there are only four tasks, and so it gets discarded; and
b) the second bind mask for task 1 (0xFF) exceeds the number of GPUs on the node, but since it still overlaps allocated GPUs, there is no fallback.

Issues a) and b) seem like a lot of effort for marginal gain. It's still possible, but we'd need to prioritize it accordingly. I think printing an error whenever there is a binding fallback is the most important thing here, since that behavior was silent and can be surprising.

If you want, I can make the patch available to you to try out while we wait for the review process.

---

hi michael,

i agree that printing an error is the minimum action to take (i'd prefer failure, because if fallback is acceptable, gpu-bind looks like a best effort rather than forced control).

i don't need the patch (nor will i have much time to do anything with it in the coming weeks).

stijn

---

Hi Stijn,

This has been fixed and will appear in 20.11. See commit https://github.com/SchedMD/slurm/commit/eced64c9743f6c3df1e355f43c0914b915b699f1.

Thanks!
-Michael

---

*** Ticket 7917 has been marked as a duplicate of this ticket. ***
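The fallback behavior discussed in this thread can be modeled with a short sketch. This is not Slurm's actual implementation, just one plausible model (map entries taken modulo the task count, out-of-range GPU ids wrapping modulo the GPU count, and empty mask intersections falling back to the first device) that reproduces the outputs shown above; the function names are hypothetical.

```python
def bind_map_gpu(gpu_map, ntasks, ngpus):
    """Model --gpu-bind=map_gpu:<list> on a node with ngpus GPUs.

    Extra map entries beyond ntasks are discarded; an out-of-range GPU id
    wraps around (the silent fallback this ticket asks to turn into an error).
    """
    return [gpu_map[i % len(gpu_map)] % ngpus for i in range(ntasks)]

def bind_mask_gpu(masks, ntasks, ngpus):
    """Model --gpu-bind=mask_gpu:<list>, returning each task's GPU list.

    A mask with no bits inside the allocation falls back to the first
    device, which the fix now reports with a slurmstepd error message.
    """
    alloc = (1 << ngpus) - 1              # bitmask of allocated GPUs, e.g. 0b1111
    bound = []
    for i in range(ntasks):
        visible = masks[i % len(masks)] & alloc
        if visible == 0:                  # e.g. 0x1C0 & 0xF == 0 -> fallback
            visible = 1                   # first device in the allocation
        bound.append([g for g in range(ngpus) if visible >> g & 1])
    return bound

# Reproduce the ticket's two examples on a 4-GPU node:
print(bind_map_gpu([3, 2, 1, 4, 3], ntasks=4, ngpus=4))
# -> [3, 2, 1, 0]   (task 3 wraps from nonexistent GPU 4 to GPU 0)
print(bind_mask_gpu([0x2, 0xFF, 0x1C0, 0xB0, 0x3], ntasks=4, ngpus=4))
# -> [[1], [0, 1, 2, 3], [0], [0]]   (tasks 2 and 3 fall back to GPU 0)
```

Note that under this model the fifth entry in each list is simply never read, matching Michael's observation that it is silently discarded.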