| Summary: | SLURM_JOB_GPUS not available on epilog | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bruno Mundim <bmundim> |
| Component: | Configuration | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | Moe Jette <jette> |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | support |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Other | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8587 https://bugs.schedmd.com/show_bug.cgi?id=8596 | ||
| Site: | SciNet | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Ubuntu |
| Machine Name: | sgc | CLE Version: | |
| Version Fixed: | 19.05.0-pre4 | Target Release: | 19.05 |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Bruno Mundim
2019-03-06 08:21:40 MST
I've just added the logic to Slurm version 19.05 for gres/mps; doing the same for gres/gpu should be easy to do now. The commit with that change is here (mostly for my own reference): https://github.com/SchedMD/slurm/commit/e23e7fe0f136a43b8e35cc5df9f3f757ebce73d7

Thank you very much!

Hi Bruno, We're looking into this now and will get back to you when we have a patch ready. Thanks, Michael

Rather than SLURM_JOB_GPUS, I'm going to set CUDA_VISIBLE_DEVICES in the Prolog and Epilog. If gres/mps is configured and used, CUDA_VISIBLE_DEVICES plus CUDA_MPS_ACTIVE_THREAD_PERCENTAGE will be set. Note that today SLURM_JOB_GPUS is set only in the Prolog and only for batch jobs. It is not set for salloc or for jobs created using srun. These changes will be in Slurm version 19.05, to be released in May. The commit with the changes (mostly for internal tracking purposes) is here: https://github.com/SchedMD/slurm/commit/deaf17ab0db3ba49dd2048cb32aabb82e0cc6412 It is not suitable for back-porting to earlier versions of Slurm, as it takes advantage of many new changes in data structures and RPCs specific to version 19.05. I'm closing this ticket now, but let me know if you have any questions. Thanks!

Marking this ticket as closed. The functionality will be in Slurm version 19.05 when released. |
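
For anyone adapting an Epilog script to the behaviour described above, here is a minimal sketch of how a script might pick up the job's GPUs. It is an illustration under assumptions, not code from the Slurm source: the helper name `job_gpu_ids` is hypothetical, and it assumes both variables hold a comma-separated list of device identifiers. It prefers CUDA_VISIBLE_DEVICES (set in the Prolog and Epilog from 19.05, per the comments above) and falls back to SLURM_JOB_GPUS for older setups.

```python
import os


def job_gpu_ids(env=None):
    """Return the GPU identifiers listed for the job, or [] if none are set.

    Hypothetical helper: assumes the variables contain a comma-separated list.
    """
    env = os.environ if env is None else env
    # Prefer CUDA_VISIBLE_DEVICES; fall back to SLURM_JOB_GPUS, which older
    # versions set only in the Prolog and only for batch jobs.
    raw = env.get("CUDA_VISIBLE_DEVICES") or env.get("SLURM_JOB_GPUS") or ""
    # Keep identifiers as strings, since the exact format is an assumption here.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


if __name__ == "__main__":
    gpus = job_gpu_ids()
    print("GPUs for job:", ",".join(gpus) if gpus else "none")
```

Passing a plain dictionary instead of the real environment makes the helper easy to check outside of Slurm before wiring it into an actual Epilog.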