Ticket 6646

Summary: SLURM_JOB_GPUS not available on epilog
Product: Slurm
Reporter: Bruno Mundim <bmundim>
Component: Configuration
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
QA Contact: Moe Jette <jette>
Severity: 5 - Enhancement
Priority: ---
CC: support
Version: - Unsupported Older Versions
Hardware: Other
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8587
          https://bugs.schedmd.com/show_bug.cgi?id=8596
Site: SciNet
Linux Distro: Ubuntu
Machine Name: sgc
Version Fixed: 19.05.0-pre4
Target Release: 19.05

Description Bruno Mundim 2019-03-06 08:21:40 MST
I was wondering why the SLURM_JOB_GPUS variable is available in the prolog and during the job, but not in the epilog. We would like to get the GPU IDs for a job in the epilog script to generate monitoring results based on which GPUs the job used.

Thanks,
Bruno.
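
For context, a minimal sketch of the kind of epilog monitoring hook being requested here, assuming SLURM_JOB_GPUS were exported to the epilog; the log path and format are illustrative, not from the ticket:

#!/usr/bin/env python3
# Minimal sketch of an epilog monitoring hook: read the GPU IDs Slurm
# exports for the job and append them to a usage log.
import os
import time

job_id = os.environ.get("SLURM_JOB_ID", "unknown")
# Comma-separated GPU indices, e.g. "0,2"; empty when Slurm does not set
# the variable, which is exactly the gap reported here for the epilog.
gpus = os.environ.get("SLURM_JOB_GPUS", "")

with open("/var/log/slurm/gpu_usage.log", "a") as log:
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    log.write(f"{stamp} job={job_id} gpus={gpus or 'unset'}\n")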
Comment 1 Moe Jette 2019-03-06 09:47:44 MST
I've just added this logic to Slurm version 19.05 for gres/mps; doing the same for gres/gpu should be straightforward now. The commit with that change is here (mostly for my own reference):
https://github.com/SchedMD/slurm/commit/e23e7fe0f136a43b8e35cc5df9f3f757ebce73d7
Comment 3 Bruno Mundim 2019-03-06 13:05:10 MST
Thank you very much!
Comment 10 Michael Hinton 2019-03-07 13:03:11 MST
Hi Bruno,

We're looking into this now and will get back to you when we have a patch ready.

Thanks,
Michael
Comment 14 Moe Jette 2019-03-11 10:51:46 MDT
Rather than SLURM_JOB_GPUS, I'm going to set CUDA_VISIBLE_DEVICES in the Prolog and Epilog. If gres/mps is configured and used, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE will be set in addition to CUDA_VISIBLE_DEVICES.

Note that today SLURM_JOB_GPUS is set only in the Prolog and only for batch jobs; it is not set for salloc jobs or for jobs created using srun.

These changes will be in Slurm version 19.05, to be released in May.
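
For illustration, a minimal sketch of an epilog consuming these variables on 19.05 or later; the report() helper is a hypothetical placeholder for a site monitoring pipeline:

#!/usr/bin/env python3
# Sketch of an epilog reading the GPU variables Slurm 19.05 sets.
import os

def report(job_id, gpu_ids, mps_pct):
    # Hypothetical hook: replace with the site's monitoring backend.
    print(f"job {job_id}: gpus={','.join(gpu_ids) or 'none'} mps_thread_pct={mps_pct or 'n/a'}")

job_id = os.environ.get("SLURM_JOB_ID", "unknown")
# Set by Slurm 19.05+ in the Prolog and Epilog for jobs allocated GPUs.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
gpu_ids = visible.split(",") if visible else []
# Additionally set when gres/mps is configured and used.
mps_pct = os.environ.get("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE")

report(job_id, gpu_ids, mps_pct)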

The commit with the changes (mostly for internal tracking purposes) is here:
https://github.com/SchedMD/slurm/commit/deaf17ab0db3ba49dd2048cb32aabb82e0cc6412
It is not suitable for back-porting to earlier versions of Slurm, as it relies on changes to data structures and RPCs specific to version 19.05.

I'm closing this ticket now, but let me know if you have any questions.
Comment 15 Bruno Mundim 2019-03-11 13:55:31 MDT
Thanks!
Comment 16 Moe Jette 2019-04-01 14:33:47 MDT
Marking this ticket as closed. The functionality will be in Slurm version 19.05 when released.