Summary: | [CAST-34287] Slurm cgroup issue with GPU IPC | ||
---|---|---|---|
Product: | Slurm | Reporter: | Brian F Gilmer <brian.gilmer> |
Component: | GPU | Assignee: | Ben Glines <ben.glines> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | csamuel, david.gloe, felip.moll, jean-yves.vet, simonbyrne |
Version: | 22.05.9 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=17793 | ||
Site: | CRAY | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | CSC COMPUTER SCIENCES LTD |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | LUMI | CLE Version: | |
Version Fixed: | 23.11.4 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Brian F Gilmer
2023-10-10 07:19:51 MDT
Hello Brian,

If I understood you properly, what you are asking here is that when we constrain GPU resources by task (for example with --gpus-per-task), instead of enforcing that with cgroups we only set the environment variables needed for it. If that is the case, I am afraid this was changed in Slurm 21.08 to do exactly what it does now, namely enforce GPU isolation in tasks with cgroups, so we will not be undoing that.

I think the workaround you described: "Another workaround is to let all the GPUs be visible to each task and manually define the visible devices with the ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES) environment variable." is the appropriate way to go. So change from --gpus-per-task to --gpus and set the environment variable for each task; to make this last part transparent to the user you could use TaskProlog. We will be including a note in our documentation to reflect that, in order to use GPU P2P communication, the GPUs cannot be constrained at the task level.

This is too bad. It makes things overly complicated for users. Finding the proper locality to manually set the ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES) environment variable is not straightforward (it at least requires a wrapper script to hide the complexity, and could be forgotten by the user). All HPC sites using GPUs with cgroup enforcement will likely complain at some point. It means --gpus-per-task is not an option for sites with cgroup enforcement, as it would prevent the GPUs from using P2P (intra-node) communication. If the default behavior cannot be changed, would it be possible to add another feature? For instance, you could support another flag in gres.conf such as Flags=no_task_cgroup (meaning the cgroup would be applied to all GPUs allocated to the job on the node, so that all GPUs are visible from all tasks).

What if it were an extra option to "--gpu-bind" (e.g. "srun --gpu-bind=soft,per_task:1") to use environment variables instead of cgroups for task binding?

From the site: Yes, that would work too. I believe sites could then decide whether GPU options should always be rewritten (via job_submit) to use "soft".

Hello, any update on this case?

Oriol is on PTO, so I am taking over this bug now. I believe I understand your use case, and I see why you would want some sort of option to do this "soft" binding: allow each task to see every GPU in the job, but have *_VISIBLE_DEVICES set to however you set the binding with --[tres|gpu]-binding. I'm getting some internal discussion going on this, and I'll make sure to update you ASAP, since I understand that you've been waiting a couple of months on this.

I am currently out of the office until Tuesday, January the 2nd, with limited access to email. I will respond as soon as I return. Thank you, Jean-Yves Vet, HPC Technical Consultant, Application Performance Engineer, Software & CoE Solutions

Oriol, Ben, do you have an update on this case? Thanks

Hello,

This issue will be resolved in 23.11.4 with the addition of the "allow-task-sharing" option to --gres-flags. See commit 79780501e1:
https://github.com/SchedMD/slurm/commit/79780501e1f55449b99f8f40e9a9b8a423f2b230

Example task:

> $ cat task.sh
> #!/bin/bash
>
> printf "Task $SLURM_PROCID:\n `nvidia-smi -L && env | grep CUDA` \n\n"
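Since the affected site (LUMI) runs AMD GPUs, a ROCm counterpart of the task script above could look roughly like the following sketch. It is not part of the ticket: the name task_amd.sh and the use of rocm-smi are assumptions, and the demonstrations below use the NVIDIA task.sh.

```bash
#!/bin/bash
# task_amd.sh -- hypothetical AMD analogue of task.sh above (illustration only).
# Prints the GPUs this task can actually see and the relevant visibility variables.
printf "Task %s:\n" "$SLURM_PROCID"
rocm-smi   # summary of the GPUs visible to this task (assumes rocm-smi is in PATH)
env | grep -E 'ROCR_VISIBLE_DEVICES|HIP_VISIBLE_DEVICES'
printf "\n"
```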
Job with default gres settings. Note that each task can only see the GPU it is bound to:

> $ srun --gpus-per-task=1 -n2 task.sh
> Task 0:
> GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 4g.20gb Device 0: (UUID: MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335)
> CUDA_VISIBLE_DEVICES=MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335
>
> Task 1:
> GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 3g.20gb Device 0: (UUID: MIG-121e5901-4794-5b0a-ba9b-168f4ba10447)
> CUDA_VISIBLE_DEVICES=MIG-121e5901-4794-5b0a-ba9b-168f4ba10447

Job with --gres-flags=allow-task-sharing. Note that each task can see every GPU within the job allocation that is on the same node as the task, which allows inter-GPU communication. (MIG devices do not support P2P; they are only used here for demonstration purposes.) Also note that CUDA_VISIBLE_DEVICES still accurately reflects the GPU binding (per_task:1 in this case) and only lists the device bound to the task:

> $ srun --gpus-per-task=1 -n2 --gres-flags=allow-task-sharing task.sh
> Task 0:
> GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 4g.20gb Device 0: (UUID: MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335)
>   MIG 3g.20gb Device 1: (UUID: MIG-121e5901-4794-5b0a-ba9b-168f4ba10447)
> CUDA_VISIBLE_DEVICES=MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335
>
> Task 1:
> GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 4g.20gb Device 0: (UUID: MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335)
>   MIG 3g.20gb Device 1: (UUID: MIG-121e5901-4794-5b0a-ba9b-168f4ba10447)
> CUDA_VISIBLE_DEVICES=MIG-121e5901-4794-5b0a-ba9b-168f4ba10447

Let me know if there are any further questions.

Hello Ben,

This patch looks great. Many thanks! To summarize, with 23.11.4 we could still define GRES this way for AMD GPUs in gres.conf:

NodeName=<node_names> Name=gpu Type=mixxx Flags=amd_gpu_env File=/dev/dri/renderD132 Cores=0-7

(Flags=amd_gpu_env is still advised, right?) Then we could use a job_submit plugin to make --gres-flags=allow-task-sharing the default behaviour with a partition. Is that right?

Is this only an issue when ConstrainDevices=yes is set in cgroup.conf? I'm investigating another bug that might be related to this problem.

Sorry for the late replies here. I presume you've already figured these things out, but I'll still respond for the sake of verifying things. I wasn't checking my "RESOLVED" tickets for responses. In the future, please re-open the ticket in case any engineers aren't actively checking their "RESOLVED" tickets.

(In reply to Jean-Yves Vet from comment #23)
> Hello Ben,
>
> This patch looks great. Many thanks!
> To summarize, with 23.11.4 we could still define GRES that way with AMD GPUs in gres.conf:
> NodeName=<node_names> Name=gpu Type=mixxx Flags=amd_gpu_env File=/dev/dri/renderD132 Cores=0-7

Yes, that looks good.

> (Flags=amd_gpu_env is still advised, right?)

Correct.

> Then we could use a "job submit plugin" to make --gres-flags=allow-task-sharing the default behaviour with a partition. Is that right?

Correct.

(In reply to David Gloe from comment #24)
> Is this only an issue when ConstrainDevices=yes is set in cgroup.conf? I'm investigating another bug that might be related to this problem.

Yes. If ConstrainDevices=no (this is the default setting), all tasks should be able to see all devices.
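For sites on releases before 23.11.4 with ConstrainDevices=yes, the TaskProlog workaround suggested earlier in this thread could be sketched roughly as follows. This is an illustration, not from the ticket: it assumes the job requests --gpus rather than --gpus-per-task (so all of the job's GPUs on the node stay visible to every task), exactly one GPU per task, and a naive SLURM_LOCALID-to-GPU mapping. A production version would need a locality-aware mapping, which is precisely the complexity the site objected to.

```bash
#!/bin/bash
# Hypothetical TaskProlog sketch (illustration only, not from the ticket).
# slurmd applies stdout lines of the form "export NAME=value" to the spawned
# task's environment, so each task ends up pinned to one GPU without a
# per-task cgroup constraint.
echo "export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID}"
# On NVIDIA nodes the equivalent would be:
# echo "export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}"
```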
I am currently out of the office until Monday, July the 22nd, with limited access to email. I will respond as soon as I return.

Best regards,
Jean-Yves Vet
HPC Technical Consultant, Application Performance Engineer
Software & CoE Solutions