Ticket 17875

Summary: [CAST-34287] Slurm cgroup issue with GPU IPC
Product: Slurm Reporter: Brian F Gilmer <brian.gilmer>
Component: GPU Assignee: Ben Glines <ben.glines>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: csamuel, david.gloe, felip.moll, jean-yves.vet, simonbyrne
Version: 22.05.9   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=17793
Site: CRAY Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: CSC COMPUTER SCIENCES LTD
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: LUMI CLE Version:
Version Fixed: 23.11.4 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Brian F Gilmer 2023-10-10 07:19:51 MDT
Problem Description:

Impact:
This issue makes applications using GPU P2P crash. It impacts all sites using GPUs (at least NVIDIA and AMD GPUs) with cgroup enforcement in Slurm.

Context:
When GPU affinity (--gpu-bind=map_gpu:4,5,2,3,6,7,0,1) or a GPU count per task (--gpus-per-task=1) is requested, Slurm places the GPUs in different cgroups, which prevents GPU P2P (aka GPU IPC) from being initiated properly.
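
For illustration, launch lines of the kind described above (a sketch: the application name and eight tasks per node are assumptions; the option values come from this report):

# Either form makes Slurm confine each task's GPU(s) in a separate device cgroup:
srun --ntasks-per-node=8 --gpu-bind=map_gpu:4,5,2,3,6,7,0,1 ./app   # explicit GPU affinity map
srun --ntasks-per-node=8 --gpus-per-task=1 ./app                    # one GPU per task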

This issue may appear intermittent because it depends on the transfer size: transfers smaller than MPICH_GPU_IPC_THRESHOLD (8192 bytes by default) do not use GPU P2P.

From the documentation:
MPICH_GPU_IPC_THRESHOLD: Intra-node GPU-GPU transfers with payloads of size greater than or equal to this value will use the IPC capability. Transfers with smaller payloads will use CPU-attached shared memory regions.
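
A rough sketch of how this plays out (the benchmark path and srun options are illustrative assumptions; the variable and its default are the ones quoted above):

# Per the documentation: payloads >= MPICH_GPU_IPC_THRESHOLD go through GPU IPC,
# smaller payloads use CPU-attached shared memory and therefore avoid the issue.
export MPICH_GPU_IPC_THRESHOLD=8192      # default value, shown explicitly
srun -n2 --gpus-per-task=1 ./osu_bw D D  # the bandwidth sweep crosses the threshold mid-run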

This is observed with an OSU test:

OSU MPI-ROCM Bandwidth Test v7.2
Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
Size Bandwidth (MB/s)
Datatype: MPI_CHAR.
1 4.32
2 8.97
4 18.55
8 37.42
16 75.58
32 146.83
64 277.03
128 525.71
256 441.81
512 810.07
GTL_DEBUG: [1] hsa_amd_ipc_memory_attach (in gtlt_hsa_ops.c at line 1544): HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.
MPICH ERROR [Rank 1] [job id 4297202.3] [Wed Aug 2 18:21:24 2023] [nid007320] - Abort(675950594) (rank 1 in comm 0): Fatal error in PMPI_Waitall: Invalid count, error stack:
PMPI_Waitall(339).........................: MPI_Waitall(count=64, req_array=0x423e80, status_array=0x41f060) failed
MPIR_Waitall(167).........................:
MPIR_Waitall_impl(51).....................:
MPID_Progress_wait(193)...................:
MPIDI_Progress_test(97)...................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(218).....:
(unknown)(): Invalid count
The Cray MPI (man MPI) documentation says:

Typically, the use of cgroups has the downside of preventing the use of GPU Peer2Peer IPC mechanisms. By default Cray MPI uses IPC for implementing intra-node,
inter-process MPI data movement operations that involve GPU-attached user buffers. When Slurm’s cgroups settings are in effect, users are advised to set
MPICH_SMP_SINGLE_COPY_MODE=NONE or MPICH_GPU_IPC_ENABLED=0 to disable the use of IPC-based implementations. Disabling IPC also has a noticeable impact on intra-node MPI
performance when GPU-attached memory regions are involved.
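
In practice, the documented mitigation amounts to exporting one of the variables above before launching (shown here only to make the quote concrete):

export MPICH_GPU_IPC_ENABLED=0            # disable GPU IPC entirely, or
export MPICH_SMP_SINGLE_COPY_MODE=NONE    # fall back to non-single-copy intra-node transfers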

As a consequence, some customers were advised to set MPICH_GPU_IPC_ENABLED=0. This is not a viable solution, as the performance penalty is very large. Another workaround is to leave all the GPUs visible to each task and manually define the visible devices with the ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES) environment variable. While this workaround works when properly applied (it requires correct GPU affinity, which can be tricky on shared nodes), it is not transparent to the end user.
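
A minimal sketch of that manual workaround (wrapper name, GPU map, and application are illustrative; it assumes one GPU per rank and that all GPUs are requested at the job level so each task's cgroup contains all of them):

#!/bin/bash
# wrapper.sh: expose a single device to each rank based on its node-local task ID.
map=(4 5 2 3 6 7 0 1)                               # local rank -> GPU index (example mapping)
export ROCR_VISIBLE_DEVICES=${map[$SLURM_LOCALID]}  # or CUDA_VISIBLE_DEVICES on NVIDIA systems
exec "$@"

Launched, for example, as: srun --gpus=8 --ntasks-per-node=8 ./wrapper.sh ./app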

Solution:
This issue is also observed by the Open MPI community (see https://github.com/open-mpi/ompi/issues/11949#issuecomment-1741849673). We would like Slurm (SchedMD) to change the way the cgroups are set. A proper way would be to set the cgroup to the GPUs allocated to the job on the node, then set ROCR_VISIBLE_DEVICES (or HIP_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES) to match the --gpus-per-task or --gpu-bind options. That way, GPU IPC would always be enabled and the end user would not have to write a wrapper handling GPU affinity (GPU/CPU affinities are already defined in the gres.conf Slurm file). The following case (https://bugs.schedmd.com/show_bug.cgi?id=17793) was already opened by the community; we should help get it prioritized by SchedMD.
Problem Analysis: Fix cgroup implementation with GPUs in Slurm
Comment 1 Oriol Vilarrubi 2023-10-12 11:16:56 MDT
Hello Brian,

If I understood you properly, what you are asking is that when we constrain GPU resources per task (e.g. with --gpus-per-task), instead of enforcing that with cgroups, we only set the environment variables needed for it.

If that is the case, I am afraid this was changed in Slurm 21.08 to do exactly what it does now, i.e. enforce GPU isolation per task using cgroups, and we will not be undoing that. I think that the workaround you described:

"Another workaround is to let all the GPUs visible to each task and manually defining the visible devices with the ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES) environment variable."

is the appropriate way to go. So change from --gpus-per-task to --gpus and set the environment variable for each task; to make this last part transparent to the user, you could use a TaskProlog.

We will be adding a note to our documentation to reflect that, in order to use GPU P2P communication, the GPUs cannot be constrained at the task level.
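
A minimal sketch of that TaskProlog approach (the script path is an assumption; it relies on TaskProlog lines printed as "export NAME=value" being added to the task's environment, and assumes SLURM_LOCALID is available there and maps 1:1 to the desired GPU):

#!/bin/bash
# /etc/slurm/task_prolog.sh, referenced from slurm.conf as TaskProlog=...
if [ -n "$SLURM_LOCALID" ]; then
    echo "export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID"   # or CUDA_VISIBLE_DEVICES
fi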
Comment 2 Brian F Gilmer 2023-10-24 09:28:22 MDT
This is too bad; it makes things overly complicated for users. Finding the proper locality to manually set the ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES) environment variable is not straightforward (at the very least it requires a wrapper script to hide the complexity, which could be forgotten by the user). All HPC sites using GPUs and cgroup enforcement will likely complain at some point. It means --gpus-per-task is not an option for sites with cgroup enforcement, as it would prevent the GPUs from using P2P (intra-node) communication.
 
If the default behavior cannot be changed, would it be possible to add another feature? For instance, you could support another flag in gres.conf such as Flags=no_task_cgroup (meaning the cgroup would be applied to all GPUs allocated to the job on the node, so that all GPUs are visible from all tasks).
Comment 3 Simon Byrne 2023-10-24 10:01:42 MDT
What if it were an extra option to "--gpu-bind" (e.g. "srun --gpu-bind=soft,per_task:1") to use environment variables instead of cgroups for task binding?
Comment 4 Brian F Gilmer 2023-11-03 07:52:03 MDT
From the site:
Yes, that would work too. I believe sites could then decide whether GPU options should always be rewritten (via job_submit) to use "soft".
Comment 5 Jean-Yves Vet 2023-12-12 00:20:09 MST
Hello, any update on that case?
Comment 7 Ben Glines 2023-12-13 11:20:56 MST
Oriol is on PTO so I am taking over this bug now.

I believe I understand your use case, and I see why you would want some sort of option to do this "soft" binding: let each task see every GPU in the job, but have *_VISIBLE_DEVICES set according to the binding requested with --tres-bind / --gpu-bind.

I'm getting some internal discussion going on this, and I'll make sure to update you ASAP, since I understand that you've been waiting a couple of months on this.
Comment 14 Jean-Yves Vet 2024-01-18 05:47:28 MST
Oriol, Ben, do you have an update on that case?
Thanks
Comment 22 Ben Glines 2024-02-06 17:01:32 MST
Hello,

This issue will be resolved in 23.11.4 with the addition of the "allow-task-sharing" option to --gres-flags.

See commit 79780501e1.
https://github.com/SchedMD/slurm/commit/79780501e1f55449b99f8f40e9a9b8a423f2b230

example task:
> $ cat task.sh
> #!/bin/bash
> 
> printf "Task $SLURM_PROCID:\n `nvidia-smi -L && env | grep CUDA` \n\n"
Job with default gres settings. Note that each task can only see the GPU that it is bound to:
> $ srun --gpus-per-task=1 -n2 task.sh
> Task 0:
>  GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 4g.20gb     Device  0: (UUID: MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335)
> CUDA_VISIBLE_DEVICES=MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335
> 
> Task 1:
>  GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 3g.20gb     Device  0: (UUID: MIG-121e5901-4794-5b0a-ba9b-168f4ba10447)
> CUDA_VISIBLE_DEVICES=MIG-121e5901-4794-5b0a-ba9b-168f4ba10447
Job with --gres-flags=allow-task-sharing. Note that each task can see every GPU within the job allocation that is on the same node as the task, which allows inter-GPU communication (MIG instances do not support P2P; they are simply used for demo purposes here). Also note that CUDA_VISIBLE_DEVICES accurately reflects the GPU binding (per_task:1 in this case) and only shows the device bound to the task:
> $ srun --gpus-per-task=1 -n2 --gres-flags=allow-task-sharing task.sh
> Task 0:
>  GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 4g.20gb     Device  0: (UUID: MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335)
>   MIG 3g.20gb     Device  1: (UUID: MIG-121e5901-4794-5b0a-ba9b-168f4ba10447)
> CUDA_VISIBLE_DEVICES=MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335
> 
> Task 1:
>  GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-52b451c3-8778-72e3-7e53-0eed7e991248)
>   MIG 4g.20gb     Device  0: (UUID: MIG-d38ab0fb-5fea-5c3d-ac17-86da539e6335)
>   MIG 3g.20gb     Device  1: (UUID: MIG-121e5901-4794-5b0a-ba9b-168f4ba10447)
> CUDA_VISIBLE_DEVICES=MIG-121e5901-4794-5b0a-ba9b-168f4ba10447
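
Applying the new flag to the kind of launch from the original description would look roughly like this (a sketch; the application and task count are assumptions, and it presumes --gpu-bind combines with allow-task-sharing the same way --gpus-per-task does above):

srun --ntasks-per-node=8 --gpus-per-task=1 --gres-flags=allow-task-sharing ./app
srun --ntasks-per-node=8 --gpu-bind=map_gpu:4,5,2,3,6,7,0,1 --gres-flags=allow-task-sharing ./app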

Let me know if there are any further questions.
Comment 23 Jean-Yves Vet 2024-02-12 11:07:10 MST
Hello Ben, 

This patch looks great. Many thanks!
To summarize, with 23.11.4 we could still define GRES that way with AMD GPUs in gres.conf:
NodeName=<node_names> Name=gpu Type=mixxx Flags=amd_gpu_env File=/dev/dri/renderD132 Cores=0-7

(Flags=amd_gpu_env is still advised, right?)

Then we could use a "job submit plugin" to make --gres-flags=allow-task-sharing the default behaviour with a partition. Is that right?
Comment 24 David Gloe 2024-06-04 06:14:36 MDT
Is this only an issue when ConstrainDevices=yes is set in cgroup.conf? I'm investigating another bug that might be related to this problem.
Comment 25 Ben Glines 2024-07-18 09:25:32 MDT
Sorry for the late reply here. I presume you've already figured these things out, but I'll still respond to confirm them. I wasn't checking my "RESOLVED" tickets for responses; in the future, please re-open the ticket in case an engineer isn't actively checking their "RESOLVED" tickets.

(In reply to Jean-Yves Vet from comment #23)
> Hello Ben, 
> 
> This patch looks great. Many thanks!
> To summarize, with 23.11.4 we could still define GRES that way with AMD GPUs
> in gres.conf:
> NodeName=<node_names> Name=gpu Type=mixxx Flags=amd_gpu_env
> File=/dev/dri/renderD132 Cores=0-7
Yes, that looks good.

> (Flags=amd_gpu_env is still advised, right?)
Correct.

> Then we could use a "job submit plugin" to make
> --gres-flags=allow-task-sharing the default behaviour with a partition. Is
> that right?
Correct.

(In reply to David Gloe from comment #24)
> Is this only an issue when ConstrainDevices=yes is set in cgroup.conf? I'm
> investigating another bug that might be related to this problem.
Yes. If ConstrainDevices=no (this is the default setting), all tasks should be able to see all devices.
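
For reference, the setting in question lives in cgroup.conf (a reminder sketch; the default is "no" as noted above):

# cgroup.conf
ConstrainDevices=yes   # device cgroup enforcement enabled; this is the configuration where the issue appears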