Ticket 22448 - Warnings with tres-bind and gpu map when gpus are not 0,1
Summary: Warnings with tres-bind and gpu map when gpus are not 0,1
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU (show other tickets)
Version: 24.11.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
 
Reported: 2025-03-27 09:21 MDT by Josko Plazonic
Modified: 2025-03-28 13:10 MDT (History)
Site: Princeton (PICSciE)


Description Josko Plazonic 2025-03-27 09:21:06 MDT
Hello,

Take this job:

#!/bin/bash
#SBATCH -t 1:02:00
#SBATCH -N 1
#SBATCH -n 13
#SBATCH --gres=gpu:2
#SBATCH --tres-bind=gres/gpu:verbose,map:0,1
srun -l /home/plazonic/gputestranks.sh
sleep 200

and gputestranks.sh is:

#!/bin/bash
echo `hostname`,`nvidia-smi --query-gpu=pci.bus_id --format=csv`,$CUDA_VISIBLE_DEVICES

When this gets scheduled on GPUs 0,1, the output is:

 0: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 1: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 2: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 3: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 4: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 5: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 6: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 7: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 8: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 9: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
10: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
11: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
12: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
 7: mcmillan-r1g1,pci.bus_id 00000000:04:00.0,0
12: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0
 0: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0
 3: mcmillan-r1g1,pci.bus_id 00000000:04:00.0,0
 1: mcmillan-r1g1,pci.bus_id 00000000:04:00.0,0
 2: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0
11: mcmillan-r1g1,pci.bus_id 00000000:04:00.0,0
 9: mcmillan-r1g1,pci.bus_id 00000000:04:00.0,0
 8: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0
 5: mcmillan-r1g1,pci.bus_id 00000000:04:00.0,0
10: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0
 4: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0
 6: mcmillan-r1g1,pci.bus_id 00000000:03:00.0,0

Exactly as expected. When it gets scheduled on GPUs #2,3, we get this:

 0: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 0. Binding to the first device in the allocation instead.
 1: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 1. Binding to the first device in the allocation instead.
 4: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 4. Binding to the first device in the allocation instead.
 4: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 5: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 5. Binding to the first device in the allocation instead.
 5: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 6: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 6. Binding to the first device in the allocation instead.
 6: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 7: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 7. Binding to the first device in the allocation instead.
 7: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 8: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 8. Binding to the first device in the allocation instead.
 8: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 9: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 9. Binding to the first device in the allocation instead.
 9: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
10: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 10. Binding to the first device in the allocation instead.
10: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
11: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 11. Binding to the first device in the allocation instead.
11: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 0: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 1: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 2: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 2. Binding to the first device in the allocation instead.
 2: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 3: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 3. Binding to the first device in the allocation instead.
 3: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
12: slurmstepd: error: Bind request gres/gpu:verbose,map:0,1 does not specify any devices within the allocation for task 12. Binding to the first device in the allocation instead.
12: gpu-bind: usable_gres=0x1; bit_alloc=0xC; local_inx=2; global_list=2; local_list=0
 7: mcmillan-r1g1,pci.bus_id 00000000:83:00.0,0
11: mcmillan-r1g1,pci.bus_id 00000000:83:00.0,0
 2: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
 5: mcmillan-r1g1,pci.bus_id 00000000:83:00.0,0
 4: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
 6: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
 1: mcmillan-r1g1,pci.bus_id 00000000:83:00.0,0
10: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
12: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
 0: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
 3: mcmillan-r1g1,pci.bus_id 00000000:83:00.0,0
 8: mcmillan-r1g1,pci.bus_id 00000000:82:00.0,0
 9: mcmillan-r1g1,pci.bus_id 00000000:83:00.0,0

The allocation clearly did what we wanted - alternating GPUs were given to the tasks - but why the errors from slurmstepd? Are these spurious errors, or what is going on here?
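For reference when reading the gpu-bind lines above: bit_alloc and usable_gres are bitmasks over device indexes. A minimal sketch (not Slurm's actual code) decoding the masks seen in the two runs:

```python
def decode_mask(mask):
    """Return the list of set bit positions in a gres bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

# bit_alloc from the first run: 0x3 -> GPUs 0 and 1 allocated
assert decode_mask(0x3) == [0, 1]
# bit_alloc from the second run: 0xC -> GPUs 2 and 3 allocated
assert decode_mask(0xC) == [2, 3]
# usable_gres=0x1 in both runs: each task sees one (locally indexed) device
assert decode_mask(0x1) == [0]
print("masks decoded")
```

This matches the bus IDs in the output: 0x3 corresponds to the GPUs at 03:00.0/04:00.0, and 0xC to the GPUs at 82:00.0/83:00.0.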

BTW this is the nvidia-smi on that machine

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P100-PCIE-16GB           On  |   00000000:03:00.0 Off |                    0 |
| N/A   33C    P0             27W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           On  |   00000000:04:00.0 Off |                    0 |
| N/A   35C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla P100-PCIE-16GB           On  |   00000000:82:00.0 Off |                    0 |
| N/A   33C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla P100-PCIE-16GB           On  |   00000000:83:00.0 Off |                    0 |
| N/A   32C    P0             27W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Josko
Comment 1 Carlos Tripiana Montes 2025-03-27 09:54:17 MDT
Hi Josko,

As per [1]:

"If the task/cgroup plugin is used and ConstrainDevices is set in cgroup.conf, then the gres IDs are zero-based indexes relative to the gres allocated to the job (e.g. the first gres is 0, even if the global ID is 3). Otherwise, the gres IDs are global IDs, and all gres on each node in the job should be allocated for predictable binding results."

I am assuming that ConstrainDevices is not used in the cluster you are testing. If that is correct, the behaviour is expected, as Slurm falls back to the default behaviour when the mapping is not possible.

Regards,
Carlos.

[1] https://slurm.schedmd.com/sbatch.html#OPT_map:%3Clist%3E
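To illustrate the quoted documentation, here is a hypothetical sketch (not Slurm source) of how a map:0,1 request could be resolved against an allocation of global GPUs {2,3}, with and without ConstrainDevices:

```python
def resolve_map_binding(map_ids, allocated, constrain_devices):
    """Resolve requested map IDs to global GPU IDs per the quoted docs.

    With ConstrainDevices set, map IDs are zero-based indexes relative
    to the job's allocation; otherwise they are global IDs and must
    already be inside the allocation.
    """
    alloc = sorted(allocated)
    if constrain_devices:
        return [alloc[i] for i in map_ids if i < len(alloc)]
    return [i for i in map_ids if i in alloc]

# Job allocated global GPUs 2 and 3, user requests map:0,1
assert resolve_map_binding([0, 1], {2, 3}, constrain_devices=True) == [2, 3]
# Without ConstrainDevices, IDs 0 and 1 fall outside the allocation,
# yielding no usable device - the case the slurmstepd fallback message covers
assert resolve_map_binding([0, 1], {2, 3}, constrain_devices=False) == []
```

Under this reading, a site with ConstrainDevices=yes (as confirmed below) should hit the first branch, which is why the warnings in the second run are surprising.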
Comment 2 Josko Plazonic 2025-03-27 09:58:00 MDT
Sorry, but we do have things configured correctly:

[root@mcmillan-r1g1 ~]# scontrol show config |grep TaskPlugin
TaskPlugin              = task/cgroup,task/affinity
TaskPluginParam         = (null type)

[root@mcmillan-r1g1 ~]# cat /etc/slurm/cgroup.conf 
### Managed by puppet - do not change
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

After all, it probably would not work at all with map=0,1.
Comment 3 Carlos Tripiana Montes 2025-03-27 10:00:26 MDT
What do you mean exactly with:

> After all, it probably would not work at all with map=0,1.

Thank you!
Comment 4 Josko Plazonic 2025-03-27 10:03:58 MDT
I meant that if we had not configured ConstrainDevices=yes along with task/cgroup, then in the case where the job is given GPUs #2,3 the binding probably would not work as expected with --tres-bind=gres/gpu:map:0,1 - but it does work.
Comment 5 Scott Hilton 2025-03-28 13:10:45 MDT
Josko,

I reproduced your issue and see the problem. I think it may be a bug. I am investigating a potential solution.

-Scott