Summary: | Warnings with tres-bind and gpu map when gpus are not 0,1 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Josko Plazonic <plazonic> |
Component: | GPU | Assignee: | Scott Hilton <scott> |
Status: | OPEN --- | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 24.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Princeton (PICSciE) | Alineos Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Description
Josko Plazonic
2025-03-27 09:21:06 MDT
Hi Josko,

As per [1]:

"If the task/cgroup plugin is used and ConstrainDevices is set in cgroup.conf, then the gres IDs are zero-based indexes relative to the gres allocated to the job (e.g. the first gres is 0, even if the global ID is 3). Otherwise, the gres IDs are global IDs, and all gres on each node in the job should be allocated for predictable binding results."

I am assuming that ConstrainDevices is not used in the cluster you are testing. If that is correct, the behaviour is expected, as it falls back to the default behaviour when the mapping is not possible.

Regards,
Carlos.

[1] https://slurm.schedmd.com/sbatch.html#OPT_map:%3Clist%3E

Sorry, but we do have things configured correctly:

    [root@mcmillan-r1g1 ~]# scontrol show config |grep TaskPlugin
    TaskPlugin = task/cgroup,task/affinity
    TaskPluginParam = (null type)
    [root@mcmillan-r1g1 ~]# cat /etc/slurm/cgroup.conf
    ### Managed by puppet - do not change
    #
    # Slurm cgroup support configuration file
    #
    # See man slurm.conf and man cgroup.conf for further
    # information on cgroup configuration parameters
    #--
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes

After all, it probably would not work at all with map=0,1.

What exactly do you mean by:
> After all, it probably would not work at all with map=0,1.
Thank you!
I meant that if we had not configured ConstrainDevices=yes while using task/cgroup, then a job given GPUs #2,3 would probably not work as expected with --tres-bind=gres/gpu:map:0,1. But it does.

Josko,

I reproduced your issue and see the problem. I think it may be a bug. I am investigating a potential solution.

-Scott
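
For readers following the thread, the rule quoted from the sbatch documentation can be illustrated with a small Python sketch. This is not Slurm's implementation: the function name `resolve_gpu_map`, its parameters, and the choice to raise `ValueError` for an ID that cannot be mapped are all hypothetical and serve only to show the difference between job-relative and global GPU IDs. What Slurm actually does with an unmappable ID is not modeled here.

```python
def resolve_gpu_map(allocated, gpu_map, constrain_devices):
    """Translate a map:<list> of GPU IDs into global GPU IDs (illustrative only).

    allocated         -- global IDs of the GPUs given to the job, e.g. [2, 3]
    gpu_map           -- IDs from --tres-bind=gres/gpu:map:<list>, one per task
    constrain_devices -- True if task/cgroup with ConstrainDevices=yes is in use
    """
    resolved = []
    for gpu_id in gpu_map:
        if constrain_devices:
            # Documented rule: IDs are zero-based indexes relative to the
            # GPUs allocated to the job.
            if gpu_id < 0 or gpu_id >= len(allocated):
                raise ValueError(f"index {gpu_id} is outside the job's allocation")
            resolved.append(allocated[gpu_id])
        else:
            # Documented rule: IDs are global IDs.
            # Assumption for this sketch: an ID outside the allocation is an error.
            if gpu_id not in allocated:
                raise ValueError(f"global GPU {gpu_id} is not allocated to the job")
            resolved.append(gpu_id)
    return resolved


# The case from this ticket: the job holds global GPUs 2 and 3, map is 0,1.
print(resolve_gpu_map([2, 3], [0, 1], constrain_devices=True))  # [2, 3]
```

Under the documented rule, map:0,1 on a job holding GPUs 2 and 3 is a valid request when ConstrainDevices=yes, which matches the configuration shown in the report.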