Ticket 7726 - gpu-bind closest is wrong
Summary: gpu-bind closest is wrong
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 19.05.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on: 7509
Blocks:
Reported: 2019-09-11 06:59 MDT by hpc-admin
Modified: 2020-06-09 08:35 MDT

See Also:
Site: Ghent
Version Fixed: 20.02.4 20.11.0pre1


Attachments
slurm configuration files (1.64 KB, application/gzip)
2019-09-11 07:00 MDT, hpc-admin
Details
job output (1.01 KB, text/plain)
2019-09-13 07:35 MDT, hpc-admin
Details
per task output (2.31 KB, application/gzip)
2019-09-18 05:56 MDT, hpc-admin
Details
more per task output (106.57 KB, application/gzip)
2019-09-19 02:23 MDT, hpc-admin
Details
pass RANK_ID to step_reset_env(v1) (4.31 KB, patch)
2019-09-19 05:51 MDT, Marcin Stolarek
Details | Diff
new output with patch (106.50 KB, application/gzip)
2019-09-19 14:22 MDT, hpc-admin
Details
7726_1905(v2 PROTOTYPE) (13.28 KB, patch)
2019-09-26 08:01 MDT, Marcin Stolarek
Details | Diff
7726_1905(v3 PROTOTYPE) (11.74 KB, patch)
2019-10-02 13:10 MDT, Marcin Stolarek
Details | Diff
slurm output (100.65 KB, application/gzip)
2019-10-03 14:36 MDT, hpc-admin
Details
WIP_DEBUGGING patch(v4) (21.89 KB, patch)
2019-10-04 03:09 MDT, Marcin Stolarek
Details | Diff
output for WIP_DEBUGGING (37.81 KB, application/gzip)
2019-10-04 04:06 MDT, hpc-admin
Details
WIP_DEBUGGING patch(v5) (27.88 KB, patch)
2019-10-04 06:13 MDT, Marcin Stolarek
Details | Diff
WIP_DEBUGGING patch(v6) (27.91 KB, patch)
2019-10-04 07:40 MDT, Marcin Stolarek
Details | Diff
new output with new patch (33.73 KB, application/gzip)
2019-10-04 08:47 MDT, hpc-admin
Details
output for map+gpt, closest+pgt and closest (106.83 KB, application/gzip)
2019-10-05 08:43 MDT, hpc-admin
Details
Clean collection of patches for Ghent (10.41 KB, patch)
2019-10-08 10:22 MDT, Marcin Stolarek
Details | Diff
potential regression in 19.05.3 (1.41 KB, patch)
2019-10-09 09:04 MDT, Marcin Stolarek
Details | Diff
regression in 19.05.3 mitigation (v2) (1014 bytes, patch)
2019-10-10 05:56 MDT, Marcin Stolarek
Details | Diff

Description hpc-admin 2019-09-11 06:59:01 MDT
our nodes have 4 gpus each, gres.conf only has automatic nvml, no explicit listing of the gpu devices

submitting a job with sbatch --gres=gpu:4, i'm trying to start an mpi application using pmix that uses (or should use) 1 gpu per mpi rank. i expect that the CUDA_VISIBLE_DEVICES variable has only 1 gpu id and that the cpu pinning is "the correct one"

to do so, i specify --gpus-per-task=1 and ask for a multiple of 4 tasks.
this does not seem to limit the CUDA_VISIBLE_DEVICES variable at all; every mpi rank/process has all gpus. (not sure what --gpus-per-task is for).

when i try to couple each task's assigned cpus to the "correct" gpu using --gpu-bind=closest i get something very wrong (detailed output below):

eg GPU1 is coupled to the first numa domain (with the even cpu cores), however all processes only get the first 2 GPU ids; eg process 184984 is bound to the first 8 cores of the second numa domain (with odd core numbers), yet it still gets gpu ids 0-1, which are only ideal for the even cores.

my questions are:
- how do i limit 1 gpu id to 1 rank (i'll try the gpu map stuff later, but i'm not sure i'll always use full nodes, so the exclusive behaviour is not so ideal)
- the --gpu-bind=closest result seems wrong; this could be a bug, something i misconfigured, or maybe a wrong dependency (this is an el7 box with hwloc 1.11.8, but the hwloc changelog does not mention any specific fixes wrt gpu location info)


stijn


[root@node3300 ~]# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	mlx5_0	mlx5_1	CPU Affinity
GPU0	 X 	NV2	NV2	NV2	NODE	SYS	0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30
GPU1	NV2	 X 	NV2	NV2	NODE	SYS	0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30
GPU2	NV2	NV2	 X 	NV2	SYS	NODE	1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31
GPU3	NV2	NV2	NV2	 X 	SYS	NODE	1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31
mlx5_0	NODE	NODE	SYS	SYS	 X 	SYS	
mlx5_1	SYS	SYS	NODE	NODE	SYS	 X 	

[root@node3300 ~]# nvidia-smi 
...
|=============================================================================|
|    0    184983      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   355MiB |
|    0    184984      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   353MiB |
|    1    184985      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   353MiB |
|    1    184986      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   353MiB |
+-----------------------------------------------------------------------------+
[root@node3300 ~]# taskset -cp 184984
pid 184984's current affinity list: 1,3,5,7,9,11,13,15
[root@node3300 ~]# strings /proc/184984/environ |grep -e 'CUDA_VIS\|GPU'
CUDA_VISIBLE_DEVICES=0,1
GPU_DEVICE_ORDINAL=0,1
SLURM_STEP_GPUS=0,1
Comment 1 hpc-admin 2019-09-11 07:00:37 MDT
Created attachment 11539 [details]
slurm configuration files

cgroup.conf  gres.conf  plugstack.conf  slurm.con
Comment 2 hpc-admin 2019-09-11 07:21:40 MDT
using map_gpu, it looks like the SLURM_STEP_GPUS variable is correct, but for some reason CUDA_VISIBLE_DEVICES is always set to the same value, 0



+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    196634      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   359MiB |
|    0    196635      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   353MiB |
|    0    196636      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   353MiB |
|    0    196637      C   ...OMACS/2019.3-fosscuda-2019a/bin/gmx_mpi   359MiB |
+-----------------------------------------------------------------------------+
[root@node3300 ~]# strings /proc/19663[4-7]/environ |grep -e 'SLURM_STEP_GPU\|VISIBLE\|SLURMD_TRES_BIND'
CUDA_VISIBLE_DEVICES=0
SLURM_STEP_GPUS=0
SLURMD_TRES_BIND=gpu:map_gpu:0,2,1,3
CUDA_VISIBLE_DEVICES=0
SLURM_STEP_GPUS=2
SLURMD_TRES_BIND=gpu:map_gpu:0,2,1,3
CUDA_VISIBLE_DEVICES=0
SLURM_STEP_GPUS=1
SLURMD_TRES_BIND=gpu:map_gpu:0,2,1,3
CUDA_VISIBLE_DEVICES=0
SLURM_STEP_GPUS=3
SLURMD_TRES_BIND=gpu:map_gpu:0,2,1,3
Comment 3 Marcin Stolarek 2019-09-12 07:36:06 MDT
Stijn,

Did you verify whether you actually get the appropriate devices - for instance by running nvidia-smi while the job is running and checking the allocated memory?

CUDA_VISIBLE_DEVICES=0 means that CUDA will use the 1st GPU device found in the namespace the application is running in. In your configuration you're using the task/cgroup plugin and you constrain devices in your cgroup.conf. In this case the task will "see" only one /dev/nvidiaX device (not necessarily the same one for each task) and CUDA_VISIBLE_DEVICES=0 will be correct. Depending on which device is visible, the same value of the variable may result in a different GPU allocation.

Maybe this also answers your questions about --gpu-bind=closest, if by GPU ID you mean the value of $CUDA_VISIBLE_DEVICES?
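As an illustration of that relative enumeration, a toy sketch (no GPUs required; the allowed-device list below is hypothetical):

```shell
# Toy illustration of in-cgroup device renumbering: a step constrained to
# /dev/nvidia2 and /dev/nvidia3 enumerates them as devices 0 and 1, so
# CUDA_VISIBLE_DEVICES=0 inside that cgroup refers to physical GPU 2.
allowed="2 3"   # hypothetical: physical GPU minor numbers the cgroup allows
relative=0
for phys in $allowed; do
  echo "physical /dev/nvidia$phys -> in-cgroup device $relative"
  relative=$((relative + 1))
done
```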

cheers,
Marcin
Comment 4 hpc-admin 2019-09-12 08:04:54 MDT
hi marcin,

unless i'm seriously mistaken, the nvidia-smi output shows the processes on the gpu ids (it was run as root on the nodes), and i assume that the gpu ids there are absolute wrt the system.

so there you can see the processes mapped to eg 0 or 1 in the bind=closest case, or all to 0 in the bind=map case.

so maybe the CUDA_VISIBLE_DEVICES is correct, but the actual binding of the mpi process to the device did not happen (reminder: this is a srun inside an sbatch that asked for all gpus).

(i'm probably doing something wrong, but it is not easy to tell from the manual how to do this correctly: i ask for x gpus per node using --gres when submitting the job, and i want to start mpi processes with 1 gpu per rank using pmix/srun)

stijn
Comment 5 hpc-admin 2019-09-13 06:07:08 MDT
another clue, definitely a bug:

this is output from slurmd -vvvv. as you can see, nvml detects the gpus, but the affinity range and the affinity range abstract are not the same, and always 50% wrong.

the affinity range is correct; it's the abstraction that isn't.


stijn


[2019-09-13T14:00:01.477] debug2: GPU index 0:
[2019-09-13T14:00:01.477] debug2:     Name: tesla_v100-sxm2-32gb
[2019-09-13T14:00:01.477] debug2:     Brand/Type: tesla
[2019-09-13T14:00:01.477] debug2:     UUID: GPU-5614fdf4-2479-17e8-fd81-ab00dac29bb8
[2019-09-13T14:00:01.477] debug2:     PCI Domain/Bus/Device: 0:24:0
[2019-09-13T14:00:01.477] debug2:     PCI Bus ID: 00000000:18:00.0
[2019-09-13T14:00:01.477] debug2:     NVLinks: -1,2,2,2
[2019-09-13T14:00:01.477] debug2:     Device File (minor number): /dev/nvidia0
[2019-09-13T14:00:01.477] debug2:     CPU Affinity Range: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
[2019-09-13T14:00:01.477] debug2:     CPU Affinity Range Abstract: 0-15
--
[2019-09-13T14:00:01.480] debug2: GPU index 1:
[2019-09-13T14:00:01.480] debug2:     Name: tesla_v100-sxm2-32gb
[2019-09-13T14:00:01.480] debug2:     Brand/Type: tesla
[2019-09-13T14:00:01.480] debug2:     UUID: GPU-5b357e27-0edc-47b2-c77f-75f2e73c2a86
[2019-09-13T14:00:01.480] debug2:     PCI Domain/Bus/Device: 0:59:0
[2019-09-13T14:00:01.480] debug2:     PCI Bus ID: 00000000:3B:00.0
[2019-09-13T14:00:01.480] debug2:     NVLinks: 2,-1,2,2
[2019-09-13T14:00:01.480] debug2:     Device File (minor number): /dev/nvidia1
[2019-09-13T14:00:01.480] debug2:     CPU Affinity Range: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
[2019-09-13T14:00:01.480] debug2:     CPU Affinity Range Abstract: 0-15
--
[2019-09-13T14:00:01.482] debug2: GPU index 2:
[2019-09-13T14:00:01.482] debug2:     Name: tesla_v100-sxm2-32gb
[2019-09-13T14:00:01.482] debug2:     Brand/Type: tesla
[2019-09-13T14:00:01.482] debug2:     UUID: GPU-c587cc7d-c6a3-f275-8916-71fbefcf34d7
[2019-09-13T14:00:01.482] debug2:     PCI Domain/Bus/Device: 0:134:0
[2019-09-13T14:00:01.482] debug2:     PCI Bus ID: 00000000:86:00.0
[2019-09-13T14:00:01.482] debug2:     NVLinks: 2,2,-1,2
[2019-09-13T14:00:01.482] debug2:     Device File (minor number): /dev/nvidia2
[2019-09-13T14:00:01.482] debug2:     CPU Affinity Range: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[2019-09-13T14:00:01.482] debug2:     CPU Affinity Range Abstract: 16-31
--
[2019-09-13T14:00:01.484] debug2: GPU index 3:
[2019-09-13T14:00:01.484] debug2:     Name: tesla_v100-sxm2-32gb
[2019-09-13T14:00:01.484] debug2:     Brand/Type: tesla
[2019-09-13T14:00:01.484] debug2:     UUID: GPU-e7d763ad-438e-04ed-c4e7-330e21265f66
[2019-09-13T14:00:01.484] debug2:     PCI Domain/Bus/Device: 0:175:0
[2019-09-13T14:00:01.484] debug2:     PCI Bus ID: 00000000:AF:00.0
[2019-09-13T14:00:01.484] debug2:     NVLinks: 2,2,2,-1
[2019-09-13T14:00:01.484] debug2:     Device File (minor number): /dev/nvidia3
[2019-09-13T14:00:01.484] debug2:     CPU Affinity Range: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[2019-09-13T14:00:01.484] debug2:     CPU Affinity Range Abstract: 16-31
Comment 6 hpc-admin 2019-09-13 06:33:57 MDT
my apologies, i guess the abstract format is some sort of internal format (or similar to the hwloc logical format).

so this looks ok.

stijn
Comment 7 Marcin Stolarek 2019-09-13 06:58:36 MDT
Stijn,

Yes - the abstract form represents the bitmap used by Slurm internally, which for affinity is translated to the numbers used by hwloc.

Could you please share the sequence of commands you use? I don't see any obvious mistake in your description.

Did you check the utilization of the GPUs (the table above the processes in the nvidia-smi output) to confirm whether the job really ends up on only one or two gpus, which as I understand is your main concern? I'm not sure how nvidia-smi gets the GPU ID information in the process table.


I'm setting up a test environment with a few GPUs to reproduce the case.

cheers,
Marcin
Comment 8 hpc-admin 2019-09-13 07:08:46 MDT
hi marcin,

i think i start to understand what you said ;)

so it looks like the tasks do not have devices constrained (or rather re-constrained, since the job itself has the devices properly constrained), and the code that sets the env variables assumes they are properly constrained.

one other question: slurm.conf manpage has following note:

NOTE: It is recommended to stack task/affinity,task/cgroup together when configuring  TaskPlugin,  and  setting  TaskAffinity=no  and  ConstrainCores=yes  in cgroup.conf.  This  setup  uses  the task/affinity plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.

can you confirm that we have to set TaskAffinity=no (we now use TaskAffinity=yes as you can see from our cgroup.conf file; and it is also what the explanation seems to suggest)?


stijn
Comment 9 hpc-admin 2019-09-13 07:18:41 MDT
our comments just crossed each other. i'll try to make some sample case to reproduce.

i basically submit a job asking for 2 full nodes (32 core nodes)

/usr/bin/sbatch --nodes=2 --ntasks=64 --ntasks-per-node=32 --gres=gpu:4  job.sh

at the start of the job, i "sanitise" the environment (getting rid of SLURM variables from the sbatch that might interfere with the srun call)

and in the jobscript i start an mpi application that uses openmpi4, and use pmix3 to start the mpi with srun:
srun --nodes=2 --ntasks=8 --cpus-per-task=8 --mem-per-cpu=7600 --gpus-per-task=1 --export=ALL --mpi=pmix_v3  --gpu-bind=closest  application

(or --gpu-bind=map_gpu:0,2,1,3)



stijn
Comment 10 Marcin Stolarek 2019-09-13 07:24:52 MDT
Stijn,

Can you execute it with a script like the one below, instead of your MPI job?
>ls /dev/nvidia* && echo $CUDA_VISIBLE_DEVICES
Could you please attach the listing with commands and result to the bug report?


I'd recommend setting TaskAffinity=no as suggested by the manual. 

cheers,
Marcin
Comment 11 hpc-admin 2019-09-13 07:26:03 MDT
hi marcin,

also to be clear, my main concern is twofold:
- i can't seem to control the number of gpus that each task gets. without the gpu-bind options, every task sees all gpus even if i specified --gpus-per-task=1
- i then try to limit this effect by choosing --gpu-bind=closest (the fact that it still doesn't give 1 gpu per rank is still annoying), but i also end up with the wrong gpus wrt cpu affinity
- with --gpu-bind=map_gpu:... and --gpus-per-task=1 i get 1 gpu device per rank (hooray), but this only works for whole nodes and it still sees the wrong gpus


it is clear the issue with the wrong gpus comes from the reconstraining that doesn't happen for some reason.
it is however very unfortunate that --gpus-per-task=1 does not give you 1 gpu per task (unless this is related to the constraining that does not happen)

anyway, thanks for looking into it

stijn
Comment 12 hpc-admin 2019-09-13 07:35:07 MDT
Created attachment 11572 [details]
job output
Comment 13 hpc-admin 2019-09-13 07:35:55 MDT
hi marcin,

in attachment: all steps return 0,1 for devices, but each step can list all devices

stijn
Comment 14 hpc-admin 2019-09-13 08:10:47 MDT
i set TaskAffinity=no, same result 

stijn
Comment 15 Marcin Stolarek 2019-09-16 07:19:55 MDT
Stijn,

I'm still working on the setup to replicate the issue. Unfortunately, it's not that easy because of the 4 GPUs in the node. However, it just came to me that maybe you don't really create steps in the allocation but just submit new jobs, so you end up using different nodes?

What do you mean by:
>at the start of the jb , i "sanitise" the enivormnet (getting rid of SLURM 
>variables from the sbatch that might interfere with the srun call)

are you doing "unset" on variables starting with SLURM*? If you unset SLURM_JOB_ID then srun won't recognize that the allocation is created and it will just create an allocation per srun call (submit a new job). Could this be the case?

cheers,
Marcin
Comment 16 hpc-admin 2019-09-16 07:30:23 MDT
hi marcin,

the jobid is not unset. i only remove anything matching 'MEM', 'CPU', 'GPU', 'TASK', 'NODE', 'PROC' (except for NODELIST)
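a minimal sketch of such a sanitisation step might look like this (the sample exports are illustrative; the real wrapper is site-specific):

```shell
# Illustrative sketch of the sanitisation described above: drop SLURM_*
# variables matching MEM/CPU/GPU/TASK/NODE/PROC, keeping SLURM_JOB_ID and
# anything containing NODELIST. The three exports are hypothetical samples.
export SLURM_JOB_ID=12345
export SLURM_MEM_PER_CPU=7600
export SLURM_JOB_NODELIST="node[3300-3301]"
for var in $(env | sed -n 's/^\(SLURM_[A-Z_]*\)=.*/\1/p'); do
  case "$var" in
    *NODELIST*) ;;                                   # keep node lists
    *MEM*|*CPU*|*GPU*|*TASK*|*NODE*|*PROC*) unset "$var" ;;
  esac
done
env | grep '^SLURM_' | sort   # show what survived
```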

the mpi job starts as a step with the same jobid.

if there is something i can check in verbose log files let me know. or if you have a patch that logs additional things (eg to make the device constraining very clear), just send it (we're used to recompiling slurm) and i'll deploy it.

if you need access, that can also be arranged

stijn
Comment 17 Marcin Stolarek 2019-09-18 05:21:58 MDT
Stijn,

I took a closer look at how this works in different scenarios. Could you please repeat the commands you did before, but instead of ls execute:
#nvidia-smi
#echo $CUDA_VISIBLE_DEVICES

If you can share this together with the commands you execute it will be much easier to read for me. You can create a script called by srun, so it will be just copy&paste from your terminal.

This should help us really understand which GPU was assigned to which step, to compare with the expected behavior.

On the GPU ID number returned by nvidia-smi, the situation is not trivial. A few points:
1) It's not a unique number assigned to a GPU on the system. It's built by nvidia-smi during its execution based on PCI address order. You can use the "Bus-Id" field to get a unique identifier.
2) The order of devices returned by nvidia-smi may not match the order used by the CUDA driver. However, from CUDA 7 you can use the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable to enforce the same enumeration.
3) Slurm enumeration of GPUs with AutoDetect=nvml is based on the same library as nvidia-smi - PCI order - so CUDA_VISIBLE_DEVICES set by Slurm should match the CUDA order in case of CUDA_DEVICE_ORDER=PCI_BUS_ID.

If you'd like to further check devices allowed/denied by slurmd you can enable SlurmdDebug=debug level which should show lines like "Allowing access to device.."/"Not allowing access to device" in your slurmd log file.

cheers,
Marcin
Comment 18 hpc-admin 2019-09-18 05:56:11 MDT
Created attachment 11616 [details]
per task output
Comment 19 hpc-admin 2019-09-18 05:59:00 MDT
hi marcin,

i added per task output of

#!/bin/bash
nvidia-smi > out.$SLURM_JOB_ID.$SLURM_TASK_PID
env|sort|grep 'CUDA_[DV]\|SLURM_' >> out.$SLURM_JOB_ID.$SLURM_TASK_PID


(the stdout was mixed into an unreadable state; i also added all slurm env variables)

we have CUDA_DEVICE_ORDER=PCI_BUS_ID set both for slurmd (via the unit file) and for regular shells (via profile.d), so that shouldn't be the issue.


i'll try the debugging next

stijn
Comment 20 hpc-admin 2019-09-18 06:44:42 MDT
hi marcin, 

with SlurmdDebug=debug i get output below:

there is no mention of "allowing access" per task, only per job and per step. for each task there are some xcgroup messages (i guess these pin the tasks to cpus)

so the tasks do not get per-task constraining (or at least not with the same code as the job/step).


stijn


...
[2019-09-18T14:35:52.305] [40000800.0] debug:  Allowing access to device c 195:0 rwm(/dev/nvidia0) for job
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:1 rwm(/dev/nvidia1) for job
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:2 rwm(/dev/nvidia2) for job
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:3 rwm(/dev/nvidia3) for job
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:0 rwm(/dev/nvidia0) for step
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:1 rwm(/dev/nvidia1) for step
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:2 rwm(/dev/nvidia2) for step
[2019-09-18T14:35:52.306] [40000800.0] debug:  Allowing access to device c 195:3 rwm(/dev/nvidia3) for step
[2019-09-18T14:35:52.548] [40000800.0] debug:  IO handler started pid=326504
[2019-09-18T14:35:52.550] [40000800.0] starting 4 tasks
[2019-09-18T14:35:52.550] [40000800.0] task 0 (326722) started 2019-09-18T14:35:52
[2019-09-18T14:35:52.551] [40000800.0] task 1 (326723) started 2019-09-18T14:35:52
[2019-09-18T14:35:52.551] [40000800.0] task 2 (326724) started 2019-09-18T14:35:52
[2019-09-18T14:35:52.552] [40000800.0] task 3 (326725) started 2019-09-18T14:35:52
[2019-09-18T14:35:52.552] [40000800.0] debug:  Setting slurmstepd oom_adj to -1000
[2019-09-18T14:35:52.552] [40000800.0] debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 40000800 stepid 0 taskid 0 max_task_id 0
[2019-09-18T14:35:52.552] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
[2019-09-18T14:35:52.552] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_2540002' already exists
[2019-09-18T14:35:52.552] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_2540002/job_40000800' already exists
[2019-09-18T14:35:52.556] [40000800.0] debug:  jobacct_gather_cgroup_memory_attach_task: jobid 40000800 stepid 0 taskid 0 max_task_id 0
[2019-09-18T14:35:52.556] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2019-09-18T14:35:52.556] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_2540002' already exists
[2019-09-18T14:35:52.556] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_2540002/job_40000800' already exists
[2019-09-18T14:35:52.556] [40000800.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_2540002/job_40000800/step_0' already exists
Comment 21 Marcin Stolarek 2019-09-19 02:00:16 MDT
Stijn,

It looks like I was able to reproduce the issue. To deal with the garbled output (a few processes writing to the same file), I used a script like the one below:

>#!/bin/bash
>exec >> ${SLURM_JOB_ID}.${SLURM_TASK_PID}
>exec 2>> ${SLURM_JOB_ID}.${SLURM_TASK_PID}
>echo "My pid is $$"
>echo "CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES
>echo "======scontrol show job"
>scontrol show job -d ${SLURM_JOB_ID}
>echo "======scontrol show step"
>scontrol show step -d ${SLURM_JOB_ID}.${SLURM_STEP_ID}
>echo "======nvidia-smi"
>nvidia-smi
>echo "======topology from nvidia-smi"
>nvidia-smi topo -m
>echo "======environment"
>env
>echo "======CPU affinity"
>taskset -cp ${SLURM_TASK_PID}

I'm looking into the code to check how we can improve this. Could you be so kind as to gather the information using the script I shared and attach it to the case? I understand that it may be annoying to do similar things a few times, but having proper and clear information from your system is very important to address cases like this one.

cheers,
Marcin
Comment 22 hpc-admin 2019-09-19 02:23:41 MDT
Created attachment 11622 [details]
more per task ouput
Comment 23 hpc-admin 2019-09-19 02:24:50 MDT
hi marcin,

i added the output, this is not annoying at all. i'm glad schedmd support is properly following up on the issues!

stijn
Comment 24 Marcin Stolarek 2019-09-19 05:51:25 MDT
Created attachment 11623 [details]
pass RANK_ID to step_reset_env(v1)

Stijn,

Could you please apply the attached patch and recheck. This should fix the issue with CUDA_VISIBLE_DEVICES being set correctly only for the 1st task (rank) in the step.

As you observed, device affinity is done by task/cgroup only at the job and step level. For task (rank) GPU binding we rely on the process environment.

I'll continue checking this part since I'm not sure if everything else works correctly in this area. However, I'd like to share this portion since you mentioned that you feel comfortable with patch application and Slurm rebuilds.

cheers,
Marcin
Comment 25 hpc-admin 2019-09-19 14:22:56 MDT
Created attachment 11628 [details]
new output with patch

hi marcin,

patch applied, but still wrong assignments. something looks different though; eg there are CUDA_VISIBLE_DEVICES values with an id of 4

stijn
Comment 26 Marcin Stolarek 2019-09-20 01:17:28 MDT
Stijn,

Could you please share the commands you used to get those results? Unfortunately, bind parameters are reflected neither in scontrol show job, scontrol show step, nor the process environment.

cheers,
Marcin
Comment 27 hpc-admin 2019-09-20 01:46:24 MDT
hi marcin,

the commands were the same as the previous output.

srun --chdir=/some/path --nodes=4 --ntasks=16 --cpus-per-task=8 --mem-per-cpu=7600 --gpus-per-task=1 --export=ALL --mpi=pmix_v3 --gpu-bind=closest /some/path/schedmd.sh

(the schedmd.sh is the script code you pasted)


stijn
Comment 28 Marcin Stolarek 2019-09-24 07:20:53 MDT
Stijn,

I'm still looking into this; in fact, it's a set of a few "communicating vessels". I think that the current implementation of --gpu-bind can only work properly with an allocation of full nodes and no constraining of devices in cgroup.conf (ConstrainDevices=no). Could you please verify how it works for you in this case?

Constraining of devices by cgroup, as you've seen, is done up to the step level. Task binding of gpus is only done via CUDA_VISIBLE_DEVICES. This part doesn't take the --gpus-per-task option into account.

I'm gathering the information now. I agree that the situation is not perfect; however, I'll have to double check which issues we can treat as bugs and which are rather feature requests.

cheers,
Marcin
Comment 29 Marcin Stolarek 2019-09-24 07:23:16 MDT
PS. I agree that the patch I shared previously shows some issues for --gpu-bind=closest. My main concern was the --gpu-bind=map_gpu case, but currently I think it can't be very helpful without additional changes.

I'll keep you posted.
Comment 30 hpc-admin 2019-09-24 08:28:04 MDT
hi marcin,

that's pretty bad news for us. i'd at least appreciate that the manpages and other docs are updated to reflect what is working and what is not wrt gpus.

i'm not looking for a way to constrain per task, i would already be happy with setting CUDA_VISIBLE_DEVICES correctly. i'll now have to go write a taskprolog to do exactly this....
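such a taskprolog is not hard to sketch: slurmstepd injects any line the TaskProlog prints in the form "export NAME=value" into the task environment. a hypothetical round-robin version (the 4-GPU default and the helper name are illustrative assumptions, not SchedMD's fix):

```shell
# Hypothetical TaskProlog sketch (not SchedMD's fix): give each task one
# GPU, round-robin by local task ID. slurmstepd injects every line printed
# as "export NAME=value" into the task's environment.
gpu_for_task() {
  local ngpus=${SLURM_GPUS_ON_NODE:-4}   # assumption: 4 GPUs per node
  echo "export CUDA_VISIBLE_DEVICES=$(( ${1:-0} % ngpus ))"
}
gpu_for_task "${SLURM_LOCALID:-0}"
```

this of course only sets the variable; it does not reconstrain the devices in the cgroup.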

glad we have wrappers for our users to hide this mess. 


stijn
Comment 31 Marcin Stolarek 2019-09-24 09:08:42 MDT
Stijn,

I don't mean that we can't fix it. I just wanted to let you know, that it looks like a few issues and I want to clearly separate those, which will require some time on my side.

For instance, if I disable cgroup device constraining (as mentioned in comment 28) I'm getting:
[root@marcin-test-slurm-4k80 cinek_schedmd_com]# salloc --gres=gpu:4 -n24
salloc: Granted job allocation 189
salloc: Waiting for resource configuration
salloc: Nodes test01 are ready for job
[root@marcin-test-slurm-4k80 cinek_schedmd_com]# srun --gpu-bind=closest  /root/check_gpu_env.sh | sort | uniq -c                                                                                           
     16 CUDA_VISIBLE_DEVICE=0,1
      8 CUDA_VISIBLE_DEVICE=2,3
      1 pid 5680's current affinity list: 0
      1 pid 5681's current affinity list: 16
      1 pid 5682's current affinity list: 1
      1 pid 5683's current affinity list: 17
      1 pid 5684's current affinity list: 2
      1 pid 5685's current affinity list: 18
      1 pid 5686's current affinity list: 3
      1 pid 5687's current affinity list: 19
      1 pid 5688's current affinity list: 4
      1 pid 5689's current affinity list: 20
      1 pid 5690's current affinity list: 5
      1 pid 5691's current affinity list: 21
      1 pid 5692's current affinity list: 6
      1 pid 5693's current affinity list: 22
      1 pid 5694's current affinity list: 7
      1 pid 5695's current affinity list: 23
      1 pid 5696's current affinity list: 8
      1 pid 5697's current affinity list: 24
      1 pid 5698's current affinity list: 9
      1 pid 5699's current affinity list: 25
      1 pid 5700's current affinity list: 10
      1 pid 5701's current affinity list: 26
      1 pid 5702's current affinity list: 11
      1 pid 5703's current affinity list: 27

which is the expected behavior and ends up with correct CPU-to-GPU binding. Can you try specifying Cores in "slurm abstract" form in your gres.conf with device files? In my case AutoDetect worked correctly, but I had ranges returned by hwloc, not an odd/even split.
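An explicit gres.conf along those lines might look like this (hypothetical; the core ranges mirror the abstract 0-15/16-31 split from comment 5):

```
# gres.conf - hypothetical explicit listing; Cores= uses Slurm's abstract
# core numbering (0-15 = first socket, 16-31 = second socket, matching
# the "CPU Affinity Range Abstract" values shown in comment 5)
Name=gpu File=/dev/nvidia0 Cores=0-15
Name=gpu File=/dev/nvidia1 Cores=0-15
Name=gpu File=/dev/nvidia2 Cores=16-31
Name=gpu File=/dev/nvidia3 Cores=16-31
```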

Is --gpu-bind=closest the most important for you? Do you have to use it in conjunction with ConstrainDevices=yes, with multiple jobs per node? If the answer to the first is yes and to the second no, then we can focus on fixing this first.

(If you need multiple jobs per node, this puts additional requirements on other plugins (like resource selection), since you have to make sure that you have enough GPUs and CPUs in the same NUMA node. This, of course, can be done properly; my goal was just to give you a potential workaround while I'm checking the details.)

cheers,
Marcin
Comment 32 hpc-admin 2019-09-25 02:19:21 MDT
hi marcin,

wrt to priorities:
- resource constraints: we need multiple jobs per node
- wrt pinning: there seem to be 2 major groups of gpu software with mpi support: normal ones where you should give a single (ie the best) gpu per rank, and some others that do not want this and want to see all gpus per rank, and then some code in the tool distributes the gpus over the ranks.

this case is for the "normal" ones, assigning the best gpus to the ranks (or slurm tasks if you will). it's still a bit annoying that bind=closest returns more than one per rank, but so be it for now.

for the other case, we also have issues, see https://bugs.schedmd.com/show_bug.cgi?id=7801


i can totally understand that it is quite difficult to handle all possible combos of numa and gpus, but that even for the trivial case (job gets whole nodes) there is an issue is a bit surprising (and to some extent disappointing).

i'll have a look at specifying the cores explicitly, but from earlier info, i thought that slurm got the correct info (see https://bugs.schedmd.com/show_bug.cgi?id=7726#c5)

anyway, good luck!


stijn
Comment 33 Marcin Stolarek 2019-09-25 03:17:18 MDT
Stijn, 

While I'm looking into details could you please try the patch from bug 7509 comment 5. I think that it's addressing at least part of the issues you see.

Although, the focus there was on map_gpu functionality.

cheers,
Marcin
Comment 34 hpc-admin 2019-09-25 04:00:51 MDT
hi marcin,

i'll try the patch in a bit, but from that comment i see that the job is submitted asking for 3 out of 4 nodes and using map_gpu. however, the manpage says that map_gpu should only work with full nodes.


stijn
Comment 35 Marcin Stolarek 2019-09-26 08:01:30 MDT
Created attachment 11708 [details]
7726_1905(v2 PROTOTYPE)

Hi Stijn,

I'd like to share a bigger prototype of a patch addressing a few of the issues you may have seen. It should be easy to apply it with git am on top of slurm-19.05; however, it contains two patches I've already shared in the ticket.

It enables the use of --gpus-per-task together with the --gpu-bind options and eliminates the need to disable device constraining in cgroup.conf.

In my configuration (ConstrainDevices=yes, two sockets each with two gpus, each socket 4 cores) I get:

# salloc --gres=gpu:4  -n8
salloc: Granted job allocation 265
# srun --gpu-bind=closest /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=1,0
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=2,3
# srun --gpu-bind=closest --gpus-per-task=1 /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
srun: error: Unable to create step for job 265: More processors requested than permitted
# exit
salloc: Relinquishing job allocation 265
salloc: Job allocation 265 has been revoked.
# salloc --gres=gpu:4  -n4
salloc: Granted job allocation 266
# srun --gpu-bind=closest /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=1,0
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1
# srun --gpu-bind=closest --gpus-per-task=1 /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=0

The last one may be surprising, but in fact it is not. It just shows that we've got CPUs from one socket, which you can verify by running:

[root@slurmctl TEST7726]# srun --gpu-bind=closest --gpus-per-task=1 -n4 /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES && taskset -cp $$'
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=2
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=3
pid 22404's current affinity list: 0,1
pid 22405's current affinity list: 2,3
pid 22406's current affinity list: 4,5
pid 22407's current affinity list: 6,7
[root@slurmctl TEST7726]# exit
salloc: Relinquishing job allocation 269
salloc: Job allocation 269 has been revoked.
[root@slurmctl TEST7726]# salloc --gres=gpu:4  -n4
salloc: Granted job allocation 270
[root@slurmctl TEST7726]# srun --gpu-bind=closest --gpus-per-task=1 -n4 /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES && taskset -cp $$'
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
pid 22502's current affinity list: 1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
slurmstepd-test08: error: gres_per_task = 1
pid 22504's current affinity list: 3
pid 22503's current affinity list: 2
pid 22501's current affinity list: 0
 

Additionally, it fixes an issue with the --gpu-bind=map_gpu option: when the list contained fewer GPUs than there were tasks, some tasks were not limited at all, contrary to the documentation statement: "If the number of tasks (or ranks) exceeds the number of elements in this list, elements in the list will be reused as needed starting from the beginning of the list." 
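The documented wrap-around semantics can be sketched in a few lines (an illustrative model of the man-page behavior, not the actual Slurm source):

```python
# Hypothetical sketch (not Slurm code) of the documented map_gpu
# semantics: when tasks outnumber list entries, the list is reused
# cyclically starting from the beginning.
def map_gpu_assignment(gpu_list, n_tasks):
    """Return one GPU id per task, wrapping around the list."""
    return [gpu_list[i % len(gpu_list)] for i in range(n_tasks)]

# --gpu-bind=map_gpu:0,2,1,3 with 6 tasks: tasks 4 and 5 reuse 0 and 2.
print(map_gpu_assignment([0, 2, 1, 3], 6))  # → [0, 2, 1, 3, 0, 2]
```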

It would be great if you could apply and test it in your environment. As I mentioned, it's just a prototype, and I'd appreciate any comments from you. 

cheers,
Marcin
Comment 36 Marcin Stolarek 2019-10-02 13:10:43 MDT
Created attachment 11786 [details]
7726_1905(v3 PROTOTYPE)

small fix - the change for mask_gpu was incorrect.
Comment 37 hpc-admin 2019-10-02 13:48:19 MDT
hi marcin,

apologies for getting back so late, i was occupied with other things.

so it looks like gpu-bind=closest and gpus-per-task behave as expected. thank you so much for this.

i was looking at some strange things wrt gpu_map, will try your patch right away

maybe some (small) remark (you asked for feedback ;)

the error in your example:
# srun --gpu-bind=closest --gpus-per-task=1 /bin/bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
srun: error: Unable to create step for job 265: More processors requested than permitted

this is trying to start 8 tasks for a total of 8 gpus. so i sort of get why gpus-per-task=1 might also imply 1 task per gpu (although it is not strictly true).
but i guess in a case where 2 tasks would share a gpu, people should use MPS.
also, the error message itself is not clear: it's only more than permitted if you also assume at most 1 gpu per task (and it has nothing to do with processors)
Comment 38 hpc-admin 2019-10-02 14:49:09 MDT
hi marcin,

map_gpu still seems odd

i get error messages in stderr
slurmstepd: error: i=0, local_inx=1, gres_per_task=1 assigned=0
slurmstepd: error: i=1, local_inx=2, gres_per_task=1 assigned=0
slurmstepd: error: i=2, local_inx=3, gres_per_task=1 assigned=0
slurmstepd: error: i=3, local_inx=0, gres_per_task=1 assigned=0


also the pinning seems off:
running with --gpus-per-task=1 --gpu-bind=map_gpu:0213
i would expect that task 0 has gpu 0, task 1 has gpu 2 etc etc; but i get
task 0 has gpu 3, task 1 has gpu 0 ...

-bash-4.2$ grep -e 'CUDA_VISIBLE_DEVICES\|affini' 40000913*
40000913.423333:CUDA_VISIBLE_DEVICES=3
40000913.423333:CUDA_VISIBLE_DEVICES=3
40000913.423333:======CPU affinity
40000913.423333:pid 423333's current affinity list: 0,2,4,6,8,10,12,14
40000913.423334:CUDA_VISIBLE_DEVICES=0
40000913.423334:CUDA_VISIBLE_DEVICES=0
40000913.423334:======CPU affinity
40000913.423334:pid 423334's current affinity list: 1,3,5,7,9,11,13,15
40000913.423335:CUDA_VISIBLE_DEVICES=1
40000913.423335:CUDA_VISIBLE_DEVICES=1
40000913.423335:======CPU affinity
40000913.423335:pid 423335's current affinity list: 16,18,20,22,24,26,28,30
40000913.423336:CUDA_VISIBLE_DEVICES=2
40000913.423336:CUDA_VISIBLE_DEVICES=2
40000913.423336:======CPU affinity
40000913.423336:pid 423336's current affinity list: 17,19,21,23,25,27,29,31


the filenames are jobid.taskpid
Comment 39 hpc-admin 2019-10-02 15:03:00 MDT
hi marcin,

i'm sorry but i was wrong before. the gpu-bind=closest is not working (the combination with --gpus-per-task does give only 1 gpu so that is good).

i'm not sure why i thought it was working before...


this is with closest (tested with both patches)

grep -e 'CUDA_VISIBLE_DEVICES\|affini' 40000919*

40000919.429930:CUDA_VISIBLE_DEVICES=0
40000919.429930:CUDA_VISIBLE_DEVICES=0
40000919.429930:======CPU affinity
40000919.429930:pid 429930's current affinity list: 0,2,4,6,8,10,12,14
40000919.429931:CUDA_VISIBLE_DEVICES=1
40000919.429931:CUDA_VISIBLE_DEVICES=1
40000919.429931:======CPU affinity
40000919.429931:pid 429931's current affinity list: 1,3,5,7,9,11,13,15
40000919.429932:CUDA_VISIBLE_DEVICES=2
40000919.429932:CUDA_VISIBLE_DEVICES=2
40000919.429932:======CPU affinity
40000919.429932:pid 429932's current affinity list: 16,18,20,22,24,26,28,30
40000919.429933:CUDA_VISIBLE_DEVICES=3
40000919.429933:CUDA_VISIBLE_DEVICES=3
40000919.429933:======CPU affinity
40000919.429933:pid 429933's current affinity list: 17,19,21,23,25,27,29,31
Comment 40 Marcin Stolarek 2019-10-03 01:40:41 MDT
Could you please run the jobs you tried with the script from comment 21? I understand that a lot of things are clear to you when you run the jobs, but having the exact commands you executed with detailed information about jobs/steps/tasks will help me a lot.

The error messages you mentioned in comment 38 were not real errors; it's just debugging output I added while working on the issue. I'm sorry I shared this in the patch for you. I also see that you don't have commas between the numbers in map_gpu - could you please check if adding commas changes the behavior?

Am I correct that the issue in comment 39 is that "CUDA_DEVICE 1" is not on the socket with cores 1,3,5..? Detailed info will really help me understand what's wrong there - was it the first step in the job?

cheers,
Marcin
Comment 41 hpc-admin 2019-10-03 04:48:41 MDT
hi marcin,

the output is from that script, i only showed the (relevant) info (visible ids and tasksets) via the grep.
i'll rerun later today and send the whole output.


i'll then also try with commas in the map_gpu (i was a bit confused, as gromacs also has a similar map option without commas).
however, if this was seen as 1 gpu id, why was there no error (as there is no such gpu) and/or why was it not always the same gpu id (as there is only one id)?

you are correct, gpu id 1 is part of socket 0, which only has even core numbers, and it should be mapped to the 2nd set of 8 cores.
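For reference, the intended "closest" behavior on a node like this (gpus 0-1 sharing the even-numbered cores of socket 0, gpus 2-3 the odd-numbered cores of socket 1) can be modeled as an intersection of affinity masks. This is an illustrative sketch with an assumed topology, not Slurm code:

```python
# Assumed node topology: 32 cores, GPUs 0-1 attached to socket 0
# (even physical core ids), GPUs 2-3 to socket 1 (odd core ids).
GPU_CPUSETS = {
    0: set(range(0, 32, 2)),  # socket 0: even cores
    1: set(range(0, 32, 2)),
    2: set(range(1, 32, 2)),  # socket 1: odd cores
    3: set(range(1, 32, 2)),
}

def closest_gpus(task_cpus):
    """GPUs whose affinity mask shares at least one CPU with the task."""
    return sorted(g for g, cpus in GPU_CPUSETS.items() if cpus & task_cpus)

# A task pinned to even cores should see the socket-0 GPUs, and vice versa.
print(closest_gpus({0, 2, 4, 6, 8, 10, 12, 14}))      # → [0, 1]
print(closest_gpus({17, 19, 21, 23, 25, 27, 29, 31}))  # → [2, 3]
```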

i'll get back after a few meetings ;)

stijn
Comment 42 hpc-admin 2019-10-03 14:36:52 MDT
Created attachment 11812 [details]
slurm output

3 subdirs, each has a cmd file with the srun command
all submitted as separate job with
sbatch --nodes=1 --ntasks=32 --ntasks-per-node=32 --gres=gpu:4 
(so this is the first/only step)

run with the patch from comment 36 and the patch from bug 7801 (the block:block:block issue) applied
Comment 43 hpc-admin 2019-10-03 14:43:09 MDT
so map_gpu with commas works as expected (but imho it should fail without commas)

closest still doesn't work, i added a run with closest+block:block:block, which makes it look even stranger.
Comment 44 Marcin Stolarek 2019-10-04 03:09:44 MDT
Created attachment 11816 [details]
WIP_DEBUGGING  patch(v4)

Stijn,

I think that we're dealing with two issues here. 
1) Incorrect GPU being set in CUDA_VISIBLE_DEVICES for --gpu-bind=closest, which is definitively a bug.  
2) A feature request to combine --gpu-bind=closest with --gpus-per-task. I don't want to be definitive, but I just realized that this may be more complicated than I thought - taking block/cyclic distribution of tasks into consideration. 

I'd like to take a step back and focus on the first issue. Could you please apply the attached patch and share information as before, plus slurmd logs from the time of job execution?

The patch removes the impact of --gpus-per-task on CUDA_VISIBLE_DEVICES and adds a lot of slurmd logging that should help me understand the differences between yours and my setup. 

cheers,
Marcin
Comment 45 hpc-admin 2019-10-04 04:06:33 MDT
Created attachment 11817 [details]
output for WIP_DEBUGGING

this is the output and cmd with the patch from comment 44
Comment 46 Marcin Stolarek 2019-10-04 06:13:45 MDT
Created attachment 11819 [details]
WIP_DEBUGGING  patch(v5)

Stijn, 

It looks like the bitmap containing GPU/CPU affinity was incorrect. Could you please check with the attached patch and share the results as previously?

I really appreciate your cooperation and I'm sorry that it takes so many iterations, but to be honest we're solving a few bugs in this ticket.

cheers,
Marcin
Comment 47 hpc-admin 2019-10-04 07:03:06 MDT
the patch doesn't work when starting slurmd:

[root@node3301 log]# slurmd -vvvvvv
slurmd: bitstring.c:714: bit_copy: Assertion `(b) != ((void *)0)' failed.
Aborted
Comment 48 Marcin Stolarek 2019-10-04 07:40:31 MDT
Created attachment 11820 [details]
WIP_DEBUGGING patch(v6)

Oh.. sorry for that. The one here shouldn't have this error.

cheers,
Marcin
Comment 49 hpc-admin 2019-10-04 08:47:03 MDT
Created attachment 11822 [details]
new output with new patch
Comment 50 hpc-admin 2019-10-04 08:47:22 MDT
hi marcin, 

new output uploaded

stijn
Comment 51 Marcin Stolarek 2019-10-04 13:29:03 MDT
Stijn,

Are you sure that all patches from the last attachment were applied? From the output it looks like the last part, fixing the CPU/GPU affinity map, works as expected. Do you agree?
$ cat 40000980.261* | egrep '(CUDA_VISI|^pid)' | uniq
CUDA_VISIBLE_DEVICES=0
pid 261299's current affinity list: 0,2,4,6,8,10,12,14
CUDA_VISIBLE_DEVICES=2
pid 261300's current affinity list: 1,3,5,7,9,11,13,15
CUDA_VISIBLE_DEVICES=1
pid 261301's current affinity list: 16,18,20,22,24,26,28,30
CUDA_VISIBLE_DEVICES=3
pid 261302's current affinity list: 17,19,21,23,25,27,29,31

However, I'd expect some additional logs, and both devices from the socket being present on the list.

It looks like you still have the code setting CUDA_VISIBLE_DEVICES only up to the --gpus-per-task value. This prototype was buggy and may assign GPUs in a suboptimal way depending on core distribution (some devices can be presented twice while others are not used by any task).

I'd suggest fixing only clear bugs in this ticket:
-) corrected GPU affinity for --gpu-bind=closest
-) allow reuse of devices on map_gpu:X,Y list
-) fix CUDA_VISIBLE_DEVICES when ConstrainDevices=YES configured in cgroup.conf (duplicate of bug 7509)
I'll clean up the patch stack fixing those, share it with you and pass it to SchedMD QA process, so fixes can be included in main repository.

For two other requests I see here:
-) limit number of devices on CUDA_VISIBLE_DEVICES list up to --gpus-per-task value
-) throw an error message when the value provided for --gpu-bind=map_gpu: exceeds the number of available devices.

I'd suggest opening separate bug reports for those to get appropriate attention and procedure (for instance, limiting the number of devices on the list is a change in behavior and can eventually be added to 20.02, but rather not to 19.05). Are you good with this?

cheers,
Marcin
Comment 52 hpc-admin 2019-10-05 08:43:21 MDT
Created attachment 11837 [details]
output for map+gpt, closest+pgt and closest
Comment 53 hpc-admin 2019-10-05 08:57:01 MDT
hi marcin,

i sent the output of map+gpus-per-task (gpt)
i attached new output for map+gpt, closest+gpt and closest (the cmd files are correct this time).

this is the output of closest+gpt: there's no single gpu per task (you removed that), and the mapping looks ok. there's one oddity: for the odd groups, the gpu index is switched, but not for the even ones

40000981.141579:My pid is 141579
40000981.141579:CUDA_VISIBLE_DEVICES=0,1
40000981.141579:pid 141579's current affinity list: 0,2,4,6,8,10,12,14
40000981.141580:My pid is 141580
40000981.141580:CUDA_VISIBLE_DEVICES=2,3
40000981.141580:pid 141580's current affinity list: 1,3,5,7,9,11,13,15
40000981.141581:My pid is 141581
40000981.141581:CUDA_VISIBLE_DEVICES=0,1
40000981.141581:pid 141581's current affinity list: 16,18,20,22,24,26,28,30
40000981.141582:My pid is 141582
40000981.141582:CUDA_VISIBLE_DEVICES=3,2
40000981.141582:pid 141582's current affinity list: 17,19,21,23,25,27,29,31


wrt the splitting: sure, whatever is needed to get it all accepted as quickly as possible. wrt timing, not sure it matters, we will probably not run an actual release for a while, so adding the patches is not an issue. i'm not sure yet what we will tell our users about how to use the gpus (or how they can expect them to be assigned)

also, we will start piloting the infra next week, so probably next week will be my last chance to deploy patches and test things this easily...

stijn
Comment 54 hpc-admin 2019-10-05 09:00:47 MDT
marcin, i opened https://bugs.schedmd.com/show_bug.cgi?id=7879 and https://bugs.schedmd.com/show_bug.cgi?id=7880 for the other issues
Comment 59 Marcin Stolarek 2019-10-08 10:22:46 MDT
Created attachment 11872 [details]
Clean collection of patches for Ghent

Stijn, 

Since you mentioned that you'll probably be running a locally patched version of Slurm for some time, I'm sharing this clean set of patches. It addresses the 3 main issues from this ticket.

The "oddity" - the shuffling of GPU devices - was coming from the work-in-progress part related to --gpus-per-task. It shouldn't happen with the patches shared in this set.

Thanks for the info about the last week of the trial period. I'll keep it in mind while working on the other bug of yours I have.

cheers,
Marcin
Comment 60 hpc-admin 2019-10-09 08:48:48 MDT
hi marcin,

with (only) the patch from comment 59 applied (against 19.05.3, not sure if it's relevant), gpu-bind=closest does nothing at all: every task sees all gpus. (also gpus-per-task=1 does nothing, but that is not part of the patches anymore). same for map_gpu with gpus-per-task



stijn
Comment 61 Marcin Stolarek 2019-10-09 09:04:34 MDT
Created attachment 11881 [details]
potential regression in 19.05.3

Stijn, 
I think that this may be related to a regression in slurm-19.05.3 (in a different place). 

Could you please try it with the attached patch?

cheers,
Marcin
Comment 62 hpc-admin 2019-10-09 15:36:10 MDT
hi marcin,


the attached patch fixes map_gpu (and gpus-per-task for map_gpu), but has no effect on gpu-bind=closest
Comment 63 Marcin Stolarek 2019-10-10 05:56:01 MDT
Created attachment 11899 [details]
regression in 19.05.3 mitigation (v2)

Stijn, 

We're discussing internally how to best handle the issue we have in 19.05.3, so I'm not sure if this is the final version, but this should mitigate the issue correctly.

Please drop the commit from the previous patch. 

cheers,
Marcin
Comment 64 hpc-admin 2019-10-10 08:12:10 MDT
hi marcin,

with the patches from comments 59 and 63 (without the patch from 61), map_gpu works fine (incl. with gpus-per-task), but now gpu-bind=closest gives CUDA_VISIBLE_DEVICES=0 both with and without gpus-per-task=1 (nvidia-smi showing all gpus) 

stijn
Comment 65 Marcin Stolarek 2019-10-10 08:33:50 MDT
oh.. I see the issue. 

Can you try adding: 
>Name=gpu Type=tesla CPUs=0-15   File=/dev/nvidia[0-1]
>Name=gpu Type=tesla CPUs=16-31  File=/dev/nvidia[2-3]
to gres.conf as a verification/workaround? 

I understand that by --gpus-per-task working you mean that with --gpus-per-task=1 you get one device per task? This is expected; however, it's the result of --gpu-bind=map_gpu, since it assigns only one gpu per task.

cheers,
Marcin
Comment 66 hpc-admin 2019-10-10 08:37:55 MDT
hi marcin,

wrt the map_gpu/gpu-per-task=1, yes, you are right, they always get one device.

wrt the gres.conf: the cpus are the even and the odd ones respectively (or do you really mean 0-15 for devices 0-1, i.e. the slurm internal numbering?)

stijn
Comment 67 Marcin Stolarek 2019-10-10 08:40:07 MDT
>wrt the ges.conf: the cpus are the even and the odd ones resp (or do you really >mean 0-15 for device 0-1? (ie like the slurm internal numbering)

yes - gres.conf should follow Slurm abstract numbering, this will get translated to physical cores with the help of hwloc
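As a hedged illustration of that translation on a node like this one (an assumed 2-socket, 32-core layout where machine core ids alternate between sockets), abstract ids 0-15 land on the even physical cores of socket 0:

```python
# Illustrative model only (the real translation is done by hwloc):
# Slurm's abstract CPU ids enumerate socket 0 first (0-15), then
# socket 1 (16-31), while the machine ids on this node alternate
# between sockets (socket 0 = even ids, socket 1 = odd ids).
N_CORES = 32
CORES_PER_SOCKET = N_CORES // 2

def abstract_to_machine(abstract_id):
    """Map a Slurm abstract core id to the assumed machine core id."""
    socket = abstract_id // CORES_PER_SOCKET       # 0 or 1
    index_in_socket = abstract_id % CORES_PER_SOCKET
    return 2 * index_in_socket + socket

# gres.conf CPUs=0-15 (abstract) thus covers the even machine cores,
# CPUs=16-31 the odd ones.
print([abstract_to_machine(i) for i in range(4)])       # → [0, 2, 4, 6]
print([abstract_to_machine(i) for i in range(16, 20)])  # → [1, 3, 5, 7]
```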

cheers,
Marcin
Comment 68 hpc-admin 2019-10-10 08:45:18 MDT
hi marcin,


with the gres.conf above (i.e. no automatic nvml detection) gpu-bind=closest works (gpus-per-task does not, but that was as expected)



stijn
Comment 69 Marcin Stolarek 2019-10-10 08:49:56 MDT
Thanks for the prompt check. As I said before, we're having an internal discussion about the regression in 19.05.3; I'll get back to you as soon as we have it fixed.

cheers,
Marcin
Comment 70 Marcin Stolarek 2019-10-23 07:27:09 MDT
Stijn,

I'm sorry that it took a few days. However, I can share good news with you - a set of patches fixing the regression introduced in 19.05.3 was finally merged into the main repository[1]. The main patch for you is [2]; however, I recommend that you apply them all.

Together with the patches attached to this case, the map_gpu and closest --gpu-bind options should work as expected.

I know that you may not be able to apply the patches immediately given the current status of the cluster, but any feedback will be very much appreciated.

cheers,
Marcin


[1] https://github.com/SchedMD/slurm/commits/master (check Commits on Oct 23, 2019 )
[2] https://github.com/SchedMD/slurm/commit/d4b3cc6eeb80aaeec85139317e0a35b4d618fb3b
Comment 73 Marcin Stolarek 2020-05-07 01:40:26 MDT
Stijn,

Sorry I wasn't able to get back to you sooner. 

Some fixes we shared with you in this bug are already merged into the public repository in slurm-20.02.1[1] and slurm-19.05.4[2]. However, the issue with Autodetect=nvml properly setting the cores closest to GPUs is still under review. It looks like we found a better way to handle it than the one I shared in the initial patch (avoiding an ABI change), but we are struggling with some difficulties setting up an appropriate test environment.

What is the CPU model on the nodes where you've found the issue?
Would it be possible for you to test the different patch on the hardware where you noticed the issue?

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/0d8222ebc9c6c29374f539816353a136b833a596
[2]https://github.com/SchedMD/slurm/commit/5b13fbb33043bcc3002e9eee62f2b738b7ddb2ff
Comment 74 Marcin Stolarek 2020-06-09 08:35:59 MDT
I know it took us a long time; however, the good news is that the fix for the non-trivial mapping between abstract and machine CPU ids coming from Autodetect=NVML was merged into slurm-20.02[1].

Should you have any questions please don't hesitate to reopen.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/270c0f3d59b4409