| Summary: | GPU Cgroup restrictions not working | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Adam <asa188> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | brian, kaizaad, kamil, kilian, marshall |
| Version: | 17.11.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8777 | ||
| Site: | Simon Fraser University | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 17.11.0 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Slurmd log for 17.11.0 full release in -vvv debug mode | | |
| | Job log from GPU test run | | |
Created attachment 5646 [details]
Job log from GPU test run
My job log doesn't cover two jobs running at the same time; it presents a single job being able to access all 4 GPUs, which keeps it clean and simple. The job log submitted is from the slurmd running the full release of 17.11.0.

Adam, I reproduced the problem, but I actually had the same problem on 17.02 as well as 17.11. I'm looking for the source of the problem and will be working on a fix now.

Thanks Marshall. Was that 17.02.10 you were testing on? I don't believe I noticed the problem on 17.02.[4,5,9].

I learned something else. If ConstrainDevices=yes in your cgroup.conf, then the job only ever sees the GPUs it was allocated. Since each job was only allocated 2 GPUs, you see indices 0,1 for CUDA_VISIBLE_DEVICES on each job. But I think they should still be the 4 different GPUs, not the same 2.
If I set ConstrainDevices=no in my cgroup.conf, then the job sees all GPUs but seems to just use the ones it was allocated.
marshall@byu:~/slurm/17.11/byu$ salloc --gres=gpu:2
salloc: Granted job allocation 5448
marshall@byu:~/slurm/17.11/byu$ srun env | grep -i cuda
CUDA_VISIBLE_DEVICES=0,1
marshall@byu:~/slurm/17.11/byu/slurm$ salloc --gres=gpu:2
salloc: Granted job allocation 5449
marshall@byu:~/slurm/17.11/byu/slurm$ srun env | grep -i cuda
CUDA_VISIBLE_DEVICES=2,3
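For reference, a minimal sketch of the cgroup.conf knob being toggled in the test above. Only ConstrainDevices itself comes from this thread; the other lines are illustrative assumptions about a typical GPU node setup.

```
# cgroup.conf (sketch; only ConstrainDevices is from this ticket)
CgroupAutomount=yes
ConstrainDevices=yes        # deny access to device files not in the allocation
# ConstrainDevices=no       # jobs see every GPU; CUDA_VISIBLE_DEVICES indexes them all

# slurm.conf must also load the cgroup task plugin:
#   TaskPlugin=task/cgroup
# and gres.conf must enumerate the GPU device files, e.g.:
#   Name=gpu File=/dev/nvidia[0-3]
```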
> running the CUDA bandwidthTest sample suite
> runs against the first 2 GPUs in this case on both jobs, the second job will
> fail if the first one is using nvidia[0,1] already.
Can you double-check for me that this is a Slurm issue, not something else? Can you try it with ConstrainDevices=no as well as ConstrainDevices=yes? It should work both times (assuming there aren't any issues outside of Slurm).
I was running 17.02.10 and 17.11.0. However, I just tested on 17.02.9 and got the same behavior.
Sorry, I should have mentioned it at the start: we are using ConstrainDevices=yes in production and in the initial testing that I showed. I've tested with ConstrainDevices=no, and it does set CUDA_VISIBLE_DEVICES appropriately in 17.11.0 (the second job gets 2,3).

When ConstrainDevices=yes, CUDA_VISIBLE_DEVICES should always start at 0, so if 2 devices are chosen it will be 0,1.

In a correctly working setup with ConstrainDevices=yes, if the job gets constrained to /dev/nvidia2 and /dev/nvidia3, then CUDA_VISIBLE_DEVICES will be 0,1, which is correct, and it will cause the job to operate only on those 2 /dev/nvidia devices. What I am seeing in 17.11.0 with ConstrainDevices=yes is that CUDA_VISIBLE_DEVICES is correct in saying 0,1, but because the node is failing to constrain the job to its allocated /dev/nvidia devices, it ends up seeing all 4, and 0,1 equals /dev/nvidia0 and /dev/nvidia1 in that case... which means any job on that node will at least attempt to use /dev/nvidia0 and run into a conflict. I am unable to run into this issue using 17.02.[4,5,9] with ConstrainDevices=yes. While we did upgrade NVIDIA driver versions at the same time as we upgraded Slurm versions, I'm not convinced that could be the cause, since I'm unable to reproduce the breakage in the older Slurm versions. I think the easiest way to prove that it's not constraining properly is to run a job that requests 1 GPU and runs $(nvidia-smi); if it shows more than 1 GPU on a multi-GPU node, then it's broken.

You're right, it was broken. The cgroup wasn't correctly adding devices to devices.deny and devices.allow. It was correct in 17.02 but broke in 17.11. A patch will be forthcoming.

Thanks Marshall. I should post the other 2 bugs that I know about for 17.11.0 then ;)

This has been fixed in commit 434acb17c8526b.

Looks like I jumped the gun on closing this bug. There are still problems, though the previously mentioned patch did fix a big one. Devices aren't being denied if the user doesn't request any GRES, and parsing of the cgroup.conf file isn't ignoring comments and blank lines. There may be other problems, too. Reopening the bug.

Devices are properly being denied; see commit 0ed03cda5bcf4. The missing debug statement is fixed in commit ee68721350dc46. cgroup.conf does ignore comments, blank lines, etc. - it just tells you about it when you have debug3 on, which is why I was mistaken. Closing as resolved/fixed.

I forgot to actually close the bug in the previous comment... Closing as resolved/fixed.

*** Ticket 4518 has been marked as a duplicate of this ticket. ***

Quick update: thanks for the quick fix on the bug, guys. We've brought it into production at 2 sites and it's working smoothly. We built from the top of the 17.11 branch as of Dec 11, 2017.
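The one-GPU sanity check Adam proposes (request 1 GPU, run nvidia-smi, count what the job can see) can be scripted as a minimal batch sketch. The GRES name is an assumption about a typical gres.conf, and the verdict helper is mine for illustration, not part of Slurm.

```shell
#!/bin/bash
# Sketch: submit with 1 GPU and count how many devices the job actually sees.
#SBATCH --gres=gpu:1

# Decide whether a visible-GPU count is consistent with a 1-GPU allocation.
verdict() {
    if [ "$1" -gt 1 ]; then
        echo "BROKEN"  # the cgroup failed to deny the unallocated GPUs
    else
        echo "OK"      # the device constraint held
    fi
}

# Inside the job, nvidia-smi should list exactly one GPU on a healthy node.
if command -v nvidia-smi >/dev/null 2>&1; then
    visible=$(nvidia-smi --list-gpus | wc -l)
    echo "GPUs visible: ${visible} -> $(verdict "${visible}")"
fi
```

If the script prints BROKEN on a multi-GPU node, the ConstrainDevices path is not denying devices, which is exactly the failure described in this ticket.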
Created attachment 5645 [details]
Slurmd log for 17.11.0 full release in -vvv debug mode

We're currently on 17.11.0rc3, but I've tested the full release of 17.11.0 on a slurmd agent and received identical results. This was not an issue in 17.02.x (tested .4, .5, .9), even if the agent is 17.02.x while the slurmctld is 17.11.0rc3.

When a job is launched with TaskPlugin=task/cgroup and --gres=gpu:2 (we have 4 GPUs), it gets CUDA_VISIBLE_DEVICES=0,1, and the logs show "Default allow" for /dev/nvidia0 through 3, then immediately afterwards (like it should) "Not allowing access to device /dev/nvidia2 for job" as well as nvidia3, which is correct. The problem is that it does not actually deny access to those GPUs. If we get 2 jobs on the same GPU node, because they've asked for anything less than all 4 GPUs, they will end up attempting to use the exact same GPUs that are already in use. $(nvidia-smi) agrees that they can see all 4 GPUs, and running the CUDA bandwidthTest sample suite runs against the first 2 GPUs in this case on both jobs; the second job will fail if the first one is using nvidia[0,1] already.

This may or may not be related to another bug report I will be filing for 17.11.0 regarding step .extern ($SLURM_JOB_ID.extern) versus the previous ($SLURM_JOB_ID.4294967295). In that case, the jobs almost always begin by failing to clean up the starting step (_remove_starting_step), because it's looking for $SLURM_JOB_ID.4294967295 to clean up, except that has finally been fully renamed to $SLURM_JOB_ID.extern.
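One way to verify the deny at the kernel level, independent of CUDA, is to read the job's devices cgroup directly. This is a sketch assuming the cgroup v1 layout Slurm mounts by default; the uid/job path components are placeholders, and the counting helper is mine, not a Slurm interface.

```shell
#!/bin/bash
# Sketch: confirm the cgroup actually denies unallocated GPUs by reading
# devices.list for a job. cgroup v1 layout assumed; uid/job IDs below are
# placeholders for a real allocation.

# Count NVIDIA character-device entries (major 195) in devices.list text
# supplied on stdin. With a default-deny whitelist, devices.list shows
# only the devices the job may open.
count_nvidia_allowed() {
    grep -c '^c 195:' || true
}

CG=/sys/fs/cgroup/devices/slurm/uid_1000/job_5449
if [ -r "${CG}/devices.list" ]; then
    # A job allocated 2 of the node's 4 GPUs should report exactly 2 here.
    echo "nvidia devices allowed: $(count_nvidia_allowed < "${CG}/devices.list")"
fi
```

A count equal to the node's full GPU count for a partial-GPU job reproduces the bug: the slurmd logged "Not allowing access" but never wrote the corresponding entries to devices.deny/devices.allow.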