Ticket 17152

Summary: scontrol shows 1-gpu allocated but $user can access two
Product: Slurm Reporter: Prabhjyot Saluja <prabhjyot_saluja>
Component: GPU    Assignee: Ben Glines <ben.glines>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: david_johnson, geoffrey_avila, paul_stey, Samuel_Fulcomer
Version: 22.05.7   
Hardware: Linux   
OS: Linux   
Site: Brown Univ Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: RHEL
Machine Name: gpu2107 CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: user job script
user job script
slurm.conf
gres.conf
cgroup.conf

Description Prabhjyot Saluja 2023-07-07 12:12:39 MDT
Hi - We have a situation where user1 requested a job with gres:gpu:1, and the command 'scontrol show job <JOBID> -d' confirms the job's resource allocation as (cpu=8, mem=32G, node=1, billing=8, gres/gpu=1). However, when user1 checks with tools like 'nvidia-smi' or 'nvtop', they can actually see two GPUs available. The problem is that user2's processes are also running on the same GPU, resulting in conflicts and competition for GPU resources. We looked at the PID and at /proc/10019/fd (attached).

Although the job is still running, we would like to gather additional debug information for you before it ends. Is there any specific data or details we can provide to assist in resolving the issue?


$ scontrol show job 10358411 -d
JobId=10358411 JobName=sys/dashboard/sys/bc_ccv_jupyter_virtualenv
   UserId=qzhang64(140308453) GroupId=gk(6670) MCS_label=N/A
   Priority=1020549 Nice=0 Account=default QOS=gk-3090-gcondo
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=1-20:46:42 TimeLimit=4-00:00:00 TimeMin=N/A
   SubmitTime=2023-07-05T17:08:02 EligibleTime=2023-07-05T17:08:02
   AccrueTime=2023-07-05T17:08:02
   StartTime=2023-07-05T17:08:04 EndTime=2023-07-09T17:08:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-05T17:08:04 Scheduler=Main
   Partition=3090-gcondo AllocNode:Sid=poodcit3:27905
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu2107
   BatchHost=gpu2107
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=32G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   JOB_GRES=gpu:1
     Nodes=gpu2107 CPU_IDs=33,35-41 Mem=32768 GRES=gpu:1(IDX:5)
   MinCPUsNode=1 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/qzhang64/ondemand/data/sys/dashboard/batch_connect/sys/bc_ccv_jupyter_virtualenv/output/e0dde888-5997-4047-8826-64009839c9e7
   StdErr=/users/qzhang64/ondemand/data/sys/dashboard/batch_connect/sys/bc_ccv_jupyter_virtualenv/output/e0dde888-5997-4047-8826-64009839c9e7/output.log
   StdIn=/dev/null
   StdOut=/users/qzhang64/ondemand/data/sys/dashboard/batch_connect/sys/bc_ccv_jupyter_virtualenv/output/e0dde888-5997-4047-8826-64009839c9e7/output.log
   Power=
   TresPerNode=gres:gpu:1

[root@gpu2107 ~]# cd /proc/10019/fd
[root@gpu2107 fd]# ls -l | grep nvidia
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 64 -> /dev/nvidiactl
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 65 -> /dev/nvidia-uvm
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 67 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 69 -> /dev/nvidiactl
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 70 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 71 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 72 -> /dev/nvidia7
lr-x------ 1 qzhang64 gk 64 Jul  7 13:38 73 -> /dev/nvidia-caps/nvidia-cap2
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 74 -> /dev/nvidia5
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 75 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 76 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 77 -> /dev/nvidia5
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 78 -> /dev/nvidia5
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 80 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 81 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 82 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 84 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 85 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 86 -> /dev/nvidia7
lrwx------ 1 qzhang64 gk 64 Jul  7 13:38 87 -> /dev/nvidia7
[root@gpu2107 fd]#
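For cross-checking, the /proc/<pid>/fd inspection shown above can be automated. A minimal sketch (the function names are my own, not from the ticket) that extracts which /dev/nvidiaN devices a process has open, ignoring the control/UVM/capability devices:

```python
import os
import re

def gpu_indices(fd_targets):
    """Return the set of GPU minor numbers referenced by a list of
    /proc/<pid>/fd symlink targets (e.g. "/dev/nvidia7" -> 7).
    /dev/nvidiactl, /dev/nvidia-uvm, and /dev/nvidia-caps/* are ignored."""
    indices = set()
    for target in fd_targets:
        m = re.fullmatch(r"/dev/nvidia(\d+)", target)
        if m:
            indices.add(int(m.group(1)))
    return indices

def pid_gpu_indices(pid):
    """Resolve the open-fd symlinks of a live process and report which
    GPU devices it holds open (needs permission to read /proc/<pid>/fd)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for fd in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # fd closed between listdir() and readlink()
    return gpu_indices(targets)
```

Applied to the listing above, this would report {5, 7}: the process holds both /dev/nvidia5 and /dev/nvidia7 open, even though the job was allocated only one GPU.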

Regards,
Singh
Comment 1 Prabhjyot Saluja 2023-07-07 12:16:51 MDT
I should mention in our gres.conf file we have AutoDetect=nvml.
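For reference, with NVML autodetection the whole gres.conf can be as minimal as the following sketch (Slurm then queries the NVIDIA driver at slurmd startup for the device count, types, and core affinity, which must stay consistent with the Gres= count in slurm.conf):

```
# gres.conf -- minimal NVML autodetection sketch
AutoDetect=nvml
```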
Comment 2 Prabhjyot Saluja 2023-07-07 12:32:39 MDT
Detailed explanation: user1 was assigned the GPU at (IDX:5) and user2 was assigned (IDX:4), but user2's processes are running on the (IDX:5) GPU instead. This prevents user1 from getting all the memory available on that 3090 GPU.

scontrol show job 10364500_35 -d
JobId=10367489 ArrayJobId=10364500 ArrayTaskId=35 ArrayTaskThrottle=50 JobName=CIP-5-2-mv
   UserId=makbulut(140402267) GroupId=gdk(1176) MCS_label=N/A
   Priority=1008488 Nice=0 Account=default QOS=cs-3090-gcondo
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=07:10:25 TimeLimit=10-23:00:00 TimeMin=N/A
   SubmitTime=2023-07-06T15:30:48 EligibleTime=2023-07-06T15:30:49
   AccrueTime=2023-07-06T15:30:49
   StartTime=2023-07-07T07:18:57 EndTime=2023-07-18T06:18:57 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-07T07:18:57 Scheduler=Main
   Partition=3090-gcondo AllocNode:Sid=node1801:65682
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu2107
   BatchHost=gpu2107
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8G,node=1,billing=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   JOB_GRES=gpu:1
     Nodes=gpu2107 CPU_IDs=32 Mem=8192 GRES=gpu:1(IDX:4)
   MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=.onager/scripts/CIP-5-2-mv/wrapper.sh
   WorkDir=/oscar/home/makbulut/motor_skills
   StdErr=/oscar/home/makbulut/motor_skills/.onager/logs/slurm/CIP-5-2-mv_10364500_35.e
   StdIn=/dev/null
   StdOut=/oscar/home/makbulut/motor_skills/.onager/logs/slurm/CIP-5-2-mv_10364500_35.o
   Power=
   TresPerNode=gres:gpu:1
Comment 3 Prabhjyot Saluja 2023-07-07 12:33:34 MDT
(duplicate of comment 2)
Comment 4 Prabhjyot Saluja 2023-07-07 12:46:13 MDT
Logical GPU UUIDs
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-87a9d0e5-5853-25af-f767-6bfef38a13b5)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-38688e7c-979f-8bc0-8be6-413327b63fb3)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-a0fd89c7-8c60-1c4b-ad56-5314588a71a6)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-3ac61cff-cd5b-d1b4-b8e0-aac8303d637e)
GPU 4: NVIDIA GeForce RTX 3090 (UUID: GPU-037a7c4c-c061-037f-9875-cd0ce3df6ee9)
GPU 5: NVIDIA GeForce RTX 3090 (UUID: GPU-43868453-ca5a-2de0-7867-e4d42722743a)
GPU 6: NVIDIA GeForce RTX 3090 (UUID: GPU-ffa870d1-420e-7ece-daa3-c80a7bebdf06)
GPU 7: NVIDIA GeForce RTX 3090 (UUID: GPU-778c1463-c83d-dfbc-df8a-5cc86175d6e1)

[user1@gpu2107 ~]$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-38688e7c-979f-8bc0-8be6-413327b63fb3) - gpu1(root)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-43868453-ca5a-2de0-7867-e4d42722743a) - gpu5(root)

[user2@gpu2107 ~]$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-037a7c4c-c061-037f-9875-cd0ce3df6ee9) - gpu4(root)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-43868453-ca5a-2de0-7867-e4d42722743a) - gpu5(root)


gpu5 is allocated to user1, so how can user2 also access it? That's the problem.
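The overlap between the two users' views can be checked mechanically. A small sketch (not part of the ticket) that intersects the UUID sets from the two `nvidia-smi -L` outputs; with working cgroup GPU isolation the intersection should be empty:

```python
import re

def visible_uuids(nvidia_smi_L_output):
    """Extract GPU UUIDs from `nvidia-smi -L` output text."""
    return set(re.findall(r"GPU-[0-9a-f-]+", nvidia_smi_L_output))

def shared_gpus(out_a, out_b):
    """UUIDs visible to both users; non-empty means two jobs can
    reach the same physical GPU."""
    return sorted(visible_uuids(out_a) & visible_uuids(out_b))
```

Run against the outputs above, both users share GPU-43868453-ca5a-2de0-7867-e4d42722743a (gpu5).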
Comment 5 Jason Booth 2023-07-07 14:01:55 MDT
Please attach your slurm.conf, cgroup.conf, and gres.conf. Please also include the submission script and/or arguments used to submit this job.
Comment 6 David D. Johnson 2023-07-07 16:03:49 MDT
Created attachment 31158 [details]
user job script
Comment 7 David D. Johnson 2023-07-07 16:04:19 MDT
Created attachment 31159 [details]
user job script
Comment 8 David D. Johnson 2023-07-07 16:04:38 MDT
Created attachment 31160 [details]
slurm.conf
Comment 9 David D. Johnson 2023-07-07 16:04:59 MDT
Created attachment 31161 [details]
gres.conf
Comment 10 David D. Johnson 2023-07-07 16:05:47 MDT
Created attachment 31162 [details]
cgroup.conf
Comment 11 David D. Johnson 2023-07-07 16:18:24 MDT
User two has submitted multiple jobs using the same script, so although the job ID doesn't match the one in the file I uploaded, the script is the same.
Comment 12 Ben Glines 2023-07-18 11:13:10 MDT
Hello,

After reviewing your slurm.conf and your nvidia-smi output, it appears that you have 8 GPUs on the gpu2107 node, but only 7 GPUs configured for that node in your slurm.conf:
> NodeName=gpu[2107] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32
> ThreadsPerCore=1 RealMemory=1030000 Feature=amd,gpu,geforce3090,ampere Gres=gpu:7
> Weight=1600

I suspect that "gpu5" (UUID: GPU-43868453-ca5a-2de0-7867-e4d42722743a) is not under the control of Slurm, and thus it is available to any users on the node regardless of their job's allocation.

Try changing the node definition to include all 8 gpus, and see if that fixes things.
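Assuming the rest of the node definition stays as quoted above, the corrected line would look like the following sketch (slurmd/slurmctld typically need to pick up the change, e.g. via `scontrol reconfigure` or a restart, since node Gres counts are read from slurm.conf):

```
NodeName=gpu[2107] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=1030000 Feature=amd,gpu,geforce3090,ampere Gres=gpu:8 Weight=1600
```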
Comment 13 Prabhjyot Saluja 2023-07-19 07:01:16 MDT
Thank you for your thorough attention to detail and for catching the mistake. We encountered an issue with one GPU on this node but overlooked the need to modify the slurm.conf file when the GPU was reinstalled. This ticket can be marked as resolved. 

Regards,
Singh
Comment 14 Ben Glines 2023-07-19 16:45:16 MDT
Sounds good! Always happy to help.

Closing now.