Ticket 10077 - Wrong number of CPU allocated in Slurm cgroup on shared GPU nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.4
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
Reported: 2020-10-27 11:43 MDT by IDRIS System Team
Modified: 2020-12-17 03:46 MST

See Also:
Site: GENCI - Grand Equipement National de Calcul Intensif
Version Fixed: 20.02.6 20.11.1


Description IDRIS System Team 2020-10-27 11:43:50 MDT
With Slurm v20.02.4 and the patch provided by #10019, we are facing an issue with the number of CPUs allocated in the Slurm cgroup on GPU nodes.

We submit a job with one task, 10 CPUs per task, 1 GPU, and no multithreading. In the cgroup on the allocated node, only 8 cores are available. Could you help us fix this issue?

[user1@jean-zay2: ~]$ srun -A foo@gpu -n 1 -c 10 -p gpu_p1 --gres=gpu:1 --hint=nomultithread --pty bash
srun: job 610468 queued and waiting for resources
srun: job 610468 has been allocated resources
bash-4.4$ cat /sys/fs/cgroup/memory/`grep memory /proc/self/cgroup | cut -d: -f 3`/memory.limit_in_bytes
34359738368
bash-4.4$ scontrol show job $SLURM_JOB_ID                                                               
JobId=610468 JobName=bash
   UserId=user1(300277) GroupId=group1(300225) MCS_label=N/A
   Priority=171473 Nice=0 Account=foo@gpu QOS=qos_gpu-t3
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-10-27T17:50:39 EligibleTime=2020-10-27T17:50:39
   AccrueTime=2020-10-27T17:50:39
   StartTime=2020-10-27T17:50:44 EndTime=2020-10-27T18:00:46 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-27T17:50:44
   Partition=gpu_p1 AllocNode:Sid=jean-zay2-ib0:70686
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r6i4n7
   BatchHost=r6i4n7
   NumNodes=1 NumCPUs=10 NumTasks=1 CPUs/Task=10 ReqB:S:C:T=0:0:*:1
   TRES=cpu=10,mem=20G,node=1,billing=10,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=10 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/path/to/workdir
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

bash-4.4$ cat /sys/fs/cgroup/cpuset/`grep cpuset /proc/self/cgroup | cut -d: -f 3`/cpuset.*cpus
2-6,17-19,42-46,57-59
2-6,17-19,42-46,57-59
bash-4.4$ nvidia-smi
Tue Oct 27 17:51:23 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   45C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# slurmctld log
[2020-10-27T17:50:44.032] error: _block_sync_core_bitmap: b_min > nboards_nb (2 > 1) node:r6i4n7 core_bitmap:2-6,17-19
[2020-10-27T17:50:44.035] error: _block_sync_core_bitmap: b_min > nboards_nb (2 > 1) node:r6i4n7 core_bitmap:2-6,17-19
[2020-10-27T17:50:44.036] backfill: Started JobId=610468 in gpu_p1 on r6i4n7

# slurmd log
[2020-10-27T17:50:46.186] _run_prolog: run job script took usec=134363
[2020-10-27T17:50:46.200] _run_prolog: prolog with lock for job 610468 ran for 0 seconds
[2020-10-27T17:50:46.430] [610468.extern] task/cgroup: /slurm/uid_300277/job_610468: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2020-10-27T17:50:46.430] [610468.extern] task/cgroup: /slurm/uid_300277/job_610468/step_extern: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2020-10-27T17:50:48.889] launch task 610468.0 request from UID:300277 GID:300225 HOST:10.148.0.21 PORT:13506
[2020-10-27T17:50:48.889] lllp_distribution jobid [610468] binding: threads,one_thread, dist 8192
[2020-10-27T17:50:48.889] _task_layout_lllp_block
[2020-10-27T17:50:48.889] error: task/affinity: only 16 bits in avail_map, CPU_BIND_ONE_THREAD_PER_CORE requires 20!
[2020-10-27T17:50:48.889] error: lllp_distribution jobid [610468] overriding binding: threads,mask_cpu,one_thread
[2020-10-27T17:50:48.889] error: Verify socket/core/thread counts in configuration
[2020-10-27T17:50:48.955] [610468.0] task/cgroup: /slurm/uid_300277/job_610468: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2020-10-27T17:50:48.955] [610468.0] task/cgroup: /slurm/uid_300277/job_610468/step_0: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2020-10-27T17:50:48.977] [610468.0] in _window_manager
[2020-10-27T17:50:48.979] [610468.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-10-27T17:55:30.033] [610468.0] done with job
[2020-10-27T17:55:30.050] [610468.extern] done with job
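For reference, the cpuset reported above can be expanded to count logical CPUs. The following is a minimal bash sketch (`count_cpus` is an illustrative helper, not part of Slurm): the list `2-6,17-19,42-46,57-59` expands to 16 logical CPUs, i.e. 8 physical cores plus their hyperthread siblings, which matches the `only 16 bits in avail_map ... requires 20` error for a 10-core, one-thread-per-core request.

```shell
# count_cpus: count the logical CPUs in a cpuset list such as "2-6,17-19"
count_cpus() {
  local total=0 part lo hi
  IFS=',' read -ra parts <<< "$1"
  for part in "${parts[@]}"; do
    if [[ "$part" == *-* ]]; then
      # range entry, e.g. "2-6" -> 5 CPUs
      lo=${part%-*}; hi=${part#*-}
      total=$(( total + hi - lo + 1 ))
    else
      # single CPU entry, e.g. "8"
      total=$(( total + 1 ))
    fi
  done
  echo "$total"
}

count_cpus "2-6,17-19,42-46,57-59"   # prints 16
```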
Comment 3 Marcin Stolarek 2020-10-27 14:14:00 MDT
There were some changes between 20.02.4 and 20.02.5 in the core selection logic for --threads-per-core=1 (which is set internally by --hint=nomultithread) in combination with --cpus-per-task (-c) and a --gres specification. We are still working on some related bugs, but to make sure all efforts are aligned, could you please:
-) Upgrade to Slurm 20.02.5.
-) Verify the issue and share slurmctld logs, with the GRES debug flag enabled, from the time the test was performed?

cheers,
Marcin
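The GRES debug flag requested above can be enabled either in slurm.conf or at runtime. A minimal sketch (both forms are standard Slurm mechanisms):

```
# slurm.conf: enable verbose GRES logging in slurmctld
DebugFlags=Gres

# Or toggle at runtime, without restarting the daemon:
#   scontrol setdebugflags +gres
```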
Comment 4 Marcin Stolarek 2020-10-30 12:55:32 MDT
Additionally, can you check 'scontrol show job -d'? Do I understand correctly that by the patch from Bug 10019 you mean a configuration change, i.e. the addition of Cores to gres.conf?

Cheers,
Marcin
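For context, the gres.conf change referenced here typically looks like the following; the device files and core ranges below are purely illustrative, not taken from this site's configuration:

```
# gres.conf (hypothetical): bind each GPU to the cores of its local socket
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39
```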
Comment 5 Marcin Stolarek 2020-11-09 08:24:37 MST
Could you please take a look at the last comments? 

cheers,
Marcin
Comment 6 IDRIS System Team 2020-11-13 03:05:41 MST
Sorry for the delay.

The given bug number for the patch was wrong. We used the one provided by #9670.

We will try Slurm v20.02.5 as soon as we can.
Comment 7 Marcin Stolarek 2020-11-13 03:20:42 MST
>Sorry for the delay.
No issue at all.

>The given bug number for the patch was wrong. We used the one provided by #9670.
Thanks for clarifying that.

>We will try Slurm v20.02.5 as soon as we can.
Just go to our downloads page[1]; the latest release of 20.02 is 20.02.6.

I see that one of the versions of slurm.conf you shared with us before contained lines with DefCpuPerGPU. Is one of those in use today? If so, please take a look at Bug 9947, fixed in 20.02.7 (not yet released).


cheers,
Marcin
[1]https://www.schedmd.com/downloads.php
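For reference, DefCpuPerGPU is set on a partition line in slurm.conf. A hypothetical example (the partition and node names are illustrative, not this site's configuration):

```
# slurm.conf (hypothetical): allocate 10 CPUs per GPU by default in this partition
PartitionName=gpu_p1 Nodes=r6i4n[0-9] DefCpuPerGPU=10 State=UP
```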
Comment 8 Marcin Stolarek 2020-11-26 07:31:29 MST
Just following up on the case: were you able to upgrade and gather the logs requested in comment 3?

cheers,
Marcin
Comment 9 Marcin Stolarek 2020-12-14 06:25:51 MST
Let me know if you want to get back to this ticket. In case of no reply, I'll close the case as timed out.

cheers,
Marcin
Comment 10 IDRIS System Team 2020-12-17 01:44:58 MST
We were not able to reproduce the issue with Slurm 20.02.6 (and patches for #9670 and #9724).

Comment 11 Marcin Stolarek 2020-12-17 03:46:23 MST
Thanks for the feedback; based on your reply I'm marking it as fixed in 20.02.6.

cheers,
Marcin