Ticket 10474

Summary: Wrong node allocation with recommended CPU/GPU affinity
Product: Slurm
Reporter: IDRIS System Team <gensyshpe>
Component: Scheduling
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: brian, cinek, csc-slurm-tickets, nate, remi.lacroix
Version: 20.02.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10019
https://bugs.schedmd.com/show_bug.cgi?id=9670
Site: IDRIS
Version Fixed: 20.11.4 21.08pre1
Attachments: Slurmctld log with GRES debug
Slurm configuration file
GRES configuration file
verification patch(v1)

Description IDRIS System Team 2020-12-17 09:43:58 MST
Hi!

We have an issue with CPU/GPU affinity on Slurm 20.02.6 (patched with the fixes for #9670 and #9724). When we submit the following job:

    srun -A sos@gpu -n 1 -c 10 --gres=gpu:1 --hint=nomultithread ~/binding_mpi.exe

An error appears in slurmctld log:

    error: _block_sync_core_bitmap: b_min > nboards_nb (2 > 1) node:r7i3n7 core_bitmap:0-2,40-42

The job then starts on two nodes (although one would be enough) and blocks:

    srun: job 251 queued and waiting for resources
    srun: job 251 has been allocated resources
    [blocked]

Job information:

JobId=251 JobName=binding_mpi.exe
   UserId=user1(1001) GroupId=group1(1001) MCS_label=N/A
   Priority=161949 Nice=0 Account=group1@gpu QOS=qos_gpu-t3
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-12-17T16:41:24 EligibleTime=2020-12-17T16:41:24
   AccrueTime=2020-12-17T16:41:24
   StartTime=2020-12-17T16:41:32 EndTime=2020-12-17T16:51:32 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-17T16:41:32
   Partition=gpu_p13 AllocNode:Sid=jean-zay1:62428
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r7i3n[6-7]
   BatchHost=r7i3n6
   NumNodes=2 NumCPUs=12 NumTasks=1 CPUs/Task=10 ReqB:S:C:T=0:0:*:1
   TRES=cpu=12,mem=12G,node=2,billing=20,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=10 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/linkhome/idris/genidr/ssos250/binding_mpi.exe
   WorkDir=/home/user1
   Power=
   TresPerNode=gpu:1
   MailUser=user1 MailType=NONE

Node configuration:

    NodeName=r7i3n6 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=191752

GRES configuration (as recommended in #10019):

    NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[0-1] Cores=0-19
    NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[2-3] Cores=20-39

This bug and the one from #10019 do not appear if we use one of the following GRES configurations:

* without cores:

    NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[0-1]
    NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[2-3]

* with logical cores:

    NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[0-1] Cores=0-19,40-59
    NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[2-3] Cores=20-39,60-79
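
The difference between the two working configurations is whether hyperthread siblings are listed alongside the physical cores. On this topology (2 sockets x 20 cores x 2 threads, block CPU numbering), the socket-local CPU sets in the "with logical cores" variant can be derived as follows; this is an editorial sketch and the block numbering is an assumption, not something stated in the ticket:

```python
# Topology from the r7i3n6 node definition above.
SOCKETS, CORES_PER_SOCKET, THREADS = 2, 20, 2
TOTAL_CORES = SOCKETS * CORES_PER_SOCKET  # 40 physical cores

def socket_cpus(sock):
    """All logical CPU ids on a socket, assuming block numbering:
    thread t of core c has CPU id c + t * TOTAL_CORES."""
    cores = range(sock * CORES_PER_SOCKET, (sock + 1) * CORES_PER_SOCKET)
    return [c + t * TOTAL_CORES for t in range(THREADS) for c in cores]

print(socket_cpus(0))  # cores 0-19 plus siblings 40-59
print(socket_cpus(1))  # cores 20-39 plus siblings 60-79
```

This reproduces the Cores=0-19,40-59 and Cores=20-39,60-79 ranges of the "with logical cores" configuration.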

Could you help us resolve this issue?
Comment 1 Michael Hinton 2020-12-18 16:04:57 MST
(In reply to IDRIS System Team from comment #0)
>     error: _block_sync_core_bitmap: b_min > nboards_nb (2 > 1) node:r7i3n7
> core_bitmap:0-2,40-42
It looks like the job's core bitmap is messed up: it should contain core indices, but here it appears to contain CPU (thread) indices. I'm not sure why, and I'm not sure why gres.conf's Cores has anything to do with this part of the code. We'll look into it.

Thanks,
-Michael
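
The mismatch Michael describes can be illustrated with the node's topology. Assuming Linux's common block numbering (CPU ids 40-79 are the second hyperthreads of cores 0-39 — an assumption, not confirmed in the ticket), the bitmap 0-2,40-42 from the error only makes sense as CPU ids:

```python
# Topology from the r7i3n6 node definition: 2 sockets x 20 cores x 2 threads.
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 20, 2
TOTAL_CORES = SOCKETS * CORES_PER_SOCKET  # 40

def cpu_to_core(cpu_id):
    """Map a logical CPU id to its physical core under block numbering."""
    return cpu_id % TOTAL_CORES

# The bitmap from the _block_sync_core_bitmap error message:
bitmap = [0, 1, 2, 40, 41, 42]
print(sorted({cpu_to_core(c) for c in bitmap}))  # -> [0, 1, 2]
```

Read as CPU ids, the bitmap is simply cores 0-2 with both hyperthreads; read as core ids, entries 40-42 exceed the 40 physical cores, which is consistent with the b_min > nboards_nb failure.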
Comment 3 Marcin Stolarek 2020-12-22 03:25:37 MST
Could you please share slurmctld logs with the GRES debug flag enabled? You can do that without a restart by simply executing:
> scontrol setdebugflag +Gres
and disable it after reproducing the issue with:
> scontrol setdebugflag -Gres

cheers,
Marcin
Comment 4 IDRIS System Team 2020-12-30 09:16:23 MST
Created attachment 17298 [details]
Slurmctld log with GRES debug
Comment 6 Michael Hinton 2020-12-30 15:58:08 MST
Could you please also attach your full slurm.conf and gres.conf? Thanks
Comment 8 IDRIS System Team 2020-12-31 04:32:52 MST
Created attachment 17305 [details]
Slurm configuration file
Comment 9 IDRIS System Team 2020-12-31 04:34:06 MST
Created attachment 17306 [details]
GRES configuration file
Comment 10 Marcin Stolarek 2021-01-03 21:35:42 MST
Created attachment 17314 [details]
verification patch(v1)

Could you please apply the attached patch and check how it works for you?

cheers,
Marcin
Comment 12 IDRIS System Team 2021-01-07 04:26:34 MST
Hi!

The patch seems to solve the current issue (the job is no longer blocked, and there is no _block_sync_core_bitmap error), but we now see a problem when using --exclusive. We don't know whether it is related to the patch.

When running:

    sbatch --ntasks=8 --cpus-per-task=10 --hint=nomultithread --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
    > #!/bin/bash
    > srun hostname
    > EOF

Job output:

    srun: error: Unable to create step for job 506: More processors requested than permitted

The same sbatch command without --exclusive produces the expected result. It also works with --exclusive when using srun directly:

    srun --ntasks=8 --cpus-per-task=10 --hint=nomultithread --qos=qos_cpu-dir --account=xyz --exclusive hostname
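
For scale, a back-of-the-envelope sketch of the step's CPU demand (node figures taken from the r7i3n6 definition above; that the allocation spans exactly two nodes is an inference, not stated in the ticket):

```python
# Step size requested in the failing sbatch: 8 tasks x 10 CPUs per task.
ntasks, cpus_per_task = 8, 10

# With --hint=nomultithread only one thread per core is usable:
# 2 sockets x 20 cores = 40 usable CPUs per node on this hardware.
cores_per_node = 40

needed = ntasks * cpus_per_task               # 80 CPUs
nodes_needed = -(-needed // cores_per_node)   # ceiling division -> 2 nodes
print(needed, nodes_needed)
```

So the step needs every usable core of two nodes; the "More processors requested than permitted" error suggests the exclusive allocation's CPU accounting disagrees with this count somewhere.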


(In reply to Marcin Stolarek from comment #10)
> Created attachment 17314 [details]
> verification patch(v1)
> 
> Could you please apply the attached patch and check how it works for you?
> 
> cheers,
> Marcin
Comment 14 Marcin Stolarek 2021-01-12 10:05:41 MST
OK. I'm passing the patch to the review team.
I think the other issue you mentioned is something we're already working on, but unfortunately that bug is not public.

Could you please submit it as a separate ticket, or would you like me to create it? (Due to technical limitations of Bugzilla I will be the reporter there, but I can add you on CC.)

cheers,
Marcin
Comment 16 IDRIS System Team 2021-01-14 01:05:34 MST
You can create it, thanks!

(In reply to Marcin Stolarek from comment #14)
> OK. I'm passing the patch to the review team.
> I think the other issue you mentioned is something we're already
> working on, but unfortunately that bug is not public.
> 
> Could you please submit it as a separate ticket, or would you like me to
> create it? (Due to technical limitations of Bugzilla I will be the reporter
> there, but I can add you on CC.)
> 
> cheers,
> Marcin
Comment 17 Marcin Stolarek 2021-01-14 02:34:13 MST
Opened Bug 10627 to continue the work on the issue from comment 12.
Comment 22 Marcin Stolarek 2021-02-14 13:17:05 MST
I'm closing this ticket now that the patch has been merged into our main repository:
https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda


cheers,
Marcin
Comment 23 Marcin Stolarek 2021-03-01 03:05:48 MST
*** Ticket 10948 has been marked as a duplicate of this ticket. ***