(In reply to IDRIS System Team from comment #0)
> error: _block_sync_core_bitmap: b_min > nboards_nb (2 > 1) node:r7i3n7
> core_bitmap:0-2,40-42

It looks like the job's core bitmap is messed up: it should contain cores, but it looks like it contains CPUs here. I'm not sure why, and I'm not sure why gres.conf's Cores has anything to do with this part of the code. We'll look into it.

Thanks,
-Michael

Could you please share slurmctld logs with the GRES debug flag enabled? You can enable it without a restart by executing:
> scontrol setdebugflag +Gres
and disable it after reproducing the issue with:
> scontrol setdebugflag -Gres

cheers,
Marcin

Created attachment 17298 [details]
Slurmctld log with GRES debug
Could you please also attach your full slurm.conf and gres.conf? Thanks

Created attachment 17305 [details]
Slurm configuration file
Created attachment 17306 [details]
GRES configuration file
Created attachment 17314 [details]
verification patch(v1)
Could you please apply the attached patch and check how it works for you?
cheers,
Marcin
Hi!
The patch seems to solve the current issue (the job is no longer blocked, and the block_sync_core_bitmap error is gone), but we now experience a problem when using --exclusive. We don't know if it's related to the patch.
When running:
sbatch --ntasks=8 --cpus-per-task=10 --hint=nomultithread --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
> #!/bin/bash
> srun hostname
> EOF
Job output:
srun: error: Unable to create step for job 506: More processors requested than permitted
The same sbatch command without --exclusive produces the expected result. It also works with --exclusive when using srun directly:
srun --ntasks=8 --cpus-per-task=10 --hint=nomultithread --qos=qos_cpu-dir --account=xyz --exclusive hostname
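For context on the size of the failing step, a quick arithmetic sketch (this only restates the request against the node geometry from comment #0; it is not a diagnosis of the --exclusive bug). With --hint=nomultithread each task CPU must land on a distinct physical core, and these nodes have 2 sockets x 20 cores = 40 cores:

```python
# Sketch: why an 8-task x 10-cpu nomultithread step spans two of these nodes.
# Node geometry taken from the slurm.conf line in comment #0:
# SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 (80 CPUs, 40 cores).
ntasks, cpus_per_task = 8, 10
cores_per_node = 2 * 20                 # 40 physical cores per node

cpus_needed = ntasks * cpus_per_task    # 80 CPUs requested
# With nomultithread, only one thread per core is usable, so each
# requested CPU consumes a whole core; ceil-divide to get the node count.
nodes_needed = -(-cpus_needed // cores_per_node)

assert cpus_needed == 80
assert nodes_needed == 2                # the step legitimately needs 2 nodes
```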
(In reply to Marcin Stolarek from comment #10)
> Created attachment 17314 [details]
> verification patch(v1)
>
> Could you please apply the attached patch and check how it works for you?
>
> cheers,
> Marcin
OK. I'm passing the patch to the review team. I think that the other issue you mentioned is something we're already working on, but unfortunately, the bug is not pubic. Could you please submit it as a separate ticket or do you want me to create it? (Due to technical limitation of bugzilla I'll be a reporter there but I can add you as CC). cheers, Marcin You can create it, thanks! (In reply to Marcin Stolarek from comment #14) > OK. I'm passing the patch to the review team. > I think that the other issue you mentioned is something we're already > working on, but unfortunately, the bug is not pubic. > > Could you please submit it as a separate ticket or do you want me to create > it? (Due to technical limitation of bugzilla I'll be a reporter there but I > can add you as CC). > > cheers, > Marcin Opened Bug 10627 to continue the work on the issue from comment 12. I'm closing the ticket now, since the patch got merged into our main repository: https://github.com/SchedMD/slurm/commit/5c7d447113310984ca431c782ccd82b2f3200eda cheers, Marcin *** Ticket 10948 has been marked as a duplicate of this ticket. *** |
Hi!

We have an issue with CPU/GPU affinity on Slurm 20.02.6 (patched for #9670 and #9724). When we submit the following job:

srun -A sos@gpu -n 1 -c 10 --gres=gpu:1 --hint=nomultithread ~/binding_mpi.exe

an error appears in the slurmctld log:

error: _block_sync_core_bitmap: b_min > nboards_nb (2 > 1) node:r7i3n7 core_bitmap:0-2,40-42

and the job starts on two nodes (although one is enough) and blocks:

srun: job 251 queued and waiting for resources
srun: job 251 has been allocated resources
[blocked]

Job information:

JobId=251 JobName=binding_mpi.exe
   UserId=user1(1001) GroupId=group1(1001) MCS_label=N/A
   Priority=161949 Nice=0 Account=group1@gpu QOS=qos_gpu-t3
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-12-17T16:41:24 EligibleTime=2020-12-17T16:41:24
   AccrueTime=2020-12-17T16:41:24
   StartTime=2020-12-17T16:41:32 EndTime=2020-12-17T16:51:32 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-17T16:41:32
   Partition=gpu_p13 AllocNode:Sid=jean-zay1:62428
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r7i3n[6-7] BatchHost=r7i3n6
   NumNodes=2 NumCPUs=12 NumTasks=1 CPUs/Task=10 ReqB:S:C:T=0:0:*:1
   TRES=cpu=12,mem=12G,node=2,billing=20,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=10 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/linkhome/idris/genidr/ssos250/binding_mpi.exe
   WorkDir=/home/user1
   Power=
   TresPerNode=gpu:1
   MailUser=user1 MailType=NONE

Node configuration:

NodeName=r7i3n6 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=191752

GRES configuration (as recommended by #10019):

NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[0-1] Cores=0-19
NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[2-3] Cores=20-39

This bug and the one from #10019 do not appear if we use one of the following GRES configurations:

* without cores:
NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[0-1]
NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[2-3]

* with logical cores:
NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[0-1] Cores=0-19,40-59
NodeName=r7i3n6,r7i3n7 Name=gpu File=/dev/nvidia[2-3] Cores=20-39,60-79

Could you help us resolve this issue?
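Michael's observation earlier in the thread, that the bitmap "should be cores, but it looks like it's CPUs", can be sanity-checked with simple arithmetic: these nodes have only 40 physical cores (2 sockets x 20 cores), so ids 40-42 in core_bitmap:0-2,40-42 cannot be core ids. A minimal sketch, assuming the common Linux numbering where CPU c and CPU c+40 are the two hyperthreads of physical core c (the helper name and the numbering assumption are ours, not Slurm code):

```python
# Sketch of the CPU -> physical-core mapping on r7i3n6/r7i3n7
# (2 sockets x 20 cores x 2 threads = 40 cores, 80 logical CPUs),
# under the assumed numbering: CPU c and CPU c+40 share core c.
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 20, 2
N_CORES = SOCKETS * CORES_PER_SOCKET          # 40 physical cores

def cpu_to_core(cpu: int) -> int:
    """Map a logical CPU id (0-79) to its physical core id (0-39)."""
    return cpu % N_CORES

# gres.conf's Cores= expects physical core ids, so 0-2 are valid cores...
assert [cpu_to_core(c) for c in (0, 1, 2)] == [0, 1, 2]
# ...while 40-42 only make sense as CPU ids: they are the second
# hyperthreads of cores 0-2, not cores on a second board.
assert [cpu_to_core(c) for c in (40, 41, 42)] == [0, 1, 2]
```

Under this reading, a bitmap of 0-2,40-42 interpreted as core ids overflows the 40-core range, which is consistent with the b_min > nboards_nb (2 > 1) error and with the "logical cores" workaround (Cores=0-19,40-59) appearing to behave.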