Created attachment 12344 [details]
lstopo output

We have an AMD Rome compute node with 2 sockets, 8 NUMA nodes, 128 cores, and 256 threads. When we configure the node with Sockets=2 CoresPerSocket=64 ThreadsPerCore=2, we get these log messages from slurmd:

[2019-11-04T17:51:21.532] Considering each NUMA node as a socket
[2019-11-04T17:51:21.533] error: You are using cons_res or gang scheduling with Fastschedule=0 and node configuration differs from hardware. The node configuration used will be what is in the slurm.conf because of the bitmaps the slurmctld must create before the slurmd registers.
   CPUs=256:256(hw) Boards=1:1(hw) SocketsPerBoard=2:8(hw) CoresPerSocket=64:16(hw) ThreadsPerCore=2:2(hw)

When we change the configuration to Sockets=8 CoresPerSocket=16 ThreadsPerCore=2, we no longer get this message, but --hint=nomultithread no longer works correctly; tasks are placed on both hardware threads of the same core:

nid001002:~ # srun --exclusive -w nid001002 --cpu-bind=v,threads -c 2 --hint=nomultithread -n 2 /lus/dgloe/xthi
cpu-bind=MASK - nid001002, task  0  0 [66084]: mask 0x100000000000000000000000000000001 set
cpu-bind=MASK - nid001002, task  1  1 [66085]: mask 0x200000000000000000000000000000002 set
Hello from rank 000 thread 000 on nid001002 (core affinity = 0,128)
Hello from rank 000 thread 001 on nid001002 (core affinity = 0,128)
Hello from rank 001 thread 000 on nid001002 (core affinity = 1,129)
Hello from rank 001 thread 001 on nid001002 (core affinity = 1,129)

Which configuration should we use? Is the nomultithread behavior a bug?
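The cpu-bind=MASK lines above report one hexadecimal affinity mask per task. As a quick sketch (the decode_mask helper below is hypothetical, not part of Slurm or xthi), such a mask can be expanded into CPU ids to confirm that each task was bound to the two SMT siblings of a single core:

```shell
# Hypothetical helper (not part of Slurm or xthi): list the CPU ids that are
# set in a hex affinity mask such as the ones printed by --cpu-bind=verbose.
# Pure bash arithmetic, one hex digit at a time, since a 256-PU mask
# overflows ordinary shell integer math.
decode_mask() {
  local hex=${1#0x} out="" len i j d bitpos
  len=${#hex}
  for ((i = 0; i < len; i++)); do
    d=$((16#${hex:i:1}))               # value of the i-th hex digit from the left
    for ((j = 3; j >= 0; j--)); do     # that digit covers bits 4*(len-1-i)+0..3
      if (((d >> j) & 1)); then
        bitpos=$((4 * (len - 1 - i) + j))
        out="$bitpos${out:+,$out}"     # prepend, keeping ascending order
      fi
    done
  done
  echo "$out"
}

decode_mask 0x100000000000000000000000000000001   # -> 0,128 (SMT siblings of core 0)
```

The second task's mask, 0x200000000000000000000000000000002, decodes to 1,129 the same way: both logical CPUs of core 1.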
David,

There are a few things here. First of all, you should not use FastSchedule=0. You should see the following error message as one of the first lines of slurmctld/slurmd output:

> error: FastSchedule will be removed in 20.02, as will the FastSchedule=0 functionality. Please consider removing this from your configuration now.

AMD Rome (like the rest of the Epyc line) is another AMD design with multiple NUMA nodes per socket. By default, Slurm considers every NUMA node a separate socket, since from the application's perspective that matters more than the number of physical packages.

I see that you have both --cpu-bind=v,threads and --hint=nomultithread. --hint is implemented as a modification of the same job description member as --cpu-bind, and the only --cpu-bind option allowed together with --hint is verbose. If options other than verbose are specified to --cpu-bind, the value of --hint is ignored.

Could you please re-run the test, dropping "threads" from --cpu-bind? Could you also elaborate on the goal you're trying to achieve with --hint=nomultithread?

cheers,
Marcin
I don't see any mention of --cpu-bind overriding --hint in the srun man page.

A user reports that --hint=nomultithread worked with --cpu-bind=threads in Slurm 18.08.7 (although on a different processor):

[saraha@osprey misc]$ srun --mpi=pmi2 -n2 --hint=nomultithread --cpu-bind=threads -c 2 ./xthi | sort
Hello from rank 000 thread 00 on prod-0065 (core affinity = 0)
Hello from rank 000 thread 01 on prod-0065 (core affinity = 1)
Hello from rank 001 thread 00 on prod-0065 (core affinity = 28)
Hello from rank 001 thread 01 on prod-0065 (core affinity = 29)

[saraha@osprey misc]$ srun --mpi=pmi2 -n2 --cpu-bind=threads -c 2 ./xthi | sort
Hello from rank 000 thread 00 on prod-0065 (core affinity = 0)
Hello from rank 000 thread 01 on prod-0065 (core affinity = 56)
Hello from rank 001 thread 00 on prod-0065 (core affinity = 28)
Hello from rank 001 thread 01 on prod-0065 (core affinity = 84)

What we're trying to do is disable use of SMT threads.
Created attachment 12354 [details]
error message + documentation about --hint ignore (v1)

David,

I can confirm there was a commit changing this behavior [1]; previously the final result depended on the order of salloc/sbatch/srun arguments. The attached patch adds an end-user error message informing that the option is ignored and documents this in the appropriate man pages. I'm targeting the additional end-user messages at 20.02, since this is a user-facing change; however, if you like, the patch should apply cleanly on top of 19.05.

Did you change slurm.conf between the tests in the bug report and comment 3? In comment 3, both attempts resulted in the allocation you wanted to achieve. This behavior depends on the number of "CPUs" you specify in the node configuration line of slurm.conf: if it's equal to the number of cores (not the total number of threads), or you have SelectTypeParameters=CR_ONE_TASK_PER_CORE set in your slurm.conf, cores are treated as one PU (processing unit). If you prefer to keep CPUs equal to the total number of SMT threads, you can achieve one PU per core by adding the --threads-per-core=1 option to your srun.

Like below (I have ThreadsPerCore=2; core 0 has PU0 and PU1, core 1 has PU2 and PU3, and so on):

1. Default, every thread treated as a PU:

# srun --mem=10 -n2 -c2 /bin/bash -c 'taskset -cp $$'
srun: job 899 queued and waiting for resources
srun: job 899 has been allocated resources
pid 29265's current affinity list: 2,3
pid 29264's current affinity list: 0,1

2. Added --threads-per-core=1, so each core is treated as one PU. Each task gets 4 threads = 2 cores; binding is done to all threads in the core:

# srun --mem=10 -n2 --threads-per-core=1 -c2 /bin/bash -c 'taskset -cp $$'
srun: job 898 queued and waiting for resources
srun: job 898 has been allocated resources
pid 29194's current affinity list: 0-3
pid 29195's current affinity list: 4-7

3. Added --cpu-bind=thread to bind processes to only one SMT per core:
# srun --mem=10 -n2 -c2 --threads-per-core=1 --cpu-bind=thread /bin/bash -c 'taskset -cp $$'
srun: job 912 queued and waiting for resources
srun: job 912 has been allocated resources
pid 7512's current affinity list: 4,6
pid 7511's current affinity list: 0,2

I hope you find this helpful.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/4cf80f20736b0ce81471582c56f2aca20ad8df27
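To put the two slurm.conf alternatives described above into concrete terms, a rough sketch follows (node name and hardware counts taken from this report; treat the exact lines as illustrative and check them against the slurm.conf man page for your version before using):

```
# Alternative 1 (sketch): set CPUs to the number of physical cores rather than
# the number of threads, so Slurm treats each core as a single PU.
NodeName=nid001002 CPUs=128 Sockets=8 CoresPerSocket=16 ThreadsPerCore=2

# Alternative 2 (sketch): keep CPUs equal to the thread count, but allocate
# whole cores to tasks cluster-wide.
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE
```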
David,

Did you try the attached patch? Is there anything else I can help you with regarding this issue?

cheers,
Marcin
Using --cpu-bind=thread --threads-per-core=1 meets our needs. I haven't tried out the patch.
David,

In 20.11 we've added client-side verification for mutually exclusive options that were previously silently ignored, and documented this in the sbatch/srun/salloc manual pages.

Cheers,
Marcin