Good afternoon,

I should probably say "probably not working if " - but hear me out.

First, on an x86_64 with ThreadsPerCore=1:

[plazonic@adroit4 ~]$ cat cputest.slurm
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
wait

[plazonic@adroit4 ~]$ cat set.sh
#!/bin/bash
echo HOST:$SLURMD_NODENAME $SLURM_STEP_ID `taskset -c -p $$` `hostname`
sleep 30s

[plazonic@adroit4 ~]$ sbatch cputest.slurm
Submitted batch job 699064
[plazonic@adroit4 ~]$ cat slurm-699064.out
HOST:adroit-13 0 pid 156701's current affinity list: 2,4,6,8 adroit-13
HOST:adroit-13 1 pid 156708's current affinity list: 10,12,14,16 adroit-13
HOST:adroit-13 2 pid 156715's current affinity list: 18,20,22,24 adroit-13
HOST:adroit-13 3 pid 156721's current affinity list: 1,26,28,30 adroit-13

And now the same test (results only) on the power9 cluster with ThreadsPerCore=4:

[plazonic@traverse ~]$ cat slurm-48414.out
HOST:traverse-k01g2 0 pid 65359's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 1 pid 65361's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 3 pid 65369's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 2 pid 65371's current affinity list: 64-67 traverse-k01g2

Nothing changes for -c 1/2/4/8 - only a single set of CPUs is ever used. If one runs just srun set.sh (instead of 4 srun --exclusive's), it all works correctly:

HOST:traverse-k01g2 0 pid 65709's current affinity list: 76-79 traverse-k01g2
HOST:traverse-k01g2 0 pid 65706's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 0 pid 65708's current affinity list: 72-75 traverse-k01g2
HOST:traverse-k01g2 0 pid 65707's current affinity list: 68-71 traverse-k01g2

We've had other trouble due to ThreadsPerCore=4, so I am tempted to blame it on that - but you tell me...
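As a side note, the overlap in the transcripts above can be checked mechanically instead of by eye. The helper below is a hypothetical sketch (not part of the job scripts in this report): it reads a step output file on stdin and reports whether any two steps were pinned to the same CPU set.

```shell
#!/bin/bash
# detect-overlap.sh (hypothetical helper): read a slurm-<jobid>.out file on
# stdin and flag job steps whose "current affinity list" repeats, i.e. steps
# that were all pinned to the same CPUs.
grep -o "affinity list: [0-9,-]*" |
awk '{
        # $3 is the CPU list, e.g. "64-67" or "2,4,6,8"
        if (seen[$3]++) dup++
     }
     END {
        if (dup) { print "OVERLAP: " dup " step(s) share a CPU set"; exit 1 }
        print "OK: all steps use distinct CPU sets"
     }'
```

Run as `./detect-overlap.sh < slurm-48414.out`; on the power9 output above it reports 3 overlapping steps, while the adroit output passes.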
If you let me know what you need, I'll upload our config and/or logs, but it might be worth seeing if this reproduces on other systems with ThreadsPerCore>1 before we dig deeper into our configs.

Thanks!
Josko
Hi Josko,

I'm not 100% sure whether I'm able to reproduce it. Could you please rerun your tests with a script like the one below executed as a step? Please note that you'll have to replace '/sys/fs/cgroup/cpuset' with the location of your cgroup cpuset filesystem.

# cat /tmp/set.sh
#!/bin/bash
#exec > ./slurm-${SLURM_STEPID}
date
sleep 5
echo "===SHOW MY JOB==="
scontrol show job ${SLURM_JOBID}
echo "===SHOW MY STEPS==="
scontrol show step ${SLURM_JOB_ID}
echo "===SHOW MY CGROUP==="
/bin/cat /sys/fs/cgroup/cpuset/slurm_${SLURMD_NODENAME}/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/step_${SLURM_STEPID}/cpuset.cpus
echo '===SHOW MY TASKSET==='
taskset -cp $$
date
echo '===DONE==='

Additionally, please add the -o option to your srun calls in cputest.slurm, like here:

>srun -o slurm-%J.out -l --exclusive -n1 /tmp/testStep &

This will label each line with the task ID (in this case always 0) and create a separate output file per job step.

cheers,
Marcin
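Putting the pieces together, the reworked batch script would look roughly like the sketch below. This is an assumption based on the reproducer from the initial comment plus the -o/-l options requested above; the /tmp/set.sh path refers to the diagnostic step script and may differ on your systems.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
# -o slurm-%J.out writes one file per step (%J expands to jobid.stepid);
# -l prefixes every output line with the task ID
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
wait
```

This is a job-script fragment, so it can only be exercised on a cluster with sbatch available.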
Created attachment 13127 [details] Result of test task 0
Created attachment 13128 [details] Result of test task 1
Created attachment 13129 [details] Result of test task 2
Created attachment 13130 [details] Result of test task 3
Just attached the results. The process affinity list is still only 8 CPUs and does not change between steps.
Josko,

The result is quite surprising. Did you run the same commands as in the initial comment, or did you change --ntasks-per-node=4, or perhaps change -c to 8? I think that is what happened, but I'd like to be 100% sure.

cheers,
Marcin
Oh, sorry - I had -c 8 there... It doesn't invalidate the test, though: the same CPUs are still being allocated to all exclusive tasks.
Josko,

I can reproduce it, but to be sure that we're on the same page in terms of the code path involved, could you please share the TaskPluginParam configuration parameter from both clusters?

cheers,
Marcin
Hi there,

good one:
[root@adroit4 ~]# scontrol show config | grep TaskPlugin
TaskPlugin = affinity,cgroup
TaskPluginParam = (null type)

"bad" one:
[root@traverse ~]# scontrol show config | grep TaskPlugin
TaskPlugin = affinity,cgroup
TaskPluginParam = (null type)

Thanks,
Josko
Josko,

I have a patch that should fix the issue; however, it hasn't passed our QA process yet. Would you be interested in applying it locally before QA is completed?

An alternative workaround that should work pretty well is to add the --cpu-bind=none option to your srun commands. This disables task affinity, which is what is limiting your steps to the same core, and lets the operating system assign resources - for compute-intensive processes this should work quite well.

Let me know how you'd like to continue.

cheers,
Marcin
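Applied to the reproducer from the initial comment, the workaround would look like the fragment below. This is a sketch, not a tested script: with --cpu-bind=none Slurm stops pinning the step tasks, so placement is left entirely to the kernel scheduler.

```shell
# cputest.slurm step section with the workaround applied:
# --cpu-bind=none disables Slurm's task affinity for each exclusive step,
# so the OS scheduler spreads the tasks instead of the buggy pinning.
srun --cpu-bind=none --exclusive -n1 set.sh &
srun --cpu-bind=none --exclusive -n1 set.sh &
srun --cpu-bind=none --exclusive -n1 set.sh &
srun --cpu-bind=none --exclusive -n1 set.sh &
wait
```

Note that this trades correct spreading for the loss of any binding at all, which may matter for NUMA-sensitive workloads.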
If it is not too complex, I should be able to add it to our build of Slurm and test it. Thanks.
Created attachment 13340 [details]
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Josko,

The attached patch should apply cleanly on top of 19.05. As mentioned before, it has not yet passed SchedMD QA and is not yet scheduled for release, but it is passing our automated regression tests without an issue. Your feedback will be very much appreciated.

cheers,
Marcin
Josko,

Were you able to apply the patch and verify whether it works for you?

cheers,
Marcin
Josko,

Did you have a chance to apply the patch and verify whether it works for you?

cheers,
Marcin
Comment on attachment 13340 [details]
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Josko,

The patch is undergoing review. Please don't apply it for now; we should get back to you with a final solution soon.

cheers,
Marcin
Josko,

The fix for the bug was merged and will be available in slurm-19.05.7[1].

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/9028d1d49d551ff26e92e3039274bdfab4fc5c80
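For sites that cannot wait for the 19.05.7 release, one way to pick up the fix early is to cherry-pick the referenced commit onto a local 19.05 source tree. This is a sketch under the assumption that you build Slurm from a git checkout; rebuild and packaging steps depend on your local setup.

```shell
# In a clone of https://github.com/SchedMD/slurm
git checkout slurm-19.05
git cherry-pick 9028d1d49d551ff26e92e3039274bdfab4fc5c80
# then rebuild and redeploy slurmd/slurmctld as usual for your site
```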