Ticket 8540

Summary: srun --exclusive not working on system with ThreadsPerCore>1
Product: Slurm Reporter: Josko Plazonic <plazonic>
Component: slurmd    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10290
Site: Princeton (PICSciE)
Version Fixed: 19.05.7 20.02.3 20.11.0pre1 Target Release: ---
Attachments: Result of test task 0
Result of test task 1
Result of test task 2
Result of test task 3
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Description Josko Plazonic 2020-02-19 14:34:33 MST
Good afternoon,

I should probably say "probably not working if " - but hear me out.  First on an x86_64, ThreadsPerCore=1:

[plazonic@adroit4 ~]$ cat cputest.slurm 
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
wait
[plazonic@adroit4 ~]$ cat set.sh 
#!/bin/bash
echo HOST:$SLURMD_NODENAME $SLURM_STEP_ID `taskset -c -p $$` `hostname`
sleep 30s
[plazonic@adroit4 ~]$ sbatch cputest.slurm 
Submitted batch job 699064
[plazonic@adroit4 ~]$ cat slurm-699064.out
HOST:adroit-13 0 pid 156701's current affinity list: 2,4,6,8 adroit-13
HOST:adroit-13 1 pid 156708's current affinity list: 10,12,14,16 adroit-13
HOST:adroit-13 2 pid 156715's current affinity list: 18,20,22,24 adroit-13
HOST:adroit-13 3 pid 156721's current affinity list: 1,26,28,30 adroit-13

And now the same test, just result, on power9 cluster with ThreadsPerCore=4:

[plazonic@traverse ~]$ cat slurm-48414.out 
HOST:traverse-k01g2 0 pid 65359's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 1 pid 65361's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 3 pid 65369's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 2 pid 65371's current affinity list: 64-67 traverse-k01g2

Nothing changes for -c 1/2/4/8 - just one set of CPUs is used.

If one does just
srun set.sh
(instead of 4 srun --exclusive's), it all works correctly:
HOST:traverse-k01g2 0 pid 65709's current affinity list: 76-79 traverse-k01g2
HOST:traverse-k01g2 0 pid 65706's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 0 pid 65708's current affinity list: 72-75 traverse-k01g2
HOST:traverse-k01g2 0 pid 65707's current affinity list: 68-71 traverse-k01g2
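[Editorial note: for context, on an SMT4 POWER9 node logical CPU n typically belongs to physical core n // 4, assuming the common sequential thread numbering (verifiable with lscpu or /proc/cpuinfo). Under that assumption, the ranges above are four distinct cores, while the broken case binds every step to the four hardware threads of one core. A minimal sketch:]

```python
# Map logical CPUs to physical cores on an SMT4 system, assuming
# sequential thread numbering (thread t of core c is CPU c*4 + t).
THREADS_PER_CORE = 4

def core_of(cpu):
    return cpu // THREADS_PER_CORE

# The working run gave each step a distinct core (64-67, 68-71, 72-75, 76-79):
good = {step: list(range(64 + 4 * step, 68 + 4 * step)) for step in range(4)}
assert {core_of(c) for cpus in good.values() for c in cpus} == {16, 17, 18, 19}

# The broken run bound every step to CPUs 64-67, i.e. the same single core:
bad = {step: [64, 65, 66, 67] for step in range(4)}
assert all({core_of(c) for c in cpus} == {16} for cpus in bad.values())
```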

We've had other trouble due to ThreadsPerCore=4 so I am tempted to blame it on that but you tell me...

If you let me know what you need, I'll upload config and/or logs, but it might be worth checking whether this works on other systems with ThreadsPerCore>1 before we dig deeper into our configs.

Thanks!

Josko
Comment 1 Marcin Stolarek 2020-02-21 10:25:11 MST
Hi Josko,

I'm not 100% sure I'm able to reproduce it. Could you please rerun your tests with a script like the one below executed as a step?
Please note that you'll have to replace '/sys/fs/cgroup/cpuset' with the location of your cgroup cpuset filesystem.

# cat /tmp/set.sh 
#!/bin/bash 
#exec > ./slurm-${SLURM_STEPID}

date
sleep 5
echo "===SHOW MY JOB==="
scontrol show job ${SLURM_JOBID}
echo "===SHOW MY STEPS==="
scontrol show step ${SLURM_JOB_ID}
echo "===SHOW MY CGROUP==="
/bin/cat /sys/fs/cgroup/cpuset/slurm_${SLURMD_NODENAME}/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/step_${SLURM_STEPID}/cpuset.cpus
echo '===SHOW MY TASKSET==='
taskset -cp $$  
date
echo '===DONE==='


Additionally, please add the -o option to your srun calls in cputest.slurm, like here:
>srun -o slurm-%J.out -l --exclusive -n1 /tmp/testStep &

The -l option will label each line with the task ID (in this case always 0), and -o will create a separate file per job step.

cheers,
Marcin
Comment 2 Josko Plazonic 2020-02-21 15:12:16 MST
Created attachment 13127 [details]
Result of test task 0
Comment 3 Josko Plazonic 2020-02-21 15:12:34 MST
Created attachment 13128 [details]
Result of test task 1
Comment 4 Josko Plazonic 2020-02-21 15:12:51 MST
Created attachment 13129 [details]
Result of test task 2
Comment 5 Josko Plazonic 2020-02-21 15:13:05 MST
Created attachment 13130 [details]
Result of test task 3
Comment 6 Josko Plazonic 2020-02-21 15:14:03 MST
Just attached the results. The process affinity list is still only 8 CPUs and does not change.
Comment 7 Marcin Stolarek 2020-02-25 10:23:01 MST
Josko,

The result is quite surprising. Did you run the same commands as in the initial comment, or did you change something, for instance --ntasks-per-node=4 or the -c value to 8?

I think that this is what happened, but I'd like to be 100% sure.

cheers,
Marcin
Comment 8 Josko Plazonic 2020-02-25 11:57:42 MST
Oh, sorry, I had -c 8 there... It doesn't invalidate the test though; Slurm is still allocating the same CPUs to all exclusive tasks.
Comment 10 Marcin Stolarek 2020-03-03 09:45:04 MST
Josko,

I can reproduce it, but to be sure that we're on the same page in terms of the code path, could you please share the TaskPluginParam configuration parameter from both clusters?

cheers,
Marcin
Comment 11 Josko Plazonic 2020-03-03 09:46:39 MST
Hi there,

good one:
[root@adroit4 ~]# scontrol show config | grep TaskPlugin
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)

"bad" one:
[root@traverse ~]# scontrol show config | grep TaskPlugin
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)

Thanks,
Josko
Comment 14 Marcin Stolarek 2020-03-06 09:17:46 MST
Josko,

I have a patch that should fix the issue; however, it hasn't passed our QA process yet. Would you be interested in applying it locally before QA is completed?

An alternative workaround that should work pretty well is to add the --cpu-bind=none option to the srun commands. This disables the task affinity that is limiting your steps to the same core and lets the operating system assign resources; for compute-intensive processes this should work quite well.
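[Editorial note: a sketch of the workaround applied to the original reproducer script; --cpu-bind=none is the only change from the cputest.slurm shown in the description. This is a Slurm job-script fragment, not runnable outside a Slurm cluster.]

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
# Workaround: disable Slurm task affinity so the OS scheduler places
# the four exclusive steps instead of slurmd's per-step core picker.
srun --exclusive --cpu-bind=none -n1 set.sh &
srun --exclusive --cpu-bind=none -n1 set.sh &
srun --exclusive --cpu-bind=none -n1 set.sh &
srun --exclusive --cpu-bind=none -n1 set.sh &
wait
```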

Let me know how you'd like to continue.

cheers,
Marcin
Comment 15 Josko Plazonic 2020-03-09 13:29:28 MDT
If it is not too complex, I should be able to add it to our build of Slurm and test it.

Thanks.
Comment 16 Marcin Stolarek 2020-03-11 05:47:52 MDT
Created attachment 13340 [details]
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Josko,

The attached patch should apply cleanly on top of 19.05. As mentioned before, it hasn't passed SchedMD QA yet and is not yet scheduled for release, but it passes our automated regression tests without issue.

Your feedback will be very much appreciated.

cheers,
Marcin
Comment 17 Marcin Stolarek 2020-04-03 05:28:36 MDT
Josko,

Were you able to apply the patch and verify that it works for you?

cheers,
Marcin
Comment 18 Marcin Stolarek 2020-05-06 08:34:32 MDT
Josko,

Did you have a chance to apply the patch and verify that it works for you?

cheers,
Marcin
Comment 22 Marcin Stolarek 2020-05-08 08:40:39 MDT
Comment on attachment 13340 [details]
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Josko,

The patch is undergoing review. Please don't apply it now, we should get back to you with a final solution soon.

cheers,
Marcin
Comment 31 Marcin Stolarek 2020-05-20 00:41:52 MDT
Josko,

The fix for the bug was merged and will be available in slurm-19.05.7 [1].

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/9028d1d49d551ff26e92e3039274bdfab4fc5c80