Good afternoon,

I should probably say "probably not working if " - but hear me out.

First, on an x86_64 with ThreadsPerCore=1:

[plazonic@adroit4 ~]$ cat cputest.slurm
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
srun --exclusive -n1 set.sh &
wait

[plazonic@adroit4 ~]$ cat set.sh
#!/bin/bash
echo HOST:$SLURMD_NODENAME $SLURM_STEP_ID `taskset -c -p $$` `hostname`
sleep 30s

[plazonic@adroit4 ~]$ sbatch cputest.slurm
Submitted batch job 699064
[plazonic@adroit4 ~]$ cat slurm-699064.out
HOST:adroit-13 0 pid 156701's current affinity list: 2,4,6,8 adroit-13
HOST:adroit-13 1 pid 156708's current affinity list: 10,12,14,16 adroit-13
HOST:adroit-13 2 pid 156715's current affinity list: 18,20,22,24 adroit-13
HOST:adroit-13 3 pid 156721's current affinity list: 1,26,28,30 adroit-13

And now the same test (results only) on the power9 cluster with ThreadsPerCore=4:

[plazonic@traverse ~]$ cat slurm-48414.out
HOST:traverse-k01g2 0 pid 65359's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 1 pid 65361's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 3 pid 65369's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 2 pid 65371's current affinity list: 64-67 traverse-k01g2

Nothing changes for -c 1/2/4/8 - only a single set of CPUs is ever used. If one runs just srun set.sh (instead of 4 srun --exclusive's), it all works correctly:

HOST:traverse-k01g2 0 pid 65709's current affinity list: 76-79 traverse-k01g2
HOST:traverse-k01g2 0 pid 65706's current affinity list: 64-67 traverse-k01g2
HOST:traverse-k01g2 0 pid 65708's current affinity list: 72-75 traverse-k01g2
HOST:traverse-k01g2 0 pid 65707's current affinity list: 68-71 traverse-k01g2

We've had other trouble due to ThreadsPerCore=4, so I am tempted to blame it on that - but you tell me...
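As a side note, the overlap in the transcripts above can be checked mechanically instead of by eye. The helper below is a hypothetical sketch (not part of the job scripts in this report): it reads a step output file on stdin and reports whether any two steps were pinned to the same CPU set.

```shell
#!/bin/bash
# detect-overlap.sh (hypothetical helper): read a slurm-<jobid>.out file on
# stdin and flag job steps whose "current affinity list" repeats, i.e. steps
# that were all pinned to the same CPUs.
grep -o "affinity list: [0-9,-]*" |
awk '{
        # $3 is the CPU list, e.g. "64-67" or "2,4,6,8"
        if (seen[$3]++) dup++
     }
     END {
        if (dup) { print "OVERLAP: " dup " step(s) share a CPU set"; exit 1 }
        print "OK: all steps use distinct CPU sets"
     }'
```

Run as `./detect-overlap.sh < slurm-48414.out`; on the power9 output above it reports 3 overlapping steps, while the adroit output passes.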
If you let me know what you need, I'll upload our config and/or logs, but it might be worth seeing if this reproduces on other systems with ThreadsPerCore>1 before we dig deeper into our configs.

Thanks!
Josko
Hi Josko,

I'm not 100% sure whether I'm able to reproduce it. Could you please rerun your tests with a script like the one below executed as a step? Please note that you'll have to replace '/sys/fs/cgroup/cpuset' with the location of your cgroup cpuset filesystem.

# cat /tmp/set.sh
#!/bin/bash
#exec > ./slurm-${SLURM_STEPID}
date
sleep 5
echo "===SHOW MY JOB==="
scontrol show job ${SLURM_JOBID}
echo "===SHOW MY STEPS==="
scontrol show step ${SLURM_JOB_ID}
echo "===SHOW MY CGROUP==="
/bin/cat /sys/fs/cgroup/cpuset/slurm_${SLURMD_NODENAME}/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/step_${SLURM_STEPID}/cpuset.cpus
echo '===SHOW MY TASKSET==='
taskset -cp $$
date
echo '===DONE==='

Additionally, please add the -o option to your srun calls in cputest.slurm, like here:

>srun -o slurm-%J.out -l --exclusive -n1 /tmp/testStep &

This will label each line with the task ID (in this case always 0) and create a separate output file per job step.

cheers,
Marcin
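Putting the pieces together, the reworked batch script would look roughly like the sketch below. This is an assumption based on the reproducer from the initial comment plus the -o/-l options requested above; the /tmp/set.sh path refers to the diagnostic step script and may differ on your systems.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
# -o slurm-%J.out writes one file per step (%J expands to jobid.stepid);
# -l prefixes every output line with the task ID
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
srun -o slurm-%J.out -l --exclusive -n1 /tmp/set.sh &
wait
```

This is a job-script fragment, so it can only be exercised on a cluster with sbatch available.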
Created attachment 13127 [details] Result of test task 0
Created attachment 13128 [details] Result of test task 1
Created attachment 13129 [details] Result of test task 2
Created attachment 13130 [details] Result of test task 3
Just attached the results. The process affinity list is still only 8 CPUs and does not change between steps.
Josko,

The result is quite surprising. Did you run the same commands as in the initial comment, or did you change --ntasks-per-node=4, or perhaps change -c to 8? I think that is what happened, but I'd like to be 100% sure.

cheers,
Marcin
Oh, sorry - I had -c 8 there... It doesn't invalidate the test, though: the same CPUs are still being allocated to all exclusive tasks.
Josko,

I can reproduce it, but to be sure that we're on the same page in terms of the code path involved, could you please share the TaskPluginParam configuration parameter from both clusters?

cheers,
Marcin
Hi there,

good one:
[root@adroit4 ~]# scontrol show config | grep TaskPlugin
TaskPlugin = affinity,cgroup
TaskPluginParam = (null type)

"bad" one:
[root@traverse ~]# scontrol show config | grep TaskPlugin
TaskPlugin = affinity,cgroup
TaskPluginParam = (null type)

Thanks,
Josko
Josko,

I have a patch that should fix the issue; however, it hasn't passed our QA process yet. Would you be interested in applying it locally before QA is completed?

An alternative workaround that should work pretty well is to add the --cpu-bind=none option to your srun commands. This disables task affinity, which is what is limiting your steps to the same core, and lets the operating system assign resources - for compute-intensive processes this should work quite well.

Let me know how you'd like to continue.

cheers,
Marcin
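Applied to the reproducer from the initial comment, the workaround would look like the fragment below. This is a sketch, not a tested script: with --cpu-bind=none Slurm stops pinning the step tasks, so placement is left entirely to the kernel scheduler.

```shell
# cputest.slurm step section with the workaround applied:
# --cpu-bind=none disables Slurm's task affinity for each exclusive step,
# so the OS scheduler spreads the tasks instead of the buggy pinning.
srun --cpu-bind=none --exclusive -n1 set.sh &
srun --cpu-bind=none --exclusive -n1 set.sh &
srun --cpu-bind=none --exclusive -n1 set.sh &
srun --cpu-bind=none --exclusive -n1 set.sh &
wait
```

Note that this trades correct spreading for the loss of any binding at all, which may matter for NUMA-sensitive workloads.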
If it is not too complex, I should be able to add it to our build of Slurm and test it. Thanks.
Created attachment 13340 [details]
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Josko,

The attached patch should apply cleanly on top of 19.05. As mentioned before, it has not yet passed SchedMD QA and is not yet scheduled for release, but it is passing our automated regression tests without an issue. Your feedback will be very much appreciated.

cheers,
Marcin
Josko,

Were you able to apply the patch and verify whether it works for you?

cheers,
Marcin
Josko,

Did you have a chance to apply the patch and verify whether it works for you?

cheers,
Marcin
Comment on attachment 13340 [details]
fix _pick_step_cores for tasks_per_core > 1 for 19.05 (v1)

Josko,

The patch is undergoing review. Please don't apply it for now; we should get back to you with a final solution soon.

cheers,
Marcin
Josko,

The fix for the bug was merged and will be available in slurm-19.05.7[1].

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/9028d1d49d551ff26e92e3039274bdfab4fc5c80
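For sites that cannot wait for the 19.05.7 release, one way to pick up the fix early is to cherry-pick the referenced commit onto a local 19.05 source tree. This is a sketch under the assumption that you build Slurm from a git checkout; rebuild and packaging steps depend on your local setup.

```shell
# In a clone of https://github.com/SchedMD/slurm
git checkout slurm-19.05
git cherry-pick 9028d1d49d551ff26e92e3039274bdfab4fc5c80
# then rebuild and redeploy slurmd/slurmctld as usual for your site
```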