Ticket 13584

Summary: Same resource allocated to several "--exclusive" steps at the same time
Product: Slurm
Reporter: Hyacinthe Cartiaux <hyacinthe.cartiaux>
Component: Scheduling
Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
CC: hyacinthe.cartiaux, marshall
Version: 20.11.7
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11863
https://bugs.schedmd.com/show_bug.cgi?id=12462
Site: University of Luxembourg
Linux Distro: CentOS

Description Hyacinthe Cartiaux 2022-03-08 08:48:10 MST
Hello,

One of our users reported a resource allocation issue that I cannot explain, which occurs when mixing exclusive steps and overlapping steps.

Here is a minimal example to reproduce this issue:

#! /bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=7
#SBATCH --cpus-per-task 1
#SBATCH --time=0-00:30:00
#SBATCH --partition interactive
SRUN=(srun --exclusive --ntasks=1 --cpus-per-task=1)
PARALLEL=( \
        parallel \
        --delay .2 \
        --jobs "${SLURM_NTASKS}" \
        --joblog parallel.log \
        --line-buffer \
)
NTASKS=1000
"${PARALLEL[@]}" "${SRUN[@]}" "./exec.sh" {} :::: <(seq "${NTASKS}")



The script "exec.sh" called above displays the CPU affinity of every exec.sh process started via "srun --exclusive" that is running at that moment.

#! /bin/bash
# Print the CPU affinity of every running exec.sh instance.
echo "PID $$: Starting, CPU affinities:"
for pid in $(pgrep exec.sh) ; do
        taskset -cp "$pid"
done
# Sleep for a random duration between 2 and 6 seconds.
DELAY=$(( (RANDOM % 5) + 2 ))
echo "PID $$: Sleeping for ${DELAY}s"
sleep "$DELAY"


In the beginning, everything is fine, each process runs on a different cpu core:

PID 232353: Starting, CPU affinities:
pid 232217's current affinity list: 102
pid 232238's current affinity list: 59
pid 232259's current affinity list: 63
pid 232281's current affinity list: 67
pid 232304's current affinity list: 71
pid 232337's current affinity list: 75
pid 232353's current affinity list: 79
PID 232353: Sleeping for 2s
PID 232444: Starting, CPU affinities:
pid 232238's current affinity list: 59
pid 232259's current affinity list: 63
pid 232281's current affinity list: 67
pid 232304's current affinity list: 71
pid 232337's current affinity list: 75
pid 232353's current affinity list: 79
pid 232444's current affinity list: 102
PID 232444: Sleeping for 2s


As soon as we start one overlapping step, for example with this command used for debugging purposes (srun --jobid <JOB_ID> --overlap --gres=gpu:0 --pty bash -i), the CPUs allocated to the overlapping step start being allocated to multiple exclusive steps at the same time (CPU 102 in the following extract).


PID 234869: Starting, CPU affinities:
pid 234619's current affinity list: 79
pid 234661's current affinity list: 59
pid 234709's current affinity list: 67
pid 234784's current affinity list: 102
pid 234821's current affinity list: 75
pid 234844's current affinity list: 71
pid 234869's current affinity list: 102
PID 234869: Sleeping for 6s
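To cross-check what Slurm itself believes each step owns (as opposed to the process affinities printed by exec.sh), the step allocations can be inspected while the job runs. This is a sketch, not part of the original report, with <JOB_ID> standing for the job submitted above:

```shell
# List the currently running steps of the job.
squeue -s -j <JOB_ID>
# Show the per-step details, including which CPUs each step was given.
scontrol show step <JOB_ID>
```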


Is this expected behavior?

Thank you for your support,

Best regards,

Hyacinthe
Comment 2 Marshall Garey 2022-03-10 11:36:01 MST
I'm looking into this, though I haven't been able to reproduce it yet. I've also only tested on Slurm 21.08, so I will test on 20.11 to see if the problem exists there.
Comment 3 Marshall Garey 2022-03-10 11:37:31 MST
(In reply to Marshall Garey from comment #2)
> I'm looking into this, though I haven't been able to reproduce it yet. I've
> also only tested on Slurm 21.08, so I will test on 20.11 to see if the
> problem exists there.

Correction: I should have said "to see if I can reproduce the problem there" since you have seen this problem. This is unexpected behavior.

And thanks for your reproducer script.
Comment 4 Hyacinthe Cartiaux 2022-03-15 03:09:51 MDT
Thanks for having a look at this; the script was provided by one of our users.

If it is fixed by a new Slurm version, we will consider the upgrade later.
Comment 5 Marshall Garey 2022-04-18 16:38:18 MDT
I'm sorry for the delayed response.

We found some buggy and inconsistent behavior in how steps share CPUs (--overlap in 20.11/21.08; prior to 20.11, sharing was simply the default behavior). For example, if an --overlap step is submitted after an exclusive step (with the exclusive step still running), the overlap step could share CPUs with the exclusive step. But if the exclusive step is submitted after an overlap step (with the overlap step still running), the exclusive step could not share CPUs with the overlap step.
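That asymmetry can be demonstrated inside a running job allocation. The following is a sketch, not from the original report; it assumes a small CPU allocation and the 20.11/21.08 behavior:

```shell
#!/bin/bash
# Run inside an existing job allocation (e.g. via sbatch or salloc).

# Case 1: exclusive step first, then an --overlap step.
# On 20.11/21.08 the overlap step may land on the exclusive step's CPU.
srun --exclusive -n1 -c1 sleep 30 &
sleep 2
srun --overlap -n1 -c1 grep Cpus_allowed_list /proc/self/status

# Case 2: --overlap step first, then an exclusive step.
# Here the exclusive step could not share CPUs with the overlap step.
srun --overlap -n1 -c1 sleep 30 &
sleep 2
srun --exclusive -n1 -c1 grep Cpus_allowed_list /proc/self/status
wait
```

Comparing the printed Cpus_allowed_list values against the CPUs held by the sleeping steps shows whether the CPUs were shared in each ordering.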


We made some changes to --overlap for the 22.05 release. In 20.11/21.08, --overlap only allowed sharing CPUs. In 22.05, --overlap allows the step to share CPUs, memory, and GRES. In addition, resources allocated to an overlapping step no longer count towards the resources allocated in the job, meaning the overlapping step always shares its resources with all other steps (overlap or exclusive).
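With the 22.05 semantics, a debugging shell like the one in the original report can be attached without consuming the job's resources. A sketch, with <JOB_ID> standing for the target job:

```shell
# Attach an interactive debug step to a running job. In 22.05 this
# --overlap step shares CPUs, memory, and GRES with all other steps
# and does not count against the job's allocated resources.
srun --jobid=<JOB_ID> --overlap --pty bash -i
```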


This change makes debugging steps much easier to run, and should fix the bug that you reported.


See the following commits:
fe9f416ec2
8b00476873
5e446730c8


Unfortunately, we won't be able to backport this to 21.08 since it is a change in behavior. So you'll have to wait for 22.05 for these fixes.

Is there anything else I can help with for this ticket?

- Marshall
Comment 6 Hyacinthe Cartiaux 2022-04-26 07:10:19 MDT
Thank you very much for the explanation. I think this issue can be closed.
Comment 7 Jason Booth 2022-04-26 09:56:07 MDT
Resolving.