Ticket 11687

Summary: Possible srun regression in 20.11
Product: Slurm Reporter: Jonas Stare <jonst>
Component: slurmd Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: rar, tripiana
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11357
Site: SNIC
SNIC sites: NSC
Machine Name: tetralith.nsc.liu.se
Attachments: slurm.conf
whereami.c
Logs from slurmd on nodes and slurmctld
Logs from sbatch and steps
job_submit.lua

Description Jonas Stare 2021-05-24 08:08:51 MDT
NOTE: we have set this to the highest priority due to the short time remaining before our planned upgrade to Slurm 20.11 (scheduled for Wednesday 08:00 CET). If you can provide a quick answer to our primary question “is this a bug or a feature”, you may then lower the priority of the ticket.

In Slurm 20.02.6, the following job script works as intended (Slurm makes sure that only one copy of “someapp” runs per core), but in Slurm 20.11.5 only one copy of “someapp” runs concurrently per node.

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive
# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -N1 -n1 /somedir/someapp > ./out-${i}.log &
done
wait

Is this a regression, or an intentional change in behaviour in Slurm 20.11? If it is an intentional change, how should the above script be modified for 20.11?

If it is a regression, we will likely delay the planned Slurm upgrade until a fix is available. If it is an intentional change, we will instead have to educate our users on how to run on Slurm 20.11. And that is the reason we are looking for a quick reply from you.

The behaviour does not match our interpretation of the srun man page for 20.11. The man page suggests that --exclusive (which is the default) should ensure one “someapp” runs per core, and that --overlap would result in all 256 instances of “someapp” running concurrently.

We have tried various combinations of --exclusive, --overlap, --whole, --mem-per-node=0, … without finding anything that restores the previous behaviour.
Comment 1 Carlos Tripiana Montes 2021-05-24 08:33:38 MDT
Hi Jonas,

Please add:

#SBATCH -c 1

and --exact for srun.
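
Applied to the script from your description, that would look something like this (a sketch; the app path and the 256-iteration loop are just taken from your report):

#!/bin/bash
#SBATCH -N 2
#SBATCH -c 1
#SBATCH --exclusive
# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -N1 -n1 --exact /somedir/someapp > ./out-${i}.log &
done
wait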

In my tests, this combination works.

Tell us if this is working for you as well.

Thanks.
Comment 2 Marshall Garey 2021-05-24 09:17:10 MDT
We've had lots of questions about the change to srun in 20.11. This is a duplicate of bug 10383 comment 63, and of bugs 11644, 10769, 11448, and probably others. Read bug 10383 comment 63 first, and also check RELEASE_NOTES for changes to srun, --exact, and --overlap.
Comment 3 Marshall Garey 2021-05-24 09:18:40 MDT
(And as Carlos mentioned in comment 1, exclusive means steps don't share resources. So definitely follow Carlos's hint.)
Comment 4 Jonas Stare 2021-05-25 04:17:31 MDT
Thank you very much for the quick replies.

We got things working by doing this (works with or without -c1):

#!/bin/bash
#SBATCH -N2
#SBATCH --exclusive

for i in $(seq 1 256);do
    srun -N1 -n1 --overlap --exact /somedir/someapp > ./out-${i}.log &
done
wait
Comment 5 Carlos Tripiana Montes 2021-05-25 05:56:10 MDT
Sounds good too.

Let's close the bug with info given, if you agree.

Regards.
Comment 6 Rickard Armiento 2021-05-25 06:18:47 MDT
(I'm working with Jonas and have also looked at this issue.)

We are still unsure why the following snippet without --overlap works the way it does:

#!/bin/bash
#SBATCH -N2
#SBATCH -c 1
#SBATCH --exclusive

for i in $(seq 1 256);do
    srun -N1 -n1 --exact /somedir/someapp > ./out-${i}.log &
done
wait

This starts 32 processes in parallel on the first node but only ONE on the second node, and then waits until processes start finishing. However, adding the flag '--overlap' makes it start 32 processes on each node.

Why is the --overlap flag needed here? Why are the two nodes treated differently without that flag?
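
(In case it helps with reproducing this: one quick way to see where the steps land while the job is running is something like

squeue -s -j <jobid>

which lists each running step together with the node it is running on; <jobid> here is just a placeholder for the actual job ID.)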
Comment 7 Carlos Tripiana Montes 2021-05-25 06:29:24 MDT
I guess it's because my test did not run 2 times the number of cores of the 2 nodes, but just the number of cores of the 2 nodes.

But anyway, I've tested doubling the number of srun steps. Then it gives me 32 per node (64 total), then waits. Not 32 on one node and just 1 alone on the other.

Regarding --overlap, it's needed because you launch twice as many steps as there are cores. Thus, the steps overlap on cores.

Regards.
Comment 8 Rickard Armiento 2021-05-25 06:56:01 MDT
> But anyway, I've tested doubling the number of srun steps. Then it gives me
> 32 per node (64 total), then waits. Not 32 on one node and just 1 alone on the other.

Thanks for running these tests! Then I don't think the behavior we see is what is intended. We indeed see 32 processes on one node and 1 on the other, and then it waits.

> Regarding --overlap, it's needed because you launch twice as many steps as
> there are cores. Thus, the steps overlap on cores.

Perhaps I misunderstand, but we are after the behavior you state you see with just --exact (64 processes, 32 on each node, then waiting). That would be no overlap, right? For some reason we get this desired behavior only when we use the combination of both --overlap and --exact, which seems a bit odd.
Comment 9 Carlos Tripiana Montes 2021-05-25 09:39:39 MDT
Ah, OK. I misunderstood your case. I guessed you wanted all 256 running at the same time.

I made a mistake reading the logs. My test does the same as yours (256 sruns in a loop), and it executes 32+32 with --exact plus --overlap.

But in my case it does the same with --exact alone, without --overlap. Also, in my case -c1 does not add anything, but in both cases that is probably just because of my config.

Please send me your slurm.conf. The answer to the difference must be there, I guess. Or maybe you use job_submit.lua or something similar that rewrites some default values for flags when they are not explicitly set.

Thanks.
Comment 10 Jonas Stare 2021-05-26 03:13:26 MDT
Created attachment 19657 [details]
slurm.conf
Comment 11 Jonas Stare 2021-05-26 03:17:59 MDT
(In reply to Carlos Tripiana Montes from comment #9)
> I guess. Or maybe you use job_submit.lua or similar, in the sense you
> rewrite some default values for flags when those are not explicitly used.

Yes, in our job_submit.lua we set job_desc.shared = 0 if the user asked for more than one node (which should be the same as adding --exclusive).

But apart from that we shouldn't be changing anything related to this.
Comment 12 Carlos Tripiana Montes 2021-05-26 03:53:32 MDT
Created attachment 19658 [details]
whereami.c

Hey Jonas,

Matching your config, I'm unable to reproduce the 32 steps on "node 1" plus 1 step on "node 2" at the same time, with the others waiting.

I'm getting 32 steps on "node 1" plus 32 steps on "node 2", with the rest waiting.

Please use this script:

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive
# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -vvv -N1 -n1 --exact bash -c 'whereami;sleep 60' &> ./out-${i}.log &
done
wait

and submit it to the queue as follows:

sbatch -vvv [script]

Post the sbatch output, plus all the slurm-[jobid].out and out-[step].log files.

Additionally post the slurmctld.log and the slurmd.log (for the 2 nodes of the job), and the job_submit.lua.

I'm not sure, but I suspect it's job_submit.lua that is doing something here.

Please see the attached whereami.c, and compile it first for the test.
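
Assuming a standard build environment and that whereami.c has no special dependencies, compiling it is just something like:

gcc -O2 -o whereami whereami.c

Then make sure the resulting whereami binary is on the PATH of the compute nodes, or adjust the srun line above to call it by its full path.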

Regards.
Comment 13 Jonas Stare 2021-05-26 04:47:00 MDT
Created attachment 19659 [details]
Logs from slurmd on nodes and slurmctld
Comment 14 Jonas Stare 2021-05-26 04:48:58 MDT
Created attachment 19660 [details]
Logs from sbatch and steps
Comment 15 Jonas Stare 2021-05-26 04:54:16 MDT
Created attachment 19661 [details]
job_submit.lua

This script is a bit hairy, but the function that might be interesting is _set_exclusive_if_needed(). The rest of the code _should_ only affect jobs that request special reservations/QOS or nodes with memory that differs from DefMemPerCPU. (We should probably have used partitions for this, but the solution we have exists for historical reasons.)
Comment 16 Carlos Tripiana Montes 2021-05-27 07:32:23 MDT
Okay, okay... found it!

The 20.11 branch, at the head commit, is not failing.

20.11.5 does fail.

The good news is that if you update, you just need --exact without --overlap. That makes more sense, and is what you were expecting based on the documentation.

The bad news is that if you don't update, you still need --overlap.
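
For reference, on an updated 20.11 the loop from the description should then need nothing more than --exact, something like this sketch:

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive
# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -N1 -n1 --exact /somedir/someapp > ./out-${i}.log &
done
wait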

I'm going to look up the commit ID and the related bug ID (if any) belonging to the fix, and send them back to you as soon as I've spotted it.

Give me a while to dig into the git history.

Regards.
Comment 18 Jonas Stare 2021-05-28 04:14:18 MDT
Thank you very very much for the quick help.

I've tested a 20.11.7 version and verified that it works for us as well.

You can close the ticket.
Comment 19 Carlos Tripiana Montes 2021-05-28 05:13:13 MDT
Hi Jonas,

That's awesome news. Thank you very much for the feedback.

Let's close the bug.

Regards.