Ticket 11857 - Regression in 20.11.7 for back-to-back srun's
Summary: Regression in 20.11.7 for back-to-back srun's
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.7
Hardware: Linux
OS: Linux
Severity: 2 - High Impact
Assignee: Marshall Garey
QA Contact:
URL:
Duplicates: 11783
Depends on:
Blocks:
 
Reported: 2021-06-17 10:12 MDT by Luke Yeager
Modified: 2021-07-20 09:30 MDT
CC List: 9 users

See Also:
Site: NVIDIA (PSLA)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11.8, 21.08.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Luke Yeager 2021-06-17 10:12:22 MDT
We have a SPANK plugin which takes a bit of time to clean up after itself in slurm_spank_exit() with context=remote. In 20.11.7, running 'srun -N1 true ; srun -N2 true' back-to-back inside an allocation fails.

Minimal spank plugin:

> #include <slurm/spank.h>
> #include <unistd.h>
> #define PLUGIN_NAME test
> SPANK_PLUGIN(PLUGIN_NAME, 1);
> /* Simulate slow cleanup on the compute node (remote context) */
> int slurm_spank_exit(spank_t sp, int ac, char **av) {
>         if (spank_context() == S_CTX_REMOTE) sleep(1);
>         return (0);
> }
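In case it helps with reproducing, this is roughly how we build and load the plugin on our test box (the source file name, .so path, and plugstack.conf location below are just what we use locally, not anything prescribed):

> # Build the reproducer plugin as a shared object (requires the Slurm headers)
> gcc -shared -fPIC -o /usr/lib64/slurm/test_spank.so test_spank.c
> # Load it on all nodes; the path and config location are site-specific
> echo 'optional /usr/lib64/slurm/test_spank.so' >> /etc/slurm/plugstack.conf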
Command that exhibits the buggy behavior:

> $ salloc -N2 bash -ex <<< 'srun -N1 hostname ; srun -N2 hostname'
> salloc: Granted job allocation 18
> salloc: Waiting for resource configuration
> salloc: Nodes node-[1-2] are ready for job
> + srun -N1 hostname
> lyeager-dt
> + srun -N2 hostname
> srun: error: Unable to create step for job 18: Requested node configuration is not available
> salloc: Relinquishing job allocation 18

Felix, my colleague, did a git bisect and found the culprit: https://github.com/SchedMD/slurm/commit/6a2c99edbf
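For anyone curious, the bisect was roughly the standard procedure between the two release tags (tag names here are assumed from the upstream repo), rebuilding and rerunning the salloc reproducer at each step:

> git bisect start
> git bisect bad slurm-20-11-7-1    # 20.11.7: reproducer fails
> git bisect good slurm-20-11-6-1   # 20.11.6: reproducer passes
> # ...build, install, and rerun the reproducer until git bisect reports the first bad commit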

This is a pretty serious issue for us.
Comment 4 Marshall Garey 2021-06-17 14:02:22 MDT
Thanks for reporting this, and nice job using git bisect to find the commit that introduced it! That certainly helped me figure out what was happening faster.

I can reproduce this and I believe I have a fix. I'm submitting my patch to our review queue. Would you like me to make the patch available for you to test?

I discovered this bug can happen even without a SPANK plugin, with a simple job like this:

#!/bin/bash
#SBATCH -N2
srun -l -N1 bash -c 'hostname; sleep 10' &
sleep 1
srun -l -N2 bash -c 'hostname; sleep 10' &
wait


Before commit 6a2c99edbf96e, the second step wasn't rejected; instead, it incorrectly ran in parallel with the first step, overallocating CPUs. So commit 6a2c99edbf actually fixed that bug (in addition to the bug it was originally meant to fix), but unfortunately it also introduced the regression you reported here.
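For what it's worth, on an affected version you can see the two steps overlapping (and the CPUs being overallocated) with sacct after the job finishes; for example, with the job id from your reproducer:

# Overlapping Start/End windows for steps 0 and 1 show they ran in parallel
sacct -j 18 --format=JobID,JobName,Start,End,AllocCPUS,State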
Comment 8 Luke Yeager 2021-06-17 14:13:50 MDT
Thanks for the quick confirmation. And it's good to know that this will affect sites and jobs which don't use spank plugins, too. I think we will probably just sit at 20.11.6 until 20.11.8 is released rather than patching 20.11.7. But it would be nice to have the patch public for the sake of other sites who might want to patch, I think. As you know, many sites need to be at 20.11.7 because of the CVE fix.
Comment 9 Marshall Garey 2021-06-17 14:16:03 MDT
(In reply to Luke Yeager from comment #8)
> Thanks for the quick confirmation. And it's good to know that this will
> affect sites and jobs which don't use spank plugins, too. I think we will
> probably just sit at 20.11.6 until 20.11.8 is released rather than patching
> 20.11.7.

Sounds good.


> But it would be nice to have the patch public for the sake of other
> sites who might want to patch, I think. As you know, many sites need to be
> at 20.11.7 because of the CVE fix.

As soon as we push the fix I will post the commit here. If someone wants the patch before it's pushed, just make a comment here and I will make it public.
Comment 17 Marshall Garey 2021-06-18 16:24:48 MDT
Hey Luke,

This turned out to be a little more complicated than I initially thought, but we got it figured out. We've fixed this and other issues in the following commits:

48009cbd97 Clear allocated cpus for running steps in a job before handling requested nodes on new step.
05f5abda9f Don't reject a step if not enough nodes are available.
49b1b2681e Don't reject a step if it requests a node that is allocated.

These will be in 20.11.8. Thanks for reporting this!
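Once you're on 20.11.8, the reproducer from the description should be a quick way to confirm the fix:

# Both steps should now complete with no "Requested node configuration is not available" error
salloc -N2 bash -ex <<< 'srun -N1 hostname ; srun -N2 hostname'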

Closing this bug as resolved/fixed.
Comment 18 Marcin Stolarek 2021-06-24 03:31:11 MDT
*** Ticket 11783 has been marked as a duplicate of this ticket. ***
Comment 19 Luke Yeager 2021-07-20 09:30:38 MDT
I've verified the fix in 20.11.8 - thanks.