Ticket 11589

Summary: How to ignore allocation node count for steps
Product: Slurm
Reporter: Matt Ezell <ezellma>
Component: Scheduling
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---
CC: cinek, lyeager, tim, vergaravg
Version: 20.11.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11852
          https://bugs.schedmd.com/show_bug.cgi?id=12912
          https://bugs.schedmd.com/show_bug.cgi?id=14105
          https://support.schedmd.com/show_bug.cgi?id=20799
Site: ORNL-OLCF
Version Fixed: 21.08pre1

Description Matt Ezell 2021-05-10 19:02:15 MDT
We have at least two scenarios where the allocation node count and step node count are different, but we have to explicitly list the step node count to get the correct behavior.

We are using cons_tres with CR_PACK_NODES.
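
For reference, the relevant slurm.conf lines look roughly like this (a sketch; the CR_Core_Memory flag is illustrative, only select/cons_tres and CR_Pack_Nodes matter for this report):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes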

Scenario 1: an ensemble where a group of nodes is allocated, but multiple parallel steps are intended to use them.

ezy@login1:~> salloc -N4 -A STF002 -t 30:00
salloc: Granted job allocation 45957
salloc: Waiting for resource configuration
salloc: Nodes spock[13-16] are ready for job
ezy@spock13:~> srun -n1 hostname
srun: Warning: can't run 1 processes on 4 nodes, setting nnodes to 1
spock13
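
The workaround today is to explicitly list the step node count; roughly (an illustrative re-run, not captured from the same session):

ezy@spock13:~> srun -N1 -n1 hostname
spock13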

Additionally, every step must place at least one task on each node of the allocation, and with CR_PACK_NODES this results in a weird distribution:

ezy@spock13:~> srun -n10 hostname | sort |uniq -c
      7 spock13
      1 spock14
      1 spock15
      1 spock16

It packed the tasks onto the first node, but still had to put one task on each of the other nodes (ideally all 10 processes would have landed on the first node).

Scenario 2: users allocate "extra" nodes so they can survive node failures and restart during the same allocation (avoid having to wait in the queue again).

Without explicitly listing a node count, Slurm will spread the step across all the nodes.
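
What users end up doing instead looks roughly like this (a sketch; the node counts, application name, and failed-node placeholder are hypothetical):

salloc -N5 -A STF002 -t 30:00                  # 4 nodes needed plus 1 spare
srun -N4 -n16 ./app                            # node count must be listed explicitly
srun -N4 -n16 --exclude=<failed-node> ./app    # relaunch on the survivors after a failure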



Even if I unset SLURM_NNODES and SLURM_JOB_NUM_NODES it still seems to attempt to use all the nodes. Due to Bug #11494 I can't reset the node count to NO_VAL in cli_filter.

Is there any way to avoid having the allocation node count "bleed over" to the step allocation?
Comment 6 Marcin Stolarek 2021-06-18 02:47:33 MDT
Matt,

I wanted to follow up with you on the progress here. I have a behavior-changing patch that I'm passing to our QA now.

The patch removes the srun-side enforcement of the minimum requested node count for a step inside an allocation when CR_Pack_Nodes is set. The code only takes effect when SLURM_JOB_NUM_NODES is not set; as you know, that variable is normally treated as an input option for srun.

I'll keep you posted on the review progress.

cheers,
Marcin
Comment 12 Marcin Stolarek 2021-07-20 01:11:17 MDT
Matt,

The behavior-changing commit is now in our master branch and will be released in Slurm 21.08[1].
I'm closing the bug report as fixed now - should you have any questions please reopen.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/e942cadb345cc2acbfb2b7155f40eead93b64b43
Comment 13 Matt Ezell 2021-07-20 09:20:34 MDT
So users will still need to manually unset SLURM_JOB_NUM_NODES *and* SLURM_NNODES to actually see a behavior change?  I'm glad that it's now possible, but it's still going to cause some issues here when users forget to (or don't know to) unset those.

Based on the name, I would not expect SLURM_JOB_NUM_NODES to impact node count for a step allocation (only a job allocation). But I guess it was added as an alias to replace SLURM_NNODES, which sounds like it should impact both.

I guess there's no sane way to make srun ignore those environment variables.
Comment 15 Marcin Stolarek 2021-07-21 01:14:09 MDT
Matt,

>So users will still need to manually unset SLURM_JOB_NUM_NODES *and* SLURM_NNODES to actually see a behavior change?

No - that's not the case after the patch. If CR_Pack_Nodes is enabled on 21.08, you'll see:
># salloc -N2 --exclusive
>salloc: Pending job allocation 7
>salloc: job 7 queued and waiting for resources
>salloc: job 7 has been allocated resources
>salloc: Granted job allocation 7
>[salloc] bash-4.2# srun -n3  /bin/bash -c 'echo $SLURMD_NODENAME'
>test01
>test01
>test01

or, if the user explicitly sets the number of nodes for the step:
>[salloc] bash-4.2# srun -n3 -N2 /bin/bash -c 'echo $SLURMD_NODENAME'
>test01
>test01
>test02

or, if the step needs more CPUs than one node can provide (for instance because of the default of 1 CPU per task):
>[salloc] bash-4.2# srun -n 64 /bin/bash -c 'echo $SLURMD_NODENAME'   | sort | uniq -c
>     64 test01
>[salloc] bash-4.2# srun -n 67 /bin/bash -c 'echo $SLURMD_NODENAME'   | sort | uniq -c
>     64 test01
>      3 test02
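
In other words, assuming each of these nodes exposes 64 CPUs to the step (which is what the -n 64 case suggests):

  64 tasks * 1 CPU/task = 64 CPUs -> everything packs onto test01
  67 tasks * 1 CPU/task = 67 CPUs -> 64 fill test01 and the remaining 3 spill onto test02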

Let me know what you think.

cheers,
Marcin
Comment 16 Matt Ezell 2021-07-21 05:53:46 MDT
(In reply to Marcin Stolarek from comment #15)
> Let me know what you think.

This sounds perfect. I misunderstood from your previous comment:

> The code is only effective when SLURM_JOB_NUM_NODES is not set

and assumed it was being set by the batch job itself.

Thanks!
Comment 17 Marcin Stolarek 2021-07-21 08:01:35 MDT
>and assumed it was being set by the batch job itself.

Ah, sorry - that's on me for not letting you know that the final approach ended up different from what I originally had in mind.

cheers,
Marcin