Ticket 9851 - Interaction between LLN and --ntasks-per-node
Summary: Interaction between LLN and --ntasks-per-node
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
 
Reported: 2020-09-17 15:38 MDT by Matt Ezell
Modified: 2020-10-21 01:01 MDT

Site: NOAA
NOAA Site: ORNL
Version Fixed: 20.11pre1


Description Matt Ezell 2020-09-17 15:38:52 MDT
We are using cons_res and have LLN set on our RDTN partition.

[root@es-slurm ~]# scontrol show part rdtn|grep LLN
   MaxNodes=1 MaxTime=16:00:00 MinNodes=0 LLN=YES MaxCPUsPerNode=UNLIMITED
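For context, LLN is a per-partition setting in slurm.conf. An illustrative fragment (node list and other values assumed for this sketch, not our exact configuration) consistent with the scontrol output above:

```
# Illustrative slurm.conf fragment (values assumed, not the site's actual file)
SelectType=select/cons_res
PartitionName=rdtn Nodes=dtn[01-16] LLN=YES MaxNodes=1 MaxTime=16:00:00 State=UP
```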

If someone submits with --ntasks-per-node=1, then all of their jobs seem to pile up on the same node:

Matthew.Ezell@gaea10:~/llnfail> sinfo --local -N -n dtn[01-16] -p rdtn -o '%10N %8T %C'
NODELIST   STATE    CPUS(A/I/O/T)
dtn01      mixed    3/125/0/128
dtn02      mixed    4/124/0/128
dtn03      drained  0/0/128/128
dtn04      mixed    3/125/0/128
dtn05      mixed    3/125/0/128
dtn06      mixed    4/124/0/128
dtn07      mixed    3/125/0/128
dtn08      mixed    3/125/0/128
dtn09      mixed    2/126/0/128
dtn10      mixed    3/125/0/128
dtn11      mixed    3/125/0/128
dtn12      mixed    2/126/0/128
dtn13      mixed    2/126/0/128
dtn14      mixed    3/125/0/128
dtn15      drained  0/0/128/128
dtn16      down*    0/0/128/128
Matthew.Ezell@gaea10:~/llnfail> for i in $(seq 1 20);do sbatch -p rdtn -Mes --ntasks-per-node=1 --wrap "hostname && sleep 600";done
Submitted batch job 69430135 on cluster es
Submitted batch job 69430136 on cluster es
Submitted batch job 69430137 on cluster es
Submitted batch job 69430138 on cluster es
Submitted batch job 69430139 on cluster es
Submitted batch job 69430140 on cluster es
Submitted batch job 69430141 on cluster es
Submitted batch job 69430142 on cluster es
Submitted batch job 69430143 on cluster es
Submitted batch job 69430144 on cluster es
Submitted batch job 69430145 on cluster es
Submitted batch job 69430146 on cluster es
Submitted batch job 69430147 on cluster es
Submitted batch job 69430148 on cluster es
Submitted batch job 69430149 on cluster es
Submitted batch job 69430150 on cluster es
Submitted batch job 69430151 on cluster es
Submitted batch job 69430152 on cluster es
Submitted batch job 69430153 on cluster es
Submitted batch job 69430154 on cluster es
Matthew.Ezell@gaea10:~/llnfail> sinfo --local -N -n dtn[01-16] -p rdtn -o '%10N %8T %C'
NODELIST   STATE    CPUS(A/I/O/T)
dtn01      mixed    23/105/0/128
dtn02      mixed    4/124/0/128
dtn03      drained  0/0/128/128
dtn04      mixed    3/125/0/128
dtn05      mixed    2/126/0/128
dtn06      mixed    4/124/0/128
dtn07      mixed    3/125/0/128
dtn08      mixed    2/126/0/128
dtn09      mixed    1/127/0/128
dtn10      mixed    3/125/0/128
dtn11      mixed    3/125/0/128
dtn12      mixed    2/126/0/128
dtn13      mixed    2/126/0/128
dtn14      mixed    3/125/0/128
dtn15      drained  0/0/128/128
dtn16      down*    0/0/128/128
Matthew.Ezell@gaea10:~/llnfail> cat *|sort|uniq -c
     20 dtn01


If you don't specify --ntasks-per-node, the jobs seem to be spread out as you might expect:
<snip>
Matthew.Ezell@gaea10:~/llnfail> cat *|sort|uniq -c
      2 dtn01
      1 dtn02
      1 dtn04
      2 dtn05
      1 dtn06
      1 dtn07
      2 dtn08
      3 dtn09
      1 dtn10
      1 dtn11
      2 dtn12
      2 dtn13
      1 dtn14


This is not the behavior I expected - can you help me understand if this is a misconfiguration on our part or a Slurm bug? Thanks.
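For what it's worth, a toy model of least-loaded-node selection illustrates the symptom. This is plain Python, not Slurm source, and the node names and CPU counts are assumed: if the scheduler ranks nodes by a value that doesn't change as jobs are placed (e.g. a per-node CPU total instead of the currently idle count), every comparison ties and the first node wins every time.

```python
# Toy model of least-loaded-node (LLN) selection -- plain Python, NOT Slurm
# source. Node names and CPU counts are assumed for illustration.
import copy

nodes = {f"dtn{i:02d}": {"total": 128, "alloc": 3} for i in range(1, 5)}

def schedule(nodes, njobs, metric):
    """Place njobs one at a time on the node maximizing `metric`.
    Ties go to the first node in iteration order (a stable argmax)."""
    placements = []
    for _ in range(njobs):
        best = max(nodes, key=lambda n: metric(nodes[n]))
        nodes[best]["alloc"] += 1
        placements.append(best)
    return placements

# Correct metric: currently idle CPUs -- the value drops after each
# placement, so successive jobs rotate across the nodes.
spread = schedule(copy.deepcopy(nodes), 8, lambda s: s["total"] - s["alloc"])

# Buggy metric: a constant per-node total -- every comparison ties, so the
# first node wins each round and all jobs pile up on it.
pile = schedule(copy.deepcopy(nodes), 8, lambda s: s["total"])

print(sorted(set(spread)), sorted(set(pile)))
# prints: ['dtn01', 'dtn02', 'dtn03', 'dtn04'] ['dtn01']
```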
Comment 1 Jason Booth 2020-09-18 16:25:26 MDT
Hi Matt. We will need to look into this and get back to you once we analyze this a bit more.
Comment 4 Marcin Stolarek 2020-09-24 09:13:18 MDT
Matt,

I reproduced your issue on both 19.05 and the master branch; it is not specific to cons_res (cons_tres behaves the same way).
I see where the issue comes from, but I will need to work on the case a little longer to find an appropriate solution.

cheers,
Marcin
Comment 5 Marcin Stolarek 2020-10-07 03:23:48 MDT
Matt,

I discussed the bug with our senior developer and we concluded that fixing it requires deep changes in the select plugin, which we cannot introduce in a released version. I'll continue working on the patch for the master branch.

cheers,
Marcin
Comment 6 Matt Ezell 2020-10-08 14:25:01 MDT
(In reply to Marcin Stolarek from comment #5)
> Matt,
> 
> I discussed the bug with our senior developer and we concluded that fixing
> it requires deep changes in the select plugin, which we cannot introduce in
> a released version. I'll continue working on the patch for the master branch.
> 
> cheers,
> Marcin

Thanks for the update. Luckily this cluster is for data transfer (not MPI), so --ntasks-per-node > 1 doesn't make sense. We've asked users to omit this parameter in the meantime.

What's the likelihood that you will be able to get a viable fix in before the 20.11 code cutoff? Are we looking at 21.08?
Comment 8 Marcin Stolarek 2020-10-12 06:17:52 MDT
Matt,

I have a patch that I'm passing to review now. I think we should be able to fix it ahead of the 20.11 release; however, that's not something I can guarantee today.

The patch can't easily be applied to the Slurm 19.05 cons_res plugin, though it can be adapted to cons_tres. Starting from version 20.02, the patch is compatible with both consumable resource plugins.

cheers,
Marcin
Comment 14 Marcin Stolarek 2020-10-21 01:00:31 MDT
Matt,

The issue is now fixed on our master branch by the following commits:
37d40e94ff Fix _eval_nodes_lln in cons_tres and cons_res
9339f36a00 No logic change - remove redundant line from _eval_nodes_lln
5ce271ee13 Use avail_res->max_cpus as number of CPUs available on node
b88992a6ac No logic change - use avail_cpus instead of max_cpus

Since they touch a fundamental part of resource selection, these will be released in Slurm 20.11. However, we didn't see any regressions, and technically you can apply them on top of 20.02 (there is no protocol/compatibility change); for 19.05, see comment 8.

I'm closing the bug report now. Should you have any questions please don't hesitate to reopen.

cheers,
Marcin