Ticket 10707 - --constraint doesn't work when combining operators "[]" and "&"
Status: RESOLVED FIXED
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.1
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
Reported: 2021-01-26 18:04 MST by Felix Abecassis
Modified: 2023-03-13 15:47 MDT
Site: NVIDIA (PSLA)
Version Fixed: 21.08.3 22.05.0pre1
Description Felix Abecassis 2021-01-26 18:04:46 MST
During local testing with multiple slurmd on the same machine, I added static features to slurm.conf:
NodeName=node-1 [...] Feature=rack1,blue
NodeName=node-2 [...] Feature=rack2,red

And I verified that it is applied correctly:
$ sinfo -a -o "%8N | %16b | %16f"
NODELIST | ACTIVE_FEATURES  | AVAIL_FEATURES  
node-1   | rack1,blue       | rack1,blue      
node-2   | rack2,red        | rack2,red  

Given the node configuration, I would expect the constraint "[rack1*1&rack2*1]&blue" to not be schedulable. I interpret it as requesting 2 nodes, one node from rack1, and one node from rack2, but *both* of them must be "blue".

It works fine with Slurm 20.11.1:
$ srun -N2  --constraint="[rack1*1&rack2*1]&blue" bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME'
node-[1-2] node-2
node-[1-2] node-1

The following should, I believe, be equivalent, and is rejected properly:
$ srun -N2  --constraint="[(rack1&blue)*1&(rack2&blue)*1]" bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Requested node configuration is not available
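
The interpretation described above can be sketched as a brute-force check. This is only a model of the intended semantics, not Slurm code: the helper name `satisfiable` is hypothetical, and the node features are copied from the sinfo output above.

```python
from itertools import permutations

# Node features copied from the sinfo output in this ticket.
NODES = {
    "node-1": {"rack1", "blue"},
    "node-2": {"rack2", "red"},
}

def satisfiable(nodes, per_node_requirements):
    """Hypothetical helper: True if distinct nodes can be matched one-to-one
    with the requirement sets, each node holding every required feature."""
    for chosen in permutations(nodes, len(per_node_requirements)):
        if all(req <= nodes[name]
               for name, req in zip(chosen, per_node_requirements)):
            return True
    return False

# "[rack1*1&rack2*1]&blue" read as: one rack1 node, one rack2 node, both blue.
expected = satisfiable(NODES, [{"rack1", "blue"}, {"rack2", "blue"}])

# What the scheduler appears to evaluate when the trailing "&blue" is ignored.
observed = satisfiable(NODES, [{"rack1"}, {"rack2"}])

print(expected, observed)  # False True
```

Under this reading the request cannot be satisfied, because node-2 is not "blue"; it only becomes schedulable if the trailing "&blue" is dropped.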


If some combinations of features operators are not supported, it should probably be documented.
Comment 1 Felix Abecassis 2021-01-26 18:06:41 MST
To clarify, above I mentioned "It works fine with Slurm 20.11.1". But by that I mean "the job is being scheduled while it should not be possible".
Comment 4 Marshall Garey 2021-01-28 15:59:54 MST
Hi Felix,

I can reproduce what you're seeing.

NodeName=n1-[1-5] NodeAddr=localhost Port=31001-31005 Features=rack1,blue
NodeName=n1-[6-10] NodeAddr=localhost Port=31006-31010 Features=rack2,red

$ srun -N2 -C '[rack1*1&rack2*1]&blue' whereami

0001 n1-6 - Cpus_allowed:       00000101        Cpus_allowed_list:      0,8
0000 n1-1 - Cpus_allowed:       00000101        Cpus_allowed_list:      0,8


So you can see that it is following the constraint within the brackets, but ignores the "&blue" at the end.

But it's not as simple as saying that brackets and '&' don't work together, since the '&' does work inside the brackets. 


I found that this syntax isn't actually supported (it's a known issue). I'd like to document unsupported syntax, but it's not clear to me yet exactly what syntax is and isn't supported. I'm also afraid that any list I make will be incomplete, though I think an incomplete list of unsupported syntax is better than no list.

I'm still looking into this, so I'll get back to you with more information.
Comment 5 Felix Abecassis 2021-01-28 16:55:23 MST
Thanks Marshall. I also raised issues with the constraint DSL in https://bugs.schedmd.com/show_bug.cgi?id=9567. Right now, node_features plugins need to parse the constraint language in a way that is compatible with the Slurm code, and it's not documented.

Moving forward, it would be very helpful if we could start by defining a simple grammar for this constraint language.
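
As an illustration of what such a grammar could look like, here is a minimal sketch of a strict validator. The grammar is an assumption made up for this example and is not Slurm's actual parser; the names `tokenize`, `validate` and `is_valid` are hypothetical.

```python
import re

# Hypothetical grammar (an assumption, not what slurmctld implements):
#   expr := term (("&" | "|") term)*
#   term := atom ("*" COUNT)?
#   atom := NAME | "(" expr ")" | "[" expr "]"
NAME = r"[A-Za-z_][A-Za-z0-9_]*"
TOKEN = re.compile(NAME + r"|\d+|[&|*()\[\]]")
CLOSERS = {"(": ")", "[": "]"}

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

def parse_atom(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("expected a feature name, found end of input")
    tok = tokens[pos]
    if tok in CLOSERS:
        pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != CLOSERS[tok]:
            raise ValueError(f"missing {CLOSERS[tok]!r} to close {tok!r}")
        return pos + 1
    if re.fullmatch(NAME, tok):
        return pos + 1
    raise ValueError(f"expected a feature name, found {tok!r}")

def parse_term(tokens, pos):
    pos = parse_atom(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "*":
        if pos + 1 >= len(tokens) or not tokens[pos + 1].isdigit():
            raise ValueError("'*' must be followed by a node count")
        pos += 2
    return pos

def parse_expr(tokens, pos):
    pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("&", "|"):
        pos = parse_term(tokens, pos + 1)
    return pos

def validate(text):
    """Raise ValueError unless the whole string matches the grammar."""
    tokens = tokenize(text)
    pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")

def is_valid(text):
    try:
        validate(text)
        return True
    except ValueError:
        return False

for constraint in ["[rack1*1&rack2*1]&blue", "rack1&", "([rack1)]"]:
    print(constraint, is_valid(constraint))
```

With a grammar like this, a trailing operator or a mismatched bracket is rejected at parse time instead of being silently accepted.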
Comment 8 Felix Abecassis 2021-02-04 15:33:28 MST
Adding more examples of constraint expressions that are parsed/interpreted incorrectly or in a surprising way.


1) Trailing & operator
$ srun -N1 --constraint="rack1&" bash -c 'echo $SLURMD_NODENAME'
node-1


2) Trailing | operator
$ srun -N1 --constraint="rack1|" bash -c 'echo $SLURMD_NODENAME'
node-1


3) Misaligned parentheses / brackets
$ srun -N1 --constraint="([rack1)]" bash -c 'echo $SLURMD_NODENAME'
node-1


4) OR inside parentheses *when a node_features plugin is used*
Without a node_features plugin in use:
$ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
node-2
node-1

With a node_features plugin active (but no changeable features are being used here):
$ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Invalid feature specification

You have to look at the source code to understand why it's happening:
https://github.com/SchedMD/slurm/blob/fcd64016188aec04cf16122b09b314b18914e83c/src/slurmctld/job_scheduler.c#L4615-L4624


5) OR not working with arbitrary expressions
$ srun -N2 --constraint="[blue*1&red*1]|[rack1*1&rack2*1]" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Requested node configuration is not available
But --constraint="[blue*1&red*1]" or --constraint="[rack1*1&rack2*1]" work fine.


6) Operator * can be specified multiple times
$ srun -N1 --constraint="rack1*10*1" bash -c 'echo $SLURMD_NODENAME'
node-1


7) You can't nest parentheses
$ srun -N1 --constraint="((blue|red)&rack1)" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Invalid feature specification

$ srun -N1 --constraint="(blue|red)&rack1" bash -c 'echo $SLURMD_NODENAME'
node-1


8) Operator precedence is not well-defined: & and | have the same precedence.
$ srun -N1 --constraint="blue|rack1&rack2" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Requested node configuration is not available

In some languages, like C, & has a higher precedence than |, so the expression would be equivalent to blue|(rack1&rack2). But here it's not the case:
$ srun -N1 --constraint="blue|(rack1&rack2)" bash -c 'echo $SLURMD_NODENAME'
node-1

Combined with 4) and 7), this is unfortunate since it means you can't use parentheses to enforce the order of operations.
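
The difference in example 8 can be reproduced with a small model. It assumes that equal precedence means strict left-to-right evaluation, which is consistent with the output above but is an assumption about slurmctld rather than something confirmed here. Both function names are hypothetical.

```python
# Features of the two test nodes, copied from the sinfo output in this ticket.
NODE_1 = {"rack1", "blue"}
NODE_2 = {"rack2", "red"}

def eval_left_to_right(tokens, features):
    """Evaluate NAME (OP NAME)* with '&' and '|' at equal precedence."""
    result = tokens[0] in features
    for op, name in zip(tokens[1::2], tokens[2::2]):
        present = name in features
        result = (result and present) if op == "&" else (result or present)
    return result

def eval_c_like(tokens, features):
    """Evaluate the same tokens with '&' binding tighter than '|', as in C."""
    groups, current = [], []
    for tok in tokens:
        if tok == "|":
            groups.append(current)
            current = []
        elif tok != "&":
            current.append(tok)
    groups.append(current)
    # The expression is an OR of AND-groups.
    return any(all(name in features for name in group) for group in groups)

tokens = ["blue", "|", "rack1", "&", "rack2"]  # blue|rack1&rack2

for name, features in [("node-1", NODE_1), ("node-2", NODE_2)]:
    print(name, eval_left_to_right(tokens, features), eval_c_like(tokens, features))
```

Read left to right, the expression becomes (blue|rack1)&rack2, which neither node satisfies; with C-like precedence it becomes blue|(rack1&rack2), which node-1 satisfies.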
Comment 10 Marshall Garey 2021-03-15 11:47:47 MDT
Hi Felix,

Sorry for taking a while to get back to you. As you have found, some syntax simply isn't supported. Some of the history with this involves how KNL manages features. Another thing to realize is that our syntax for constraints does *not* work exactly like C (or other programming languages), and is certainly not fully fleshed out.

Exactly what we support and why and how is a little complicated and will take me some time to untangle. There might be a bug, or it might just be a matter of clarifying documentation on what is or isn't supported. However, because languages and syntax are complicated, any documentation we have will most likely be incomplete.
Comment 15 Marshall Garey 2021-10-14 16:22:48 MDT
Felix,

I have a quick update on this bug.

I think most of the things you've posted are basically by design, or are oversights due to an incomplete DSL for node features, or just not rejecting all unexpected syntax that doesn't work. Since you have a pending bug 12286 that requests us to make this work better especially with the node_features/helpers plugin, I'm not going to pursue those.

However, I do want to pursue one of your examples further as a potential bug. It looks like a bug to me, anyway, and I hope it shouldn't be too hard to fix.

(In reply to Felix Abecassis from comment #8)
> 4) OR inside parentheses *when a node_features plugin is used*
> Without a node_features plugin in use:
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> node-2
> node-1
>
> With a node_features plugin active (but no changeable features are being used here):
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> srun: error: Unable to allocate resources: Invalid feature specification
>
> You have to look at the source code to understand why it's happening:
> https://github.com/SchedMD/slurm/blob/fcd64016188aec04cf16122b09b314b18914e83c/src/slurmctld/job_scheduler.c#L4615-L4624

This also still happens.

# helpers.conf:
Feature=asdf

# slurm.conf:
NodeFeaturesPlugins=node_features/helpers
NodeName=n1-[1-10] NodeAddr=voyager Port=11001-11010 \
     Features=rack1,rack2,red,blue

If I remove those features from the NodeName definition and put them in helpers.conf, I still see the error but I see an additional log in slurmctld.log:

# helpers.conf
Feature=rack1
Feature=rack2
Feature=red
Feature=blue

$ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Invalid feature specification

slurmctld log:
[2021-10-04T10:21:15.835] error: operator(s) "[]()|*" not allowed in constraint "(rack1|rack2)&(red|blue)" when using changeable features

So this is by design for changeable features (and that is, I believe, what you are asking us to fix in bug 12286). But it is supposed to work for static features.

But, if we have a node_features plugin configured, it looks like we still reject the job even if no changeable features are requested.



I'm going to look into this further.
Comment 22 Marshall Garey 2021-10-26 14:14:06 MDT
Felix,

(In reply to Felix Abecassis from comment #8)
> 4) OR inside parentheses *when a node_features plugin is used*
> Without a node_features plugin in use:
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> node-2
> node-1
>
> With a node_features plugin active (but no changeable features are being used here):
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> srun: error: Unable to allocate resources: Invalid feature specification
>
> You have to look at the source code to understand why it's happening:
> https://github.com/SchedMD/slurm/blob/fcd64016188aec04cf16122b09b314b18914e83c/src/slurmctld/job_scheduler.c#L4615-L4624

We fixed this issue in commit d50fb97b61, which will be in 21.08.3. This will allow using the special expressions (|, &, parentheses, brackets) when requesting static features even if a node_features plugin is configured.
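
The behavior change can be summarized with a small model. This is a sketch of the rule as described in this ticket, not the code from the commit: the function names are hypothetical, and the operator set is taken from the slurmctld log message quoted in comment 15.

```python
import re

# Operators named in the slurmctld error message quoted in this ticket.
SPECIAL_OPERATORS = set('[]()|*')

def uses_special_operators(constraint):
    return any(ch in SPECIAL_OPERATORS for ch in constraint)

def feature_names(constraint):
    """Feature names mentioned in the constraint (node counts are skipped)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", constraint))

def rejected_before_fix(constraint, plugin_configured):
    # Model of the reported behavior: any special operator is refused as
    # soon as a node_features plugin is configured.
    return plugin_configured and uses_special_operators(constraint)

def rejected_after_fix(constraint, changeable_features):
    # Model of the described fix: special operators are refused only when
    # the constraint actually names a changeable feature.
    names = feature_names(constraint)
    return uses_special_operators(constraint) and bool(names & changeable_features)

constraint = "(rack1|rack2)&(red|blue)"
print(rejected_before_fix(constraint, plugin_configured=True))
print(rejected_after_fix(constraint, changeable_features={"asdf"}))
print(rejected_after_fix(constraint, changeable_features={"rack1", "rack2", "red", "blue"}))
```

In this model the static-feature request from comment 8 is refused before the fix and accepted after it, while the same expression over changeable features is still refused.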

For the rest of the requests here, I'm going to let them be handled by the RFE in bug 12286. If after that bug is resolved there are still some issues with constraints, let's either re-open this ticket or open a new ticket but reference this one.

Thanks for the bug report, and for your great sleuthing to make fixing this specific bug pretty easy. Sorry it took so long to get this fix in. I'm closing this as resolved/fixed in 21.08.3.

- Marshall