During local testing with multiple slurmd on the same machine, I added static features to slurm.conf:

NodeName=node-1 [...] Feature=rack1,blue
NodeName=node-2 [...] Feature=rack2,red

And I verified that it is applied correctly:

$ sinfo -a -o "%8N | %16b | %16f"
NODELIST | ACTIVE_FEATURES  | AVAIL_FEATURES
node-1   | rack1,blue       | rack1,blue
node-2   | rack2,red        | rack2,red

Given the node configuration, I would expect the constraint "[rack1*1&rack2*1]&blue" to not be schedulable. I interpret it as requesting 2 nodes, one node from rack1 and one node from rack2, but *both* of them must be "blue".

It works fine with Slurm 20.11.1:

$ srun -N2 --constraint="[rack1*1&rack2*1]&blue" bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME'
node-[1-2] node-2
node-[1-2] node-1

The following should, I believe, be equivalent, and is rejected properly:

$ srun -N2 --constraint="[(rack1&blue)*1&(rack2&blue)*1]" bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Requested node configuration is not available

If some combinations of feature operators are not supported, it should probably be documented.
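To make the expected semantics concrete, here is a minimal Python sketch of the interpretation described above: each counted feature in the brackets claims a distinct node, and the trailing "&blue" must hold for every selected node. The function name `satisfiable` and the brute-force search are illustrative assumptions only; this is not how slurmctld evaluates constraints.

```python
from itertools import permutations

# Static features taken from the slurm.conf excerpt in the report.
NODES = {
    "node-1": {"rack1", "blue"},
    "node-2": {"rack2", "red"},
}

def satisfiable(counted, common, nodes=NODES):
    """Hypothetical helper: can distinct nodes fill every counted slot?

    counted: list of (feature, count) pairs, e.g. [("rack1", 1), ("rack2", 1)]
    common:  set of features that every selected node must also carry
    """
    # Expand the counts into one slot per requested node.
    slots = [feat for feat, count in counted for _ in range(count)]
    # Only nodes carrying all of the common features are eligible.
    eligible = [name for name, feats in nodes.items() if common <= feats]
    # Brute force: try every ordered assignment of eligible nodes to slots.
    for combo in permutations(eligible, len(slots)):
        if all(slot in nodes[name] for slot, name in zip(slots, combo)):
            return True
    return False

# "[rack1*1&rack2*1]" alone can be satisfied by node-1 plus node-2.
print(satisfiable([("rack1", 1), ("rack2", 1)], set()))     # True
# Adding "&blue" leaves only node-1 eligible, so two slots cannot be filled.
print(satisfiable([("rack1", 1), ("rack2", 1)], {"blue"}))  # False
```

Under this reading the second request has no valid node set, which is why the job being scheduled looks wrong.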
To clarify, above I mentioned "It works fine with Slurm 20.11.1". But by that I mean "the job is being scheduled while it should not be possible".
Hi Felix,

I can reproduce what you're seeing.

NodeName=n1-[1-5] NodeAddr=localhost Port=31001-31005 Features=rack1,blue
NodeName=n1-[6-10] NodeAddr=localhost Port=31006-31010 Features=rack2,red

$ srun -N2 -C '[rack1*1&rack2*1]&blue' whereami
0001 n1-6 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8
0000 n1-1 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8

So you can see that it follows the constraint within the brackets, but ignores the "&blue" at the end. But it's not as simple as saying that brackets and '&' don't work together, since the '&' does work inside the brackets.

I found that this syntax isn't actually supported (it's a known issue). I'd like to document unsupported syntax, but it's not clear to me yet exactly what syntax is and isn't supported. I'm also afraid that any list I make will be incomplete, though I think an incomplete list of unsupported syntax is better than no list.

I'm still looking into this, so I'll get back to you with more information.
Thanks Marshall,

I also raised issues with the constraint DSL in https://bugs.schedmd.com/show_bug.cgi?id=9567. Right now, node_features plugins need to parse the constraint language in a way that is compatible with the Slurm code, and that language is not documented.

Moving forward, it would be very helpful if we could start by defining a simple grammar for this constraint language.
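As an illustration of what "a simple grammar" could look like, here is a hedged Python sketch of a strict recursive-descent parser. The grammar in the comment, the token rules, and the names `tokenize` and `parse` are all assumptions made up for this example. It is not Slurm's parser, and it does not describe what Slurm currently accepts (for instance, it allows nested groups and gives '&' and '|' equal precedence purely for simplicity).

```python
import re

# Hypothetical grammar, NOT Slurm's actual parser:
#   expr := term (("&" | "|") term)*
#   term := atom ("*" COUNT)?
#   atom := NAME | "(" expr ")" | "[" expr "]"
TOKEN = re.compile(
    r"\s*(?:(?P<name>[A-Za-z_][\w.-]*)|(?P<count>\d+)|(?P<op>[&|*()\[\]]))"
)
CLOSERS = {"(": ")", "[": "]"}

def tokenize(text):
    """Split a constraint string into (kind, value) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens

def parse(text):
    """Return a nested-tuple tree, or raise ValueError on malformed input."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def atom():
        kind, value = advance()
        if kind == "name":
            return ("feature", value)
        if kind == "op" and value in CLOSERS:
            inner = expr()
            # The closing delimiter must match the opening one.
            if advance() != ("op", CLOSERS[value]):
                raise ValueError(f"expected {CLOSERS[value]!r}")
            return ("group", value, inner)
        raise ValueError(f"expected a feature or group, got {value!r}")

    def term():
        node = atom()
        if peek() == ("op", "*"):
            advance()
            kind, value = advance()
            if kind != "count":
                raise ValueError("expected a node count after '*'")
            node = ("count", node, int(value))
        return node

    def expr():
        node = term()
        while peek() in (("op", "&"), ("op", "|")):
            _, op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {peek()[1]!r}")
    return tree

print(parse("[rack1*1&rack2*1]&blue"))
```

With a written-down grammar like this, a node_features plugin and the controller could share one definition of what is valid, and anything outside the grammar would fail with an explicit error rather than being silently accepted.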
Adding more examples of constraint expressions that are parsed/interpreted incorrectly or in a surprising way.

1) Trailing & operator

$ srun -N1 --constraint="rack1&" bash -c 'echo $SLURMD_NODENAME'
node-1

2) Trailing | operator

$ srun -N1 --constraint="rack1|" bash -c 'echo $SLURMD_NODENAME'
node-1

3) Misaligned parentheses / brackets

$ srun -N1 --constraint="([rack1)]" bash -c 'echo $SLURMD_NODENAME'
node-1

4) OR inside parentheses *when a node_features plugin is used*

Without a node_features plugin in use:

$ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
node-2
node-1

With a node_features plugin active (but no changeable features are being used here):

$ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Invalid feature specification

You have to look at the source code to understand why it's happening:
https://github.com/SchedMD/slurm/blob/fcd64016188aec04cf16122b09b314b18914e83c/src/slurmctld/job_scheduler.c#L4615-L4624

5) OR not working with arbitrary expressions

$ srun -N2 --constraint="[blue*1&red*1]|[rack1*1&rack2*1]" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Requested node configuration is not available

But --constraint="[blue*1&red*1]" or --constraint="[rack1*1&rack2*1]" work fine.

6) Operator * can be specified multiple times

$ srun -N1 --constraint="rack1*10*1" bash -c 'echo $SLURMD_NODENAME'
node-1

7) You can't nest parentheses

$ srun -N1 --constraint="((blue|red)&rack1)" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Invalid feature specification

$ srun -N1 --constraint="(blue|red)&rack1" bash -c 'echo $SLURMD_NODENAME'
node-1

8) Operator precedence is not well-defined; & and | have the same precedence.

$ srun -N1 --constraint="blue|rack1&rack2" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Requested node configuration is not available

In some languages, like C, & has a higher precedence than |, so the expression would be equivalent to blue|(rack1&rack2). But here it's not the case:

$ srun -N1 --constraint="blue|(rack1&rack2)" bash -c 'echo $SLURMD_NODENAME'
node-1

Combined with 4) and 7), this is unfortunate since it means you can't reliably use parentheses to enforce the order of operations.
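The precedence point in 8) can be illustrated with a small Python sketch that evaluates "blue|rack1&rack2" against the two test nodes in two ways: strictly left to right with equal precedence, and with '&' binding tighter than '|' as in C. The left-to-right model is an assumption that happens to match the observed srun results; it has not been checked against the Slurm source. Both function names are hypothetical.

```python
# Static features of the two test nodes from the original report.
NODES = {"node-1": {"rack1", "blue"}, "node-2": {"rack2", "red"}}

def eval_left_to_right(tokens, feats):
    """Evaluate with '&' and '|' at equal precedence, strictly left to right."""
    result = tokens[0] in feats
    for op, name in zip(tokens[1::2], tokens[2::2]):
        has = name in feats
        result = (result and has) if op == "&" else (result or has)
    return result

def eval_c_like(tokens, feats):
    """Evaluate with '&' binding tighter than '|', as in C."""
    # Split on '|' first, then require every feature within each '&' group.
    groups, current = [], [tokens[0]]
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "|":
            groups.append(current)
            current = [name]
        else:
            current.append(name)
    groups.append(current)
    return any(all(name in feats for name in group) for group in groups)

expr = ["blue", "|", "rack1", "&", "rack2"]
for node, feats in NODES.items():
    print(node, eval_left_to_right(expr, feats), eval_c_like(expr, feats))
# node-1 False True
# node-2 False False
```

Under the left-to-right model no node matches, which is consistent with the "Requested node configuration is not available" error, while the C-like model would select node-1, as the explicitly parenthesized form does.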
Hi Felix,

Sorry for taking a while to get back to you. As you have found, some syntax simply isn't supported. Some of the history with this involves how KNL manages features. Another thing to realize is that our syntax for constraints does *not* work exactly like C (or other programming languages), and is certainly not fully fleshed out.

Exactly what we support, and why and how, is a little complicated and will take me some time to untangle. There might be a bug, or it might just be a matter of clarifying documentation on what is or isn't supported. However, because languages and syntax are complicated, any documentation we have will most likely be incomplete.
Felix,

I have a quick update on this bug. I think most of the things you've posted are basically by design, or are oversights due to an incomplete DSL for node features, or are cases where we simply don't reject unexpected syntax that doesn't work. Since you have a pending bug 12286 that requests us to make this work better, especially with the node_features/helpers plugin, I'm not going to pursue those.

However, I do want to pursue one of your examples further as a potential bug. It looks like a bug to me, anyway, and I hope it shouldn't be too hard to fix.

(In reply to Felix Abecassis from comment #8)
> 4) OR inside parentheses *when a node_features plugin is used*
> Without a node_features plugin in use:
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> node-2
> node-1
>
> With a node_features plugin active (but no changeable features are being
> used here):
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> srun: error: Unable to allocate resources: Invalid feature specification
>
> You have to look at the source code to understand why it's happening:
> https://github.com/SchedMD/slurm/blob/fcd64016188aec04cf16122b09b314b18914e83c/src/slurmctld/job_scheduler.c#L4615-L4624

This also still happens.
# helpers.conf:
Feature=asdf

# slurm.conf:
NodeFeaturesPlugins=node_features/helpers
NodeName=n1-[1-10] NodeAddr=voyager Port=11001-11010 \
    Features=rack1,rack2,red,blue

If I remove those features from the NodeName definition and put them in helpers.conf, I still see the error, but I also see an additional entry in slurmctld.log:

# helpers.conf
Feature=rack1
Feature=rack2
Feature=red
Feature=blue

$ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
srun: error: Unable to allocate resources: Invalid feature specification

slurmctld log:
[2021-10-04T10:21:15.835] error: operator(s) "[]()|*" not allowed in constraint "(rack1|rack2)&(red|blue)" when using changeable features

So this is by design for changeable features (and that is, I believe, what you are requesting to fix in bug 12286). But it is supposed to work for static features. However, if we have a node_features plugin configured, it looks like we still reject the job even if no changeable features are requested. I'm going to look into this further.
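The difference between the observed and the intended behavior can be modeled with a short Python sketch. The operator set is copied from the slurmctld log message above; everything else (the function names, the way feature names are extracted) is a hypothetical simplification and is not taken from the Slurm source.

```python
import re

# Operators named in the slurmctld log message for changeable features.
SPECIAL = set("[]()|*")

def requested_features(constraint):
    """Extract feature names, ignoring operators and node counts after '*'."""
    without_counts = re.sub(r"\*\d+", "", constraint)
    return set(re.findall(r"[^\[\]()|&*\s]+", without_counts))

def observed_reject(constraint, plugin_configured):
    """Model of the reported behavior: any special operator is rejected
    as soon as a node_features plugin is configured."""
    return plugin_configured and bool(SPECIAL & set(constraint))

def intended_reject(constraint, changeable):
    """Model of the intended behavior: reject only when a changeable
    feature is actually requested together with a special operator."""
    uses_changeable = bool(requested_features(constraint) & changeable)
    return uses_changeable and bool(SPECIAL & set(constraint))

constraint = "(rack1|rack2)&(red|blue)"

# First configuration: only "asdf" is changeable, the rest are static.
print(observed_reject(constraint, plugin_configured=True))   # True
print(intended_reject(constraint, changeable={"asdf"}))      # False

# Second configuration: all four features moved into helpers.conf.
print(intended_reject(constraint, changeable={"rack1", "rack2", "red", "blue"}))  # True
```

In the first configuration the two models disagree, which is the case being treated as a bug here; in the second they agree, matching the "by design" rejection for changeable features.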
Felix,

(In reply to Felix Abecassis from comment #8)
> 4) OR inside parentheses *when a node_features plugin is used*
> Without a node_features plugin in use:
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> node-2
> node-1
>
> With a node_features plugin active (but no changeable features are being
> used here):
> $ srun -N2 --constraint="(rack1|rack2)&(red|blue)" bash -c 'echo $SLURMD_NODENAME'
> srun: error: Unable to allocate resources: Invalid feature specification
>
> You have to look at the source code to understand why it's happening:
> https://github.com/SchedMD/slurm/blob/fcd64016188aec04cf16122b09b314b18914e83c/src/slurmctld/job_scheduler.c#L4615-L4624

We fixed this issue in commit d50fb97b61, which will be in 21.08.3. This will allow using the special expressions (|, &, parentheses, brackets) when requesting static features even if a node_features plugin is configured.

For the rest of the requests here, I'm going to let them be handled by the RFE in bug 12286. If after that bug is resolved there are still some issues with constraints, let's either re-open this ticket or open a new ticket but reference this one.

Thanks for the bug report, and for your great sleuthing to make fixing this specific bug pretty easy. Sorry it took so long to get this fix in.

I'm closing this as resolved/fixed in 21.08.3.

- Marshall