As was pointed out in the original RFC, the new node_features/helpers plugin does not support arbitrary job constraints as it would require understanding the full DSL: https://bugs.schedmd.com/show_bug.cgi?id=9567 (item a.) And parsing the DSL would have to be done exactly like Slurm does, see https://bugs.schedmd.com/show_bug.cgi?id=10707 This is still the case with the current version, for instance asking for 2 nodes on the same rack, with the NUMA Per Socket (NPS) changeable feature set to 2: $ srun --constraint="[rack1|rack2]&nps=2" -N2 true srun: error: Unable to allocate resources: Invalid feature specification From the slurmctld log: slurmctld: error: operator(s) "[]()|*" not allowed in constraint "[rack1|rack2]&mig=on" when using changeable features The code: https://github.com/SchedMD/slurm/blob/ae667b9858c78d8c4e355acf35a3db5535dc9738/src/plugins/node_features/helpers/node_features_helpers.c#L475-L479 This RFE is filed to fix this gap in a future release.
Hi Felix, Luke, Julie, The parentheses and bar operators are now functional for the node_features/helpers plugin. This work is upstream in the following commits: 5b1f4911e3..bca101dc65 Here are some clarifications of the new behavior: * For changeable features, the bar operator always means Matching OR, not OR. For example: salloc -N2 -C'a|b' - OR means that one node could have feature 'a' and the other node could have feature 'b'. Maybe one of the nodes has both features, but not all nodes have 'a' or not all nodes have 'b'. This does not make sense for changeable features. If your user requests gpu configuration a or gpu configuration b, they don't want some nodes to be in configuration a and other nodes to be in configuration b. - Matching OR means that all nodes in the job allocation have 'a' or all nodes have 'b'. Some or all nodes could have both. - The node_features/helpers plugin finds the first set of features that matches all nodes in the job. - If at least one bar operator is in the feature expression and at least one of the features is a changeable feature, then all bar operators are converted to Matching OR. * Although the SOW specified that one set of parentheses would be allowed, we were able to allow as many sets of parentheses as you want. - For example, (a|b)&(c|d)|(e&f) is equivalent to (a&c)|(a&d)|(b&c)|(b&d)|(e&f) - Nesting parentheses is still not allowed. Here is the updated documentation for the salloc man page (the sbatch and srun man pages have the same changes). https://github.com/SchedMD/slurm/blob/bca101dc6530f988d9722beb10eb4e2bbde736e3/doc/man/man1/salloc.1#L228-L354 Do you have any questions about this?
*** Ticket 14273 has been marked as a duplicate of this ticket. ***
Please take a look at https://bugs.schedmd.com/show_bug.cgi?id=15760 as I was not able to test the feature yet. Thanks!
Minor issue, but when I tested using a non-existing constraint rack3 in expression "--constraint=rack1|rack3&nps=4" then the error message from slurmctld shows an weird expression: slurmctld: Job specs has invalid feature list: (rack1&[nps=4)|(rack3&nps=4)] It looks like Slurm accepts having a closing ')' before a closing ']' in an like "(rack1&[nps=4)", but that's an highly unusual form for an expression, so please consider fixing it as the current form is hard to understand when debugging (I did not figure out immediately that rack3 did not exist, I thought the new code was generating invalid constraints expressions).
Also, to go back to the original bug description with expression "[rack1|rack2]&nps=2", my understanding is that with the current code it is rejected because of the square brackets, but "(rack1|rack2)&nps=2" is strictly equivalent because the "|" becomes a matching-OR, right? If that is so, what is the current issue with supporting "[rack1|rack2]&nps=2"? Is there a remaining case when dealing with square brackets that isn't solved?
(In reply to Felix Abecassis from comment #20) > Minor issue, but when I tested using a non-existing constraint rack3 in > expression "--constraint=rack1|rack3&nps=4" then the error message from > slurmctld shows an weird expression: > slurmctld: Job specs has invalid feature list: (rack1&[nps=4)|(rack3&nps=4)] > > It looks like Slurm accepts having a closing ')' before a closing ']' in an > like "(rack1&[nps=4)", but that's an highly unusual form for an expression, > so please consider fixing it as the current form is hard to understand when > debugging (I did not figure out immediately that rack3 did not exist, I > thought the new code was generating invalid constraints expressions). There are two bugs here. The first bug is actually just a cosmetic bug in the logs. Slurm first parses the user's string to build a job feature list, then attempts to reconstruct the string from the job feature list while doing some validation. The bug is in reconstructing the string, where you see the misplaced square bracket. I found this while working on this ticket, and I don't see any reason why the validation can't be done while constructing the job features list. I also see no reason why the string has to be reconstructed when we already have the original string supplied by the user. I already have an internal bug open to fix this. The second bug is, as you point out, Slurm accepting having a closing ')' before a closing ']'. Slurm also should not accept having an opening '[' after an opening '('. $ srun -C '(a&[b)|(c&b)]' hostname curiosity I'll fix this bug as well.
Following up on my last comment: bug 15715 is tracking the fixes. I made bug 15715 public so you could see it. and marked 12286 as see also. (In reply to Felix Abecassis from comment #21) > Also, to go back to the original bug description with expression > "[rack1|rack2]&nps=2", my understanding is that with the current code it is > rejected because of the square brackets, but "(rack1|rack2)&nps=2" is > strictly equivalent because the "|" becomes a matching-OR, right? That's correct. > If that is so, what is the current issue with supporting > "[rack1|rack2]&nps=2"? Is there a remaining case when dealing with square > brackets that isn't solved? That's right. Square brackets allows counts on a feature, which is essentially the same as a heterogeneous job. https://slurm.schedmd.com/sbatch.html#OPT_Multiple-Counts https://slurm.schedmd.com/sbatch.html#OPT_Brackets Ampersand inside of square brackets is different from ampersand outside of square brackets. Inside of square brackets, it is the count syntax and requires the asterisk. Outside of square brackets, it is a boolean AND. For example: $ srun -C 'a&c' whereami 0000 n1-3 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8 means one node with features a and c. $ srun -C '[a&c]' whereami srun: error: Unable to allocate resources: Invalid feature specification This is invalid because I didn't specify counts. $ srun -C '[a*1&c*1]' whereami 0000 n1-1 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8 0001 n1-3 - Cpus_allowed: 00000101 Cpus_allowed_list: 0,8 In addition, bar inside of parentheses inside of square brackets is parsed as boolean OR, but bar (not inside of parentheses) inside of square brackets is parsed as matching OR. It was much simpler to not allow square brackets in order to avoid the scheduling logic in Slurm that parses square brackets differently. With the node_features/helpers plugin, you can get the same behavior as the count syntax by requesting a heterogeneous job instead, where each heterogeneous job component would request a different --constraint expression. (The count syntax preceded heterogenous jobs.)
Thanks for the explanation, it makes sense now!
(In reply to Felix Abecassis from comment #24) > Thanks for the explanation, it makes sense now! You're welcome, and thanks for testing this so quickly! I'm closing this bug as fixed ahead of 23.02.