Hello SchedMD,

We are running into an issue when specifying certain combinations of constraints in our environment. We use the job_submit script to automatically append "&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]" to any user-specified constraint. This keeps jobs from running across different generations of nodes. For example, a user-specified constraint of "v100" would result in a constraint of "v100&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]". Since there are both nvf and intel18 nodes with the v100 feature, this constraint ensures that multi-node jobs are allocated to only one type of node instead of both.

The issue we are seeing occurs when a user specifies a constraint of intel18 and more than 40 CPUs. The job_submit script translates this constraint into "intel18&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]". We don't have any nodes with both the intel18 feature and more than 40 CPUs, so this request should fail, but it is instead granted on nodes with the amr feature.

It appears Slurm is confused by the (amr|acm) group nested within the brackets alongside the other OR constraints. Or maybe we are misunderstanding the constraint syntax. What we want is for multi-node jobs to be limited to a single node type, except for amr and acm, which are compatible with each other. Should parentheses within brackets work this way?
Created attachment 29300: Slurm Configuration
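The behavior requested above can be written down as a small check. This is only a minimal sketch of the intended semantics as described in the report, not Slurm's implementation; the names `GROUPS` and `allocation_ok` are hypothetical, and nodes are modeled simply as sets of feature names.

```python
# Hypothetical model of the *intended* semantics of
#   <required>&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]
# This is not how Slurm evaluates constraints internally.

# Each bracket alternative is a set of features that may be mixed in one job.
GROUPS = [
    {"intel14"},
    {"intel16"},
    {"intel18"},
    {"amr", "acm"},  # amr and acm are compatible, so they form one group
    {"nvf"},
    {"nal"},
    {"nif"},
]


def allocation_ok(nodes, required, groups=GROUPS):
    """Return True if an allocation satisfies the intended constraint.

    nodes    -- list of feature sets, one set per allocated node
    required -- the single feature requested by the user (e.g. "v100")
    """
    # Every allocated node must carry the user's requested feature.
    if not all(required in feats for feats in nodes):
        return False
    # All nodes must also fall within one and the same bracket alternative.
    return any(all(feats & group for feats in nodes) for group in groups)
```

Under this model, an "intel18" request can never be satisfied by a node that has only the amr feature, which is the outcome the report expected.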
Hi Steve, The parentheses inside of brackets are supposed to work as you expect them to. I can reproduce this. The parenthesized group inside the brackets is OR'ing its nodes with everything that came before it, which is why your intel18 request was granted on amr nodes. A quick workaround is to *prepend* your job submission script's constraint instead of *appending* it. With your job submission script's constraint placed before the user's, the user's constraints are AND'd afterward. Can you let me know if this fixes your issue? I'm looking into a solution.
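The difference between the two orderings comes down to how the final constraint string is assembled. The sketch below only illustrates that string construction; the function names are hypothetical, the site's actual job_submit script is not shown in this ticket, and returning the bare bracketed expression when the user gives no constraint is an assumption.

```python
# Bracketed node-type expression quoted in the report.
NODE_TYPES = "[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]"


def append_constraint(user_constraint):
    """Original behavior: user constraint first, node types appended."""
    if not user_constraint:
        return NODE_TYPES  # assumption: no user constraint given
    return f"{user_constraint}&{NODE_TYPES}"


def prepend_constraint(user_constraint):
    """Workaround: node types first, user constraint AND'd afterward."""
    if not user_constraint:
        return NODE_TYPES  # assumption: no user constraint given
    return f"{NODE_TYPES}&{user_constraint}"


print(append_constraint("intel18"))
# intel18&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]
print(prepend_constraint("intel18"))
# [intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]&intel18
```

With the prepended form, nothing precedes the parenthesized group that it could be OR'd with, so the user's feature is applied on top of the bracketed expression.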
Hello Marshall, Prepending the bracketed constraint instead of appending it does resolve this issue for us. Thanks! Steve
Hi Steve, We have fixed this in commits 1faa51ef69..dedd9f6fcc. They'll be part of 23.02.1. I'm closing this as fixed.