Ticket 16259 - Constraint parenthesis within brackets not working as expected
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 21.08.8
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
Reported: 2023-03-13 13:04 MDT by Steve Ford
Modified: 2023-03-21 16:12 MDT
Site: MSU
Version Fixed: 23.02.1 23.11.0rc1


Attachments
Slurm Configuration (12.08 KB, application/x-compressed)
2023-03-13 13:04 MDT, Steve Ford

Description Steve Ford 2023-03-13 13:04:06 MDT
Hello SchedMD,

We are running into an issue when specifying certain combinations of constraints in our environment. We use the job_submit script to automatically append "&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]" to any user-specified constraint. This keeps jobs from running across different generations of nodes. For example, a user-specified constraint of "v100" would result in a constraint of "v100&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]". Since there are both nvf and intel18 nodes with the v100 feature, this constraint ensures that multi-node jobs are allocated to only one type of node instead of both.

The issue we are seeing is when a user specifies a constraint of intel18 and more than 40 CPUs. The job_submit script translates this constraint into "intel18&[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]". We don't have any nodes with the intel18 feature and more than 40 CPUs, so this request should fail, but it instead gets granted on nodes with the amr feature.

It appears Slurm is confused by the (amr|acm) nested within the brackets alongside the other OR constraints, or maybe we are misunderstanding the constraint syntax. What we want is for multi-node jobs to be limited to only one type of node feature, except for amr and acm, which are compatible. Should parentheses within brackets work this way?
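
The intended behavior described above can be sketched as a small model. This is only an illustration of the expected semantics, not Slurm's actual evaluation logic; the node names, feature sets, CPU counts, and the `eligible_groups` helper are all hypothetical.

```python
# Illustrative model only: this is NOT how Slurm evaluates constraints.
# Node names, feature sets, and CPU counts below are hypothetical.

# Each bracket alternative is a set of features treated as one compatible
# group; (amr|acm) forms a single group containing both features.
BRACKET_GROUPS = [
    {"intel14"}, {"intel16"}, {"intel18"}, {"amr", "acm"},
    {"nvf"}, {"nal"}, {"nif"},
]

NODES = {
    "node-a": {"features": {"intel18", "v100"}, "cpus": 40},
    "node-b": {"features": {"amr"}, "cpus": 128},
    "node-c": {"features": {"acm"}, "cpus": 128},
    "node-d": {"features": {"nvf", "v100"}, "cpus": 48},
}


def eligible_groups(user_feature, min_cpus):
    """Return one node list per bracket group that can serve the request.

    A node qualifies for a group only if it has the user's feature AND a
    feature from that group AND enough CPUs. A job must fit within a
    single returned list, never across lists.
    """
    result = []
    for group in BRACKET_GROUPS:
        names = sorted(
            name
            for name, node in NODES.items()
            if user_feature in node["features"]
            and node["features"] & group
            and node["cpus"] >= min_cpus
        )
        if names:
            result.append(names)
    return result


# intel18 with more than 40 CPUs: no node qualifies, so the request should fail.
print(eligible_groups("intel18", 41))
# v100: intel18 and nvf nodes qualify, but only as separate groups.
print(eligible_groups("v100", 1))
```

Under this model, the amr node is never a candidate for an intel18 request, because it lacks the intel18 feature required by the AND.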
Comment 1 Steve Ford 2023-03-13 13:04:38 MDT
Created attachment 29300
Slurm Configuration
Comment 2 Marshall Garey 2023-03-13 15:26:16 MDT
Hi Steve,

The parentheses inside of brackets are supposed to work as you expect them to.

I can reproduce this. The parenthesized expression inside the brackets is OR'ing its nodes with everything that came before it.

A quick solution is to *prepend* your job submission script's constraint instead of *append*. When your job submission script's constraint comes before the user's, the user's constraints are AND'd afterward.

Can you let me know if this fixes your issue?

I'm looking into a solution.
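
As a rough illustration of the workaround above, the difference between appending and prepending comes down to how the constraint string is assembled. The site's real job_submit script is not shown in this ticket, so the `add_site_constraint` helper and its handling of an empty user constraint are assumptions made for this sketch.

```python
# Hypothetical helper; the actual job_submit script is not part of this ticket.
SITE_CONSTRAINT = "[intel14|intel16|intel18|(amr|acm)|nvf|nal|nif]"


def add_site_constraint(user_constraint, prepend=True):
    """Combine the site's bracketed constraint with the user's constraint.

    prepend=True places the bracketed constraint first (the workaround);
    prepend=False reproduces the original append behavior.
    """
    if not user_constraint:
        # Assumption: with no user constraint, use the site constraint alone.
        return SITE_CONSTRAINT
    if prepend:
        return f"{SITE_CONSTRAINT}&{user_constraint}"
    return f"{user_constraint}&{SITE_CONSTRAINT}"


# Original (affected) ordering:
print(add_site_constraint("intel18", prepend=False))
# Workaround ordering:
print(add_site_constraint("intel18"))
```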
Comment 4 Steve Ford 2023-03-13 15:56:33 MDT
Hello Marshall,

Prepending the bracketed constraint instead of appending it does resolve this issue for us.

Thanks!

Steve
Comment 11 Marshall Garey 2023-03-21 16:12:48 MDT
Hi Steve,

We have fixed this in commits 1faa51ef69..dedd9f6fcc. They'll be part of 23.02.1. I'm closing this as fixed.