Ticket 10067

Summary: request for information: submitting with incorrect partition/constraint pair
Product: Slurm
Reporter: Michael Hebenstreit <michael.hebenstreit>
Component: User Commands
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: cinek
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: Intel CRT

Description Michael Hebenstreit 2020-10-27 06:27:37 MDT
We have 20+ different node types, grouped by hardware and defined in 10 partitions. Sometimes users get exclusive access to certain nodes, and those nodes are removed from the standard queues. During that time, users cannot submit jobs to the standard queues for later execution on those nodes.

Example: submission to inteldevq fails because all nodes of that type are currently in idealq. In 24 hours the nodes will move back to inteldevq:

sbatch --exclusive   "-p" "inteldevq" "-N" "29" "-n" "2048" "--constraint=icx36cpq2" "-J" "FFTW_2k_29.%J.err" "-o" "FFTW_2k_29.%J.log" "-t" "60" "/panfs/users/dmishura/TI-fftwbench_ICX/run_impi.sh"
sbatch: error: Batch job submission failed: Requested node configuration is not available

sbatch --exclusive   "-p" "idealq" "-N" "29" "-n" "2048" "--constraint=icx36cpq2" "-J" "FFTW_2k_29.%J.err" "-o" "FFTW_2k_29.%J.log" "-t" "60" "/panfs/users/dmishura/TI-fftwbench_ICX/run_impi.sh"
Submitted batch job 2809

Is there a way to override such tests?
Comment 1 Marcin Stolarek 2020-10-27 07:08:00 MDT
>Sometimes users get exclusive access to certain nodes, those nodes are removed from the standard queues.

Have you considered creating an advanced reservation for those nodes for a specific time, instead of removing them from the standard queue [1]?

cheers,
Marcin

[1] https://slurm.schedmd.com/reservations.html
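
A minimal sketch of the reservation approach suggested in this comment. It only prints an scontrol command rather than running it. The helper name reservation_cmd, the reservation name, the user, and the node list are hypothetical placeholders, not values from this ticket. The assumption is that reserved nodes stay in their partition, so submissions to that partition would still pass the configuration check and wait until the reservation ends.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) an scontrol command that
# reserves a set of nodes for one user for a fixed duration.
# Every argument value used below is a placeholder.
reservation_cmd() {
    name=$1      # reservation name
    user=$2      # user who gets access
    nodes=$3     # node list expression
    start=$4     # start time, e.g. "now"
    duration=$5  # duration, e.g. minutes or days-hours:minutes:seconds
    printf 'scontrol create reservation ReservationName=%s Users=%s Nodes=%s StartTime=%s Duration=%s\n' \
        "$name" "$user" "$nodes" "$start" "$duration"
}

# Example: a one-day reservation on 29 placeholder nodes.
reservation_cmd icx_excl someuser 'icx[001-029]' now 1-00:00:00
```

The privileged user would then submit with --reservation=icx_excl, while other users keep submitting to the standard partition as usual.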
Comment 2 Michael Hebenstreit 2020-10-27 07:15:34 MDT
Our procedures are set via queues. Reservations might be a long-term option that we are considering, but for the moment we need an answer to that question. If the answer is “not possible”, our users will have to live with that, but they will be unhappy.

Comment 3 Marcin Stolarek 2020-10-27 08:23:34 MDT
Michael,

One "hack" that should work is to add dummy nodes (nonexistent nodes in the DOWN state) to the partition, with the same configuration as the real nodes, for as long as the real nodes are removed.

The drawback is that slurmctld will try to ping those nodes, resulting in misleading error messages, but if you set their IP addresses to something unavailable, everything should fail quickly without much impact on the controller.

At a glance I don't see any serious issues arising from this approach, but since it is rather nonstandard, let me know if it leads to unexpected behavior; we may be able to tune it further.

cheers,
Marcin
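
The dummy-node idea from this comment could be sketched as a small generator for the slurm.conf lines. Everything here is an assumption for illustration: the helper name dummy_node_conf, the dummy[...] node names, the 192.0.2.x addresses (a range reserved for documentation, chosen only as an example of "something not available"), the Reason text, and the CPU count. The node count and CPUs would have to mirror the real nodes closely enough for the job request (29 nodes and 2048 tasks in the example above) to pass the configuration check, and the exact parameter syntax should be checked against the slurm.conf man page for the installed version.

```shell
#!/bin/sh
# Hypothetical helper: print slurm.conf lines that define placeholder
# nodes in the DOWN state and put them into a partition.
# Names, addresses and CPU count are illustrative assumptions.
dummy_node_conf() {
    partition=$1  # partition that temporarily has no real nodes
    feature=$2    # feature users request via --constraint
    count=$3      # number of dummy nodes
    cpus=$4       # CPUs per dummy node; should match the real hardware
    last=$(printf '%03d' "$count")
    # One address per node, taken from an unreachable range.
    printf 'NodeName=dummy[001-%s] NodeAddr=192.0.2.[1-%s] CPUs=%s Feature=%s State=DOWN Reason=placeholder\n' \
        "$last" "$count" "$cpus" "$feature"
    printf 'PartitionName=%s Nodes=dummy[001-%s]\n' "$partition" "$last"
}

# Example: 29 placeholder nodes for the partition and constraint from this ticket.
dummy_node_conf inteldevq icx36cpq2 29 72
```

The printed lines are meant to be reviewed and merged into the existing partition definition by hand, and removed again once the real nodes return.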
Comment 4 Michael Hebenstreit 2020-10-27 08:40:54 MDT
Thanks, that’s what I thought too.
You can close the ticket.

Comment 5 Marcin Stolarek 2020-10-27 09:09:20 MDT
Marking as infogiven. Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin