Summary: | job scheduled to wrong nodes with different constraint | ||
---|---|---|---|
Product: | Slurm | Reporter: | Michael Hebenstreit <michael.hebenstreit> |
Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | dwightman, fabecassis, lyeager |
Version: | 20.11.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: |
https://bugs.schedmd.com/show_bug.cgi?id=8019 https://bugs.schedmd.com/show_bug.cgi?id=10707 |
||
Site: | Intel CRT | Alineos Sites: | --- |
CLE Version: | | Version Fixed: | 21.08.2 |
Attachments: |
slurm.conf
slurmctl.log sched.log |
Description
Michael Hebenstreit
2021-08-17 11:50:27 MDT
Created attachment 20862 [details]
slurm.conf
Created attachment 20863 [details]
slurmctl.log
Created attachment 20864 [details]
sched.log
Hi

Could you send me the sbatch script and command line used to submit job 174615? Did this happen just once, or are you able to reproduce this issue?

Dominik

So far it was only detected once.

    sbatch --exclusive "-C" "[leaf113&reconfig]" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"

    $ cat RUN-amg.slurm
    #!/bin/bash -login
    #SBATCH --nodes=16
    #SBATCH --ntasks-per-node=64
    #SBATCH --threads-per-core=1
    #SBATCH -J amg
    #SBATCH --time=1:00:00
    #SBATCH --exclusive
    #SBATCH -d singleton
    ulimit -s unlimited
    module purge
    source /opt/intel/oneAPI/2021.3.0.3219/setvars.sh
    .....

Hi

[leaf113&reconfig] has incorrect syntax and slurmctld should block it at submission. Its behavior is undefined. I will let you know when this fix is in the repo.

sbatch man:
...
Multiple Counts
    Specific counts of multiple resources may be specified by using the AND operator and enclosing the options within square brackets. For example, --constraint="[rack1*2&rack2*4]" might be used to specify that two nodes must be allocated from nodes with the feature of "rack1" and four nodes must be allocated from nodes with the feature "rack2".
...

Dominik

But the point here is that the node must have both constraints. How should that be formulated correctly?

Without square brackets. From sbatch man:
...
AND
    If only nodes with all of specified features will be used. The ampersand is used for an AND operator. For example, --constraint="intel&gpu"
...

Oh, so the brackets should not have been there. User error then. The submission should be:

    sbatch --exclusive "-C" "leaf113&reconfig" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"

Correct?

Yes exactly.

FYI there are some more parsing quirks to look out for listed in bug#10707. I've added my site to the CC list because we have a vested interest in the way this field is parsed - see bug#12286 - and we want to track related changes.

*** Ticket 8019 has been marked as a duplicate of this ticket. ***

Hi

Those commits protect against incorrect syntax in constraint expressions:
https://github.com/SchedMD/slurm/commit/27370b018
https://github.com/SchedMD/slurm/commit/2aa867638
https://github.com/SchedMD/slurm/commit/98bdb3f4e
We also improved the documentation:
https://github.com/SchedMD/slurm/commit/1e5345842
The next 21.08 release will include those patches. Let me know if it is OK to close this ticket now.

Dominik

The problem was not an incorrect submit, but incorrect scheduling. How do those patches fit the error?

Hi

I thought that we agreed that this was a user error (comment 12). The correct syntax for such requests should not contain square brackets.

Dominik

If you are sure that's the only problem then close it.
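For sites running a release without the fix, the bracket rule from the man page excerpts quoted in this ticket could be approximated with a small wrapper-side check before calling sbatch. The following is only a sketch: `check_constraint` is a hypothetical helper, not part of Slurm and not the logic of the commits linked above. It assumes the whole expression is one bracketed group, and it only flags the case seen here, an AND list inside square brackets where a term has no `*<count>`.

```shell
#!/bin/sh
# Hypothetical pre-submission check (not Slurm code).
# Returns 0 if the constraint passes this narrow check, 1 if it looks like
# "[a&b]" without per-feature node counts, the form used in this ticket.
check_constraint() {
    cc_expr=$1

    # Only expressions fully wrapped in square brackets are examined.
    case $cc_expr in
        \[*\]) ;;
        *) return 0 ;;
    esac
    cc_inner=${cc_expr#\[}
    cc_inner=${cc_inner%\]}

    # Brackets without '&' (for example "[rack1|rack2]") are left alone.
    case $cc_inner in
        *\&*) ;;
        *) return 0 ;;
    esac

    # Split on '&' with globbing disabled so '*' in "rack1*2" is not expanded.
    cc_old_ifs=$IFS
    IFS='&'
    set -f
    set -- $cc_inner
    set +f
    IFS=$cc_old_ifs

    # Every term must look like "<feature>*<count>".
    for cc_term in "$@"; do
        case $cc_term in
            ?*\**) cc_count=${cc_term##*\*} ;;
            *) cc_count= ;;
        esac
        case $cc_count in
            ''|*[!0-9]*)
                printf '%s\n' "constraint '$cc_expr': term '$cc_term' has no node count; drop the brackets for a plain AND" >&2
                return 1
                ;;
        esac
    done
    return 0
}
```

With this helper, `check_constraint '[leaf113&reconfig]'` fails with a message, while `check_constraint 'leaf113&reconfig'` and `check_constraint '[rack1*2&rack2*4]'` succeed. It does not attempt to cover the other parsing quirks mentioned in bug#10707.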