| Summary: | job scheduled to wrong nodes with different constraint | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Hebenstreit <michael.hebenstreit> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | dwightman, fabecassis, lyeager |
| Version: | 20.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8019, https://bugs.schedmd.com/show_bug.cgi?id=10707 | | |
| Site: | Intel CRT | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 21.08.2 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm.conf, slurmctl.log, sched.log | | |
Created attachment 20862 [details]
slurm.conf
Created attachment 20863 [details]
slurmctl.log
Created attachment 20864 [details]
sched.log
Hi

Could you send me the sbatch script and the command line used to submit job 174615? Did this happen just once, or are you able to reproduce the issue?

Dominik

So far it was only detected once.

sbatch --exclusive "-C" "[leaf113&reconfig]" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"

$ cat RUN-amg.slurm
#!/bin/bash -login
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=64
#SBATCH --threads-per-core=1
#SBATCH -J amg
#SBATCH --time=1:00:00
#SBATCH --exclusive
#SBATCH -d singleton

ulimit -s unlimited
module purge
source /opt/intel/oneAPI/2021.3.0.3219/setvars.sh
.....
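As an aside, the constraint string the controller actually stored for a job can be checked both while the job is still queued or running and afterwards from accounting; a minimal sketch, using the job ID from this ticket (commands only, no output is implied):

# While the job is still known to slurmctld, the stored constraint
# appears in the Features= field of the job record:
scontrol show job 174615 | grep -i features

# After the job leaves the queue, the same information is kept in accounting:
sacct -j 174615 --format=Constraints,NodeList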
Hi

[leaf113&reconfig] has incorrect syntax and slurmctld should block it at submission; right now its behavior is undefined. I will let you know when the fix is in the repo.

From the sbatch man page:
...
Multiple Counts
Specific counts of multiple resources may be specified by using the AND
operator and enclosing the options within square brackets. For example,
--constraint="[rack1*2&rack2*4]" might be used to specify that two nodes
must be allocated from nodes with the feature of "rack1" and four nodes must
be allocated from nodes with the feature "rack2".
...
Dominik
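To make the distinction concrete, a short sketch of the two constraint forms (the feature names come from the man page example above; the script name job.sh is just a placeholder):

# Bracketed "multiple counts" form: 2 nodes carrying feature rack1 plus
# 4 nodes carrying feature rack2; no single node needs both features.
sbatch -N 6 --constraint="[rack1*2&rack2*4]" job.sh

# Plain AND form: every allocated node must carry both features.
sbatch -N 6 --constraint="rack1&rack2" job.sh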
But the point here is that the node must have both constraints - how should that be formulated correctly?

Without square brackets.
From sbatch man:
...
AND If only nodes with all of specified features will be used. The ampersand is
used for an AND operator. For example, --constraint="intel&gpu"
...
Oh, so the brackets should not have been there. User error then. The submission should be:

sbatch --exclusive "-C" "leaf113&reconfig" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"

correct?

Yes, exactly.

FYI, there are some more parsing quirks to look out for, listed in bug#10707. I've added my site to the CC list because we have a vested interest in the way this field is parsed - see bug#12286 - and we want to track related changes.

*** Ticket 8019 has been marked as a duplicate of this ticket. ***

Hi

These commits protect against incorrect syntax in constraint expressions:
https://github.com/SchedMD/slurm/commit/27370b018
https://github.com/SchedMD/slurm/commit/2aa867638
https://github.com/SchedMD/slurm/commit/98bdb3f4e

We also improved the documentation:
https://github.com/SchedMD/slurm/commit/1e5345842

The next 21.08 release will include those patches. Let me know if it is OK to close this ticket now.

Dominik

The problem was not an incorrect submission, but incorrect scheduling. How do those patches address the error?

Hi

I thought we agreed that this was a user error (comment 12): the correct syntax for such a request should not contain square brackets.

Dominik

If you are sure that's the only problem, then close it.
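For completeness, one hedged way to sanity-check a corrected constraint before resubmitting is sbatch's --test-only option, which validates the script and options against the current configuration and queue without actually submitting a job (whether leaf113 and reconfig coexist on enough nodes is of course site-specific):

# --test-only: validate the request and return an estimate of when the job
# would run, without queueing anything.
sbatch --test-only --exclusive -C "leaf113&reconfig" -t 60 -N 16 -n 1024 -p idealq RUN-amg.slurm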
We just found a new error - a job that was scheduled to completely wrong nodes. The requested constraint was leaf113, but the job was scheduled to leaf213 nodes (bhosts is a script calling sinfo with a different format to show node constraints):

[root@eslurm1 crtdc]# bhosts -p idealq -n eii[217-232]
NODELIST NODES CPUS(A/I/O/T) PARTITION STATE AVAIL_FEATURES
eii[219-232] 14 1008/0/0/1008 idealq alloc reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eii[217-218] 2 0/144/0/144 idealq idle reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# sacct -j 174615 --format Constraints,NodeList
Constraints NodeList
------------------- ---------------
[leaf113&reconfig] eii[217-232]
eii217
eii[218-232]
eii[218-232]
eii[218-232]

[root@eslurm1 crtdc]# grep leaf213 slurm.conf
NodeName=eii[217-234] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep leaf113 slurm.conf
NodeName=eia[073-077] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[078-080] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[081-084] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[085-093] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/slurmctl.log
[2021-08-16T23:15:42.523] _slurm_rpc_submit_batch_job: JobId=174615 InitPrio=4294726966 usec=762
[2021-08-16T23:15:45.809] sched: Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq
[2021-08-16T23:24:10.821] JobId=174615 boot complete for all 16 nodes
[2021-08-16T23:24:10.821] prolog_running_decr: Configuration for JobId=174615 is complete
[2021-08-16T23:24:49.968] _job_complete: JobId=174615 WEXITSTATUS 1
[2021-08-16T23:24:49.970] _job_complete: JobId=174615 done
[2021-08-16T23:26:03.364] cleanup_completing: JobId=174615 completion process took 74 seconds

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/sched.log
sched: [2021-08-16T23:15:42.523] JobId=174615 allocated resources: NodeList=(null)
sched: [2021-08-16T23:15:45.809] JobId=174615 initiated
sched: [2021-08-16T23:15:45.809] Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq
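The bhosts wrapper itself is not attached to this ticket, so the following is only a hedged reconstruction: a thin shell wrapper around sinfo whose format string yields the same columns as the output quoted above (NODELIST, NODES, CPUS(A/I/O/T), PARTITION, STATE, AVAIL_FEATURES).

#!/bin/bash
# Hypothetical sketch of a bhosts-style wrapper: plain sinfo with a format
# string that adds the AVAIL_FEATURES column; extra options such as
# "-p idealq -n eii[217-232]" are passed straight through.
exec sinfo -o "%N %D %C %P %t %f" "$@"

With this format string sinfo groups nodes that share the same partition, state, and feature set, which is what makes a mismatch like leaf213 appearing on the allocated nodes easy to spot.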