we just found a new error - a job that was scheduled to completely wrong nodes. The requested constraint was leaf113, but the job was scheduled to leaf213 nodes (bhosts is a script calling sinfo with a different format to show node constraints):

[root@eslurm1 crtdc]# bhosts -p idealq -n eii[217-232]
NODELIST     NODES CPUS(A/I/O/T) PARTITION STATE AVAIL_FEATURES
eii[219-232] 14    1008/0/0/1008 idealq    alloc reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eii[217-218] 2     0/144/0/144   idealq    idle  reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# sacct -j 174615 --format Constraints,NodeList
        Constraints        NodeList
------------------- ---------------
 [leaf113&reconfig]    eii[217-232]
                             eii217
                       eii[218-232]
                       eii[218-232]
                       eii[218-232]

[root@eslurm1 crtdc]# grep leaf213 slurm.conf
NodeName=eii[217-234] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep leaf113 slurm.conf
NodeName=eia[073-077] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[078-080] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[081-084] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[085-093] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/slurmctl.log
[2021-08-16T23:15:42.523] _slurm_rpc_submit_batch_job: JobId=174615 InitPrio=4294726966 usec=762
[2021-08-16T23:15:45.809] sched: Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq
[2021-08-16T23:24:10.821] JobId=174615 boot complete for all 16 nodes
[2021-08-16T23:24:10.821] prolog_running_decr: Configuration for JobId=174615 is complete
[2021-08-16T23:24:49.968] _job_complete: JobId=174615 WEXITSTATUS 1
[2021-08-16T23:24:49.970] _job_complete: JobId=174615 done
[2021-08-16T23:26:03.364] cleanup_completing: JobId=174615 completion process took 74 seconds

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/sched.log
sched: [2021-08-16T23:15:42.523] JobId=174615 allocated resources: NodeList=(null)
sched: [2021-08-16T23:15:45.809] JobId=174615 initiated
sched: [2021-08-16T23:15:45.809] Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq
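For reference, a minimal sketch of what the bhosts wrapper might look like (a hypothetical reconstruction - the actual site script was not posted; the sinfo format string is chosen to match the column headers above):

#!/bin/bash
# hypothetical bhosts wrapper: pass all arguments through to sinfo and
# print node list, node count, CPU state (A/I/O/T), partition, state, features
exec sinfo "$@" -o "%N %D %C %P %t %f"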
Created attachment 20862 [details] slurm.conf
Created attachment 20863 [details] slurmctl.log
Created attachment 20864 [details] sched.log
Hi,

Could you send me the sbatch script and the command line used to submit job 174615? Did this happen just once, or are you able to reproduce this issue?

Dominik
So far it was only detected once.
sbatch --exclusive "-C" "[leaf113&reconfig]" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"

$ cat RUN-amg.slurm
#!/bin/bash -login
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=64
#SBATCH --threads-per-core=1
#SBATCH -J amg
#SBATCH --time=1:00:00
#SBATCH --exclusive
#SBATCH -d singleton

ulimit -s unlimited

module purge
source /opt/intel/oneAPI/2021.3.0.3219/setvars.sh
.....
Hi,

[leaf113&reconfig] has incorrect syntax, and slurmctld should block it at submission; its behavior is undefined. I will let you know when the fix is in the repo.

From the sbatch man page:

...
Multiple Counts
    Specific counts of multiple resources may be specified by using the AND operator and enclosing the options within square brackets. For example, --constraint="[rack1*2&rack2*4]" might be used to specify that two nodes must be allocated from nodes with the feature of "rack1" and four nodes must be allocated from nodes with the feature "rack2".
...

Dominik
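For contrast, a minimal sketch of the two forms side by side (the rack example is from the man page excerpt above; the leaf113/reconfig names are from this ticket):

# Multiple counts: brackets plus "*" multipliers give each feature its own node count
sbatch --constraint="[rack1*2&rack2*4]" ...

# Plain AND: no brackets; every allocated node must have ALL listed features
sbatch --constraint="leaf113&reconfig" ...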
But the point here is that each node must have both features - how should that be formulated correctly?
Without square brackets. From the sbatch man page:

...
AND
    If only nodes with all of specified features will be used. The ampersand is used for an AND operator. For example, --constraint="intel&gpu".
...
Oh, so the brackets should not have been there - user error then. The submission should be:

sbatch --exclusive "-C" "leaf113&reconfig" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"

Correct?
Yes, exactly.
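A quick way to double-check the corrected resubmission, reusing the commands from the diagnosis above (the job id is a placeholder; the node range comes from the leaf113 entries in slurm.conf):

# confirm the recorded constraint and the allocated nodes for the new job
sacct -j <jobid> --format Constraints,NodeList

# confirm which nodes actually carry the leaf113 feature
sinfo -n eia[073-093] -o "%N %f"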
FYI, there are some more parsing quirks to look out for, listed in bug#10707. I've added my site to the CC list because we have a vested interest in how this field is parsed - see bug#12286 - and we want to track related changes.
*** Ticket 8019 has been marked as a duplicate of this ticket. ***
Hi,

These commits protect against incorrect syntax in constraint expressions:

https://github.com/SchedMD/slurm/commit/27370b018
https://github.com/SchedMD/slurm/commit/2aa867638
https://github.com/SchedMD/slurm/commit/98bdb3f4e

We also improved the documentation:

https://github.com/SchedMD/slurm/commit/1e5345842

The next 21.08 release will include those patches. Let me know if it is OK to close this ticket now.

Dominik
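With those patches applied, the original submission should now fail at submit time instead of being scheduled with undefined behavior (expected behavior only - the exact error text depends on the patched release):

# should now be rejected by slurmctld at submission
sbatch --exclusive "-C" "[leaf113&reconfig]" "-t" "60" "-N" "16" "-n" "1024" "-p" "idealq" "RUN-amg.slurm"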
The problem was not an incorrect submission, but incorrect scheduling. How do those patches address the error?
Hi,

I thought we agreed that this was a user error (comment 12). The correct syntax for such requests should not contain square brackets.

Dominik
If you are sure that's the only problem, then close it.