We have a number of jobs in the queue for the partition "xeon24" which seem to be blocked by some drained (defective) nodes in a different partition "xeon8". For example this job:

# scontrol show job 448648
JobId=448648 JobName=BaBr2-MoS2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=73542 Nice=-53967 Account=camdvip QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:a137,d[031-032] Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=Mon 16:42:49 EligibleTime=Mon 16:42:49
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon24 AllocNode:Sid=sylg:29844
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   ...

This is really strange, since the job would never be scheduled on the unavailable nodes. Either this is a bug, or the Reason=ReqNodeNotAvail,_UnavailableNodes information is incorrect. This looks like the same issue as in bug 3058, which apparently was never resolved, and is now coming up again.
Created attachment 6406 [details] slurm.conf
An observation: as some nodes are drained or resumed, the UnavailableNodes list changes accordingly. Even when the list of drained nodes is empty, UnavailableNodes is still present, but points to an empty node list. At this time two nodes in the partition xeon16 (which the job could never get scheduled to) seem (incorrectly) to block the job:

# scontrol show job 448648
JobId=448648 JobName=BaBr2-MoS2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=191467 Nice=8533 Account=camdvip QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:g[018,026] Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=Mon 16:42:49 EligibleTime=Mon 16:42:49
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon24 AllocNode:Sid=sylg:29844
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=960,node=1
Here's the current job status, with an empty UnavailableNodes list:

# scontrol show job 448648
JobId=448648 JobName=BaBr2-MoS2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=191467 Nice=8533 Account=camdvip QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   ...
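For anyone scripting around this, the flat key=value output of "scontrol show job" can be parsed into a dictionary to inspect the Reason field programmatically. A minimal sketch (the parse_scontrol_job helper is my own, not part of Slurm, and splits on whitespace, so it loses the tail of values that themselves contain spaces, such as SubmitTime):

```python
def parse_scontrol_job(output):
    """Split 'scontrol show job' output into a dict of FIELD -> value.

    Limitation: tokens are split on whitespace, so a value containing
    a space (e.g. "SubmitTime=Mon 16:42:49") keeps only the part
    before the first space.
    """
    fields = {}
    for token in output.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

example = (
    "JobId=448648 JobName=BaBr2-MoS2/nm/ JobState=PENDING "
    "Reason=ReqNodeNotAvail,_UnavailableNodes:a137,d[031-032] "
    "Partition=xeon24"
)
job = parse_scontrol_job(example)
print(job["Reason"])  # ReqNodeNotAvail,_UnavailableNodes:a137,d[031-032]
```

This is handy for watching whether the Reason of a pending job changes after a node is drained or resumed.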
The queue behavior is really odd! We have had no drained/offline nodes for several days, yet jobs got blocked as described. This morning the blocked jobs became unblocked for unknown reasons (no configuration changes at all) and look normal:

# squeue -p xeon24
 JOBID PARTITION     NAME   USER  ACCOUNT   STATE PRIORITY  TIME  TIME_LIMI NODES CPUS NODELIST(REASON)
454921    xeon24   sbatch tdeilm  camdvip PENDING   219983  0:00 2-02:00:00     4   96 (Resources)
453159    xeon24 Tl2Te2-G tdeilm  camdvip PENDING   219767  0:00 2-00:00:00    21  504 (Priority)
453158    xeon24 SrI2-MoS tdeilm  camdvip PENDING   219561  0:00 2-00:00:00    21  504 (Priority)
453157    xeon24 SrI2-CdI tdeilm  camdvip PENDING   219109  0:00 2-00:00:00    21  504 (Priority)
...

As a test, I drained node a137 in a different partition, and after some minutes this caused the jobs to become blocked again:

 JOBID PARTITION     NAME   USER  ACCOUNT   STATE PRIORITY  TIME  TIME_LIMI NODES CPUS NODELIST(REASON)
454921    xeon24   sbatch tdeilm  camdvip PENDING   220087  0:00 2-02:00:00     4   96 (ReqNodeNotAvail, UnavailableNodes:a137)
453159    xeon24 Tl2Te2-G tdeilm  camdvip PENDING   219871  0:00 2-00:00:00    21  504 (ReqNodeNotAvail, UnavailableNodes:a137)
453158    xeon24 SrI2-MoS tdeilm  camdvip PENDING   219666  0:00 2-00:00:00    21  504 (ReqNodeNotAvail, UnavailableNodes:a137)
453157    xeon24 SrI2-CdI tdeilm  camdvip PENDING   219213  0:00 2-00:00:00    21  504 (ReqNodeNotAvail, UnavailableNodes:a137)

Then I resumed node a137, and the UnavailableNodes list became empty:

 JOBID PARTITION     NAME   USER  ACCOUNT   STATE PRIORITY  TIME  TIME_LIMI NODES CPUS NODELIST(REASON)
454921    xeon24   sbatch tdeilm  camdvip PENDING   220108  0:00 2-02:00:00     4   96 (ReqNodeNotAvail, UnavailableNodes:)
453159    xeon24 Tl2Te2-G tdeilm  camdvip PENDING   219892  0:00 2-00:00:00    21  504 (ReqNodeNotAvail, UnavailableNodes:)
453158    xeon24 SrI2-MoS tdeilm  camdvip PENDING   219686  0:00 2-00:00:00    21  504 (ReqNodeNotAvail, UnavailableNodes:)
453157    xeon24 SrI2-CdI tdeilm  camdvip PENDING   219234  0:00 2-00:00:00    21  504 (ReqNodeNotAvail, UnavailableNodes:)
Somehow the scheduler seems to block jobs for incorrect reasons, and then never to unblock them once the reason vanishes. This is very odd indeed!
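To see exactly which nodes a compact UnavailableNodes value such as "a137,d[031-032]" refers to, "scontrol show hostnames" can expand the list on a live system; the same expansion can be sketched in Python for offline log analysis. This is my own simplified helper, not Slurm code, and it only handles the simple "prefix[ranges]" forms seen in this ticket, not Slurm's full hostlist syntax:

```python
import re

def expand_hostlist(hostlist):
    """Expand a compact node list like 'a137,d[031-032]' into
    individual node names. Simplified sketch: handles plain names and
    one bracketed range group per entry, preserving zero-padding.
    """
    nodes = []
    # Split on commas that are not inside brackets.
    for part in re.findall(r'[^,\[\]]+(?:\[[^\]]*\])?', hostlist):
        m = re.match(r'^([^\[]+)\[([^\]]+)\]$', part)
        if not m:
            nodes.append(part)
            continue
        prefix, ranges = m.groups()
        for rng in ranges.split(','):
            if '-' in rng:
                lo, hi = rng.split('-')
                width = len(lo)  # keep zero-padding, e.g. '031'
                nodes.extend(f"{prefix}{i:0{width}d}"
                             for i in range(int(lo), int(hi) + 1))
            else:
                nodes.append(prefix + rng)
    return nodes

print(expand_hostlist("a137,d[031-032]"))  # ['a137', 'd031', 'd032']
print(expand_hostlist("g[018,026]"))       # ['g018', 'g026']
```

An empty UnavailableNodes value, as seen above, simply expands to an empty list, which underlines how strange it is that the jobs stay blocked.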
Hi

Have these jobs requested any features?
What memory/MemoryPerCPU specification do these jobs have?
How many jobs are waiting in the xeon24_512 partition?
Could you send me the specification of Nodes=x[001-192]?

Dominik
(In reply to Dominik Bartkiewicz from comment #5)

> Have these jobs requested any features?

No. The jobs are submitted with these flags:
--partition=xeon24 -n 504 --exclusive=user --time=48:00:00 --mem=0

> What memory/MemoryPerCPU specification do these jobs have?

--mem=0

> How many jobs are waiting in the xeon24_512 partition?

# squeue -p xeon24_512
 JOBID PARTITION     NAME    USER  ACCOUNT   STATE PRIORITY  TIME  TIME_LIMI NODES CPUS NODELIST(REASON)
453271 xeon24_51 Cl_H2O_b    kasv   ecsvip PENDING    56963  0:00 1-00:00:00     4   96 (Resources)
455024 xeon24_51       d2 linjelv  camdvip PENDING    55501  0:00 2-02:00:00     3   72 (Priority)
454975 xeon24_51 Cl_H2O_b    kasv   ecsvip PENDING    39465  0:00 1-00:00:00     4   96 (Priority)
455302 xeon24_51 clean_be    kasv   ecsvip PENDING    39041  0:00    6:00:00     4   96 (Priority)
455411 xeon24_51 double_t    kasv   ecsvip PENDING    38904  0:00 2-00:00:00     4   96 (Priority)
455412 xeon24_51 double_1    kasv   ecsvip PENDING    38897  0:00 2-00:00:00     4   96 (Priority)
455415 xeon24_51 double_1    kasv   ecsvip PENDING    38882  0:00 2-00:00:00     4   96 (Priority)

> Could you send me the specification of Nodes=x[001-192]?

This data is already in slurm.conf; do you need further details? Briefly, these are 24-core Broadwell servers with 256 GB of RAM (nodes x[169-180] have 512 GB). The nodes have an Intel Omni-Path 100 Gbit fabric.

Thanks,
Ole
Hi I can reproduce this and I will let you know when we fix this. Dominik
(In reply to Dominik Bartkiewicz from comment #7) > I can reproduce this and I will let you know when we fix this. Thanks Dominik, that was really fast! If you can suggest any workaround for the jobs or slurm.conf to avoid the issue, this would be really helpful to us! /Ole
(In reply to Dominik Bartkiewicz from comment #7) > I can reproduce this and I will let you know when we fix this. Could you kindly update the bug Status: UNCONFIRMED appropriately? Maybe you could update the old bug 3973 as well? Thanks, Ole
Hi

The cause of the problem is that slurmctld can set a wrong reason for jobs with "--exclusive=user". As far as I understand, this state isn't permanent and doesn't disturb scheduling.

Dominik
Hi

These two commits should solve this issue:
https://github.com/SchedMD/slurm/commit/e2a14b8d7f4f
https://github.com/SchedMD/slurm/commit/fc4e5ac9e056

They will be included in 17.11.6 and above. I am closing this ticket as resolved.

Dominik
*** Ticket 5198 has been marked as a duplicate of this ticket. ***