Ticket 4932

Summary: JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Scheduling Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: sergey_meirovich
Version: 17.02.9   
Hardware: Linux   
OS: Linux   
Site: DTU Physics Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 17.11.6 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description Ole.H.Nielsen@fysik.dtu.dk 2018-03-15 02:15:02 MDT
We have a number of jobs in the queue for a partition "xeon24" which seem to be blocked by some drained (defective) nodes in a different partition "xeon8".  For example this job:

# scontrol show job 448648
JobId=448648 JobName=BaBr2-MoS2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=73542 Nice=-53967 Account=camdvip QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:a137,d[031-032] Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=Mon 16:42:49 EligibleTime=Mon 16:42:49
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon24 AllocNode:Sid=sylg:29844
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   ...

This is really strange, since the job could never be scheduled on the unavailable nodes.  Either this is a bug, or the Reason=ReqNodeNotAvail,_UnavailableNodes information is incorrect.  This looks like the same issue as in bug 3058, which apparently was never resolved and is now coming up again.
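For readers triaging similar reports: the unavailable-node list is embedded in the Reason string itself. A minimal sketch (plain POSIX shell, no Slurm installation needed) of extracting it from a Reason value like the one in the scontrol output above:

```shell
# Reason string copied from the scontrol output above
reason='ReqNodeNotAvail,_UnavailableNodes:a137,d[031-032]'
# Strip everything up to and including "UnavailableNodes:"
nodes=${reason#*UnavailableNodes:}
echo "$nodes"   # → a137,d[031-032]
```

An empty result here corresponds to the "UnavailableNodes:" with no node list seen later in this ticket.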
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2018-03-16 05:17:19 MDT
Created attachment 6406 [details]
slurm.conf
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2018-03-16 05:21:19 MDT
An observation: as nodes are drained or resumed, the UnavailableNodes list changes accordingly.  Even when no nodes are drained, the UnavailableNodes field is still present, but points to an empty node list.

At this time, 2 nodes in partition xeon16 (which the job could never be scheduled to) seem (incorrectly) to block the job:

# scontrol show job 448648
JobId=448648 JobName=BaBr2-MoS2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=191467 Nice=8533 Account=camdvip QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:g[018,026] Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=Mon 16:42:49 EligibleTime=Mon 16:42:49
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xeon24 AllocNode:Sid=sylg:29844
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=40 NumCPUs=960 NumTasks=960 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=960,node=1
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2018-03-16 06:48:19 MDT
Here's a current job status with an empty list of UnavailableNodes:

# scontrol show job 448648
JobId=448648 JobName=BaBr2-MoS2/nm/
   UserId=tdeilm(221341) GroupId=camdvip(1250) MCS_label=N/A
   Priority=191467 Nice=8533 Account=camdvip QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   ...
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2018-03-19 02:51:34 MDT
The queue behavior is really odd!  We have had no drained/offline nodes for several days, yet jobs got blocked as described.

This morning the blocked jobs became unblocked for unknown reasons (no configuration changes at all) and look normal:

# squeue -p xeon24
            JOBID PARTITION     NAME     USER  ACCOUNT      STATE PRIORITY       TIME TIME_LIMI  NODES   CPUS NODELIST(REASON)
            454921    xeon24   sbatch   tdeilm  camdvip    PENDING 219983       0:00 2-02:00:00      4     96 (Resources)
            453159    xeon24 Tl2Te2-G   tdeilm  camdvip    PENDING 219767       0:00 2-00:00:00     21    504 (Priority)
            453158    xeon24 SrI2-MoS   tdeilm  camdvip    PENDING 219561       0:00 2-00:00:00     21    504 (Priority)
            453157    xeon24 SrI2-CdI   tdeilm  camdvip    PENDING 219109       0:00 2-00:00:00     21    504 (Priority) 
...

As a test, I drained node a137 in a different partition, and after some minutes this caused the jobs to become blocked again:

            JOBID PARTITION     NAME     USER  ACCOUNT      STATE  PRIORITY       TIME TIME_LIMI  NODES   CPUS NODELIST(REASON)
            454921    xeon24   sbatch   tdeilm  camdvip    PENDING    220087       0:00 2-02:00:00      4     96 (ReqNodeNotAvail, UnavailableNodes:a137)
            453159    xeon24 Tl2Te2-G   tdeilm  camdvip    PENDING    219871       0:00 2-00:00:00     21    504 (ReqNodeNotAvail, UnavailableNodes:a137)
            453158    xeon24 SrI2-MoS   tdeilm  camdvip    PENDING    219666       0:00 2-00:00:00     21    504 (ReqNodeNotAvail, UnavailableNodes:a137)
            453157    xeon24 SrI2-CdI   tdeilm  camdvip    PENDING    219213       0:00 2-00:00:00     21    504 (ReqNodeNotAvail, UnavailableNodes:a137)

Then I resumed node a137, and the UnavailableNodes list became empty:

             JOBID PARTITION     NAME     USER  ACCOUNT      STATE  PRIORITY       TIME TIME_LIMI  NODES   CPUS NODELIST(REASON)
            454921    xeon24   sbatch   tdeilm  camdvip    PENDING    220108       0:00 2-02:00:00      4     96 (ReqNodeNotAvail, UnavailableNodes:)
            453159    xeon24 Tl2Te2-G   tdeilm  camdvip    PENDING    219892       0:00 2-00:00:00     21    504 (ReqNodeNotAvail, UnavailableNodes:)
            453158    xeon24 SrI2-MoS   tdeilm  camdvip    PENDING    219686       0:00 2-00:00:00     21    504 (ReqNodeNotAvail, UnavailableNodes:)
            453157    xeon24 SrI2-CdI   tdeilm  camdvip    PENDING    219234       0:00 2-00:00:00     21    504 (ReqNodeNotAvail, UnavailableNodes:)

Somehow the scheduler seems to block jobs for incorrect reasons, and then never unblocks them when the reason goes away.  This is very odd indeed!
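For reference, the drain/resume cycle used in the test above boils down to two scontrol commands. The sketch below echoes each command instead of executing it, so it can be run without admin rights on a Slurm cluster; the node name a137 and the Reason text are this site's/illustrative:

```shell
# Dry-run wrapper: print each command instead of executing it.
# On a real cluster, change the function body to: "$@"
run() { echo "would run: $*"; }

# Drain a node in another partition (a Reason is required when draining) ...
run scontrol update NodeName=a137 State=DRAIN Reason="ticket 4932 test"
# ... watch the pending jobs pick up ReqNodeNotAvail ...
run squeue -p xeon24 -t PENDING
# ... then return the node to service.
run scontrol update NodeName=a137 State=RESUME
```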
Comment 5 Dominik Bartkiewicz 2018-03-19 06:18:55 MDT
Hi

Have these jobs requested any features?
What memory/MemoryPerCPU specification do these jobs have?
How many jobs are waiting in the xeon24_512 partition?
Could you send me the specification of Nodes=x[001-192]?

Dominik
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2018-03-19 07:11:28 MDT
(In reply to Dominik Bartkiewicz from comment #5)
> Have these jobs requested any features?

No.  The jobs are submitted with these flags:
--partition=xeon24 -n 504 --exclusive=user --time=48:00:00 --mem=0 
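Expressed as a batch script, those flags look as follows (the payload line is hypothetical; #SBATCH lines are comments to the shell, so the file itself is plain bash):

```shell
#!/bin/bash
#SBATCH --partition=xeon24
#SBATCH -n 504
#SBATCH --exclusive=user      # nodes shared only with this user's own jobs
#SBATCH --time=48:00:00
#SBATCH --mem=0               # --mem=0 requests all memory on each node
echo "application command goes here"   # hypothetical payload
```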

> What memory/MemoryPerCPU specification have these jobs?

--mem=0 

> How many jobs wait in xeon24_512 partition?

# squeue  -p xeon24_512
             JOBID PARTITION     NAME     USER  ACCOUNT      STATE  PRIORITY       TIME TIME_LIMI  NODES   CPUS NODELIST(REASON)
            453271 xeon24_51 Cl_H2O_b     kasv   ecsvip    PENDING     56963       0:00 1-00:00:00      4     96 (Resources)
            455024 xeon24_51       d2  linjelv  camdvip    PENDING     55501       0:00 2-02:00:00      3     72 (Priority)
            454975 xeon24_51 Cl_H2O_b     kasv   ecsvip    PENDING     39465       0:00 1-00:00:00      4     96 (Priority)
            455302 xeon24_51 clean_be     kasv   ecsvip    PENDING     39041       0:00   6:00:00      4     96 (Priority)
            455411 xeon24_51 double_t     kasv   ecsvip    PENDING     38904       0:00 2-00:00:00      4     96 (Priority)
            455412 xeon24_51 double_1     kasv   ecsvip    PENDING     38897       0:00 2-00:00:00      4     96 (Priority)
            455415 xeon24_51 double_1     kasv   ecsvip    PENDING     38882       0:00 2-00:00:00      4     96 (Priority)

> Could you send my specification of Nodes=x[001-192]?

This data is already in slurm.conf; do you need further details?  Briefly, these are 24-core Broadwell servers with 256 GB RAM (nodes x[169-180] have 512 GB).  The nodes have an Intel Omni-Path 100 Gbit fabric.

Thanks,
Ole
Comment 7 Dominik Bartkiewicz 2018-03-19 08:01:22 MDT
Hi

I can reproduce this and I will let you know when we fix this.

Dominik
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2018-03-19 08:10:30 MDT
(In reply to Dominik Bartkiewicz from comment #7)
> I can reproduce this and I will let you know when we fix this.

Thanks Dominik, that was really fast!  If you can suggest any workaround for the jobs or slurm.conf to avoid the issue, this would be really helpful to us!

/Ole
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2018-03-19 08:13:08 MDT
(In reply to Dominik Bartkiewicz from comment #7)
> I can reproduce this and I will let you know when we fix this.

Could you kindly update the bug Status: UNCONFIRMED appropriately?  
Maybe you could update the old bug 3973 as well?

Thanks,
Ole
Comment 10 Dominik Bartkiewicz 2018-03-19 09:24:51 MDT
Hi

The cause of the problem is that slurmctld can set the wrong reason for jobs with "--exclusive=user". As far as I understand, this state isn't permanent and doesn't disturb scheduling.

Dominik
Comment 25 Dominik Bartkiewicz 2018-04-17 05:00:09 MDT
Hi

These two commits should solve this issue:
https://github.com/SchedMD/slurm/commit/e2a14b8d7f4f
https://github.com/SchedMD/slurm/commit/fc4e5ac9e056
These commits will be included in 17.11.6 and above.
I am closing this ticket as resolved.

Dominik
Comment 26 Sergey Meirovich 2018-08-22 13:49:10 MDT
*** Ticket 5198 has been marked as a duplicate of this ticket. ***