Ticket 3990

Summary: draining of two nodes outside of some partition prevents submission to that partition
Product: Slurm Reporter: Sergey Meirovich <sergey_meirovich>
Component: Scheduling Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INVALID QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: AMAT
Attachments: slurm.conf
slurmctld log
slurmctld log requested by Dominik Bartkiewicz in comment #9

Description Sergey Meirovich 2017-07-12 16:52:30 MDT

    
Comment 1 Sergey Meirovich 2017-07-12 16:53:34 MDT
Sorry,
I accidentally submitted the case too early.

Here is the problem:

-bash-4.1$ scontrol show partition=low
PartitionName=low
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dcalph[001-036,045-050,067-090]
   Priority=2 RootOnly=NO ReqResv=NO Shared=FORCE:1 PreemptMode=SUSPEND
   State=UP TotalCPUs=2056 TotalNodes=66 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

-bash-4.1$ sinfo | grep drain
-bash-4.1$  scontrol update NodeName=dcalph116 reason="fan replacement" state=drain
-bash-4.1$  scontrol update NodeName=dcalph120 reason="fan replacement" state=drain
-bash-4.1$ sbatch -p low -n 36 --wrap="srun hostname"
Submitted batch job 296564
-bash-4.1$ squeue -j 296564
JOBID  USER     ST PARTITION    NAME      COMMAND   SUBMIT_TIME  CPUS NODES NODELIST(REASON)                                  
296564 e154466  PD low          wrap      (null)    Jul 12 15:50 36   1     (ReqNodeNotAvail, UnavailableNodes:dcalph[116,120]
-bash-4.1$
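
As a cross-check of the listing above, the partition's node expression can be expanded to confirm that dcalph116 and dcalph120 are not members of partition low. The following is a minimal sketch, not Slurm code: expand_hostlist and node_in_hostlist are hypothetical helper names, and the parser only handles a single bracket group like the one shown here. On a live cluster, scontrol show hostnames performs the same expansion.

```shell
# Expand a simple hostlist such as "dcalph[001-036,045-050]" into one
# hostname per line. Hypothetical helper; single bracket group only.
expand_hostlist() {
    case $1 in
        *\[*\]) ;;                                # has a bracket group
        *) printf '%s\n' "$1"; return 0 ;;        # plain hostname
    esac
    hl_prefix=${1%%\[*}        # text before "["
    hl_body=${1#*\[}           # text after "["
    hl_body=${hl_body%\]}      # drop the trailing "]"
    # One range per line, then let awk count and zero-pad.
    printf '%s\n' "$hl_body" | tr ',' '\n' |
        awk -F- -v p="$hl_prefix" '{
            lo = $1
            hi = (NF > 1 ? $2 : $1)
            fmt = "%s%0" length(lo) "d\n"         # keep the padding width
            for (i = lo + 0; i <= hi + 0; i++) printf fmt, p, i
        }'
}

# Exit status 0 when node $1 is part of hostlist $2, 1 otherwise.
node_in_hostlist() {
    for hl_host in $(expand_hostlist "$2"); do
        [ "$hl_host" = "$1" ] && return 0
    done
    return 1
}
```

With the Nodes= value from the partition above, the expansion yields 66 names, which matches TotalNodes=66, and neither drained node is among them.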
Comment 2 Sergey Meirovich 2017-07-12 16:54:37 MDT
I understand that this version is unsupported, but this behaviour may have appeared after applying the patch for https://bugs.schedmd.com/show_bug.cgi?id=3824
Comment 3 Sergey Meirovich 2017-07-12 16:58:05 MDT
Created attachment 4908 [details]
slurm.conf
Comment 4 Sergey Meirovich 2017-07-12 16:59:37 MDT
Created attachment 4909 [details]
slurmctld log
Comment 5 Tim Wickberg 2017-07-12 17:08:28 MDT
I assume you're running a patched version of 15.08.12?
Comment 6 Sergey Meirovich 2017-07-12 17:13:49 MDT
Yes, but I don't think any of the patches touch scheduling.
Comment 7 Dominik Bartkiewicz 2017-07-13 03:29:06 MDT
Hi

Are you sure that 296564 should start immediately after submitting?
Could you send me the full output from squeue and sinfo?

Dominik
Comment 8 Sergey Meirovich 2017-07-13 11:41:27 MDT
Hi,

Yes. I am sure it should start immediately. I reproduced the issue one more time:

==============================================================================

-bash-4.1$ scontrol update NodeName=dcalph116 reason="fan replacement" state=drain
-bash-4.1$ scontrol update NodeName=dcalph120 reason="fan replacement" state=drain
-bash-4.1$ sbatch -p low -n 36 --wrap="srun hostname"
Submitted batch job 296759
-bash-4.1$ squeue -j 296759
JOBID  USER     ST PARTITION    NAME      COMMAND   SUBMIT_TIME  CPUS NODES NODELIST(REASON)                                  
296759 e154466  PD low          wrap      (null)    Jul 13 10:39 36   1     (ReqNodeNotAvail, UnavailableNodes:dcalph[116,120]
-bash-4.1$ sinfo
HOSTNAMES      STATE AVAIL FREE_MEM  MEMORY CPUS CPU_LOAD CPUS(A/I/O/T)
dcalph116   draining    up   240709  258219   44    22.00 22/0/22/44
dcalph120   draining    up   247367  258219   44    18.00 18/0/26/44
dcalph036      mixed    up    63165  128950   36    19.06 35/1/0/36
dcalph038      mixed    up   103010  128950   36    29.97 30/6/0/36
dcalph041      mixed    up    63722  128950   36    23.62 24/12/0/36
dcalph053      mixed    up   100508  128950   36    30.00 30/6/0/36
dcalph056      mixed    up    91362  128950   36    30.00 30/6/0/36
dcalph060      mixed    up    90157  128950   36    12.00 12/24/0/36
dcalph065      mixed    up    71400  258222   36     3.28 33/3/0/36
dcalph074      mixed    up   238021  258222   36    16.85 28/8/0/36
dcalph075      mixed    up     1552   64386   16     2.08 15/1/0/16
dcalph107      mixed    up    38320  258219   44    17.00 17/27/0/44
dcalph108      mixed    up    90785  258219   44    34.00 34/10/0/44
dcalph109      mixed    up    20045  258219   44    11.03 11/33/0/44
dcalph110      mixed    up    25382  258219   44    33.00 33/11/0/44
dcalph111      mixed    up    61267  258219   44    11.01 11/33/0/44
dcalph112      mixed    up    28699  258219   44    20.00 20/24/0/44
dcalph113      mixed    up   157114  258219   44    27.00 27/17/0/44
dcalph114      mixed    up    17801  258219   44    26.39 22/22/0/44
dcalph115      mixed    up   108144  258219   44    36.00 36/8/0/44
dcalph118      mixed    up   236600  258219   44    41.00 41/3/0/44
dcalph119      mixed    up   143639  258219   44    28.00 28/16/0/44
dcalph001  allocated    up    40950  128950   36    36.00 36/0/0/36
dcalph002  allocated    up    22417  128950   36    36.02 36/0/0/36
dcalph003  allocated    up     1648  128950   36    36.17 36/0/0/36
dcalph004  allocated    up    13083  128950   36    36.01 36/0/0/36
dcalph005  allocated    up    16612  128950   36    35.98 36/0/0/36
dcalph006  allocated    up     9587  128950   36    36.00 36/0/0/36
dcalph007  allocated    up    10562  128950   36    35.98 36/0/0/36
dcalph008  allocated    up    32058  128950   36    36.00 36/0/0/36
dcalph009  allocated    up    60309  258222   36    36.00 36/0/0/36
dcalph010  allocated    up    88959  258222   36    32.92 36/0/0/36
dcalph011  allocated    up    78177  258222   36     1.54 36/0/0/36
dcalph012  allocated    up   171519  258222   36    34.02 36/0/0/36
dcalph013  allocated    up    51616  128950   36    26.99 36/0/0/36
dcalph014  allocated    up    40634  128950   36    36.00 36/0/0/36
dcalph015  allocated    up    26290  128950   36    36.18 36/0/0/36
dcalph017  allocated    up    46985  128950   36     0.53 36/0/0/36
dcalph018  allocated    up    47868  128950   36    36.00 36/0/0/36
dcalph019  allocated    up    48807  128950   36    36.15 36/0/0/36
dcalph020  allocated    up     2159  128950   36    36.00 36/0/0/36
dcalph021  allocated    up    14916  128950   36    36.00 36/0/0/36
dcalph022  allocated    up    69988  128950   36    36.00 36/0/0/36
dcalph023  allocated    up    39179  128950   36    36.18 36/0/0/36
dcalph024  allocated    up     1326  128950   36    36.01 36/0/0/36
dcalph025  allocated    up     1396  128950   36    36.00 36/0/0/36
dcalph026  allocated    up    19818  128950   36    36.01 36/0/0/36
dcalph027  allocated    up    49040  128950   36    36.00 36/0/0/36
dcalph028  allocated    up    14279  128950   36    29.00 36/0/0/36
dcalph029  allocated    up    99113  128950   36    36.00 36/0/0/36
dcalph030  allocated    up     8574  128950   36    36.02 36/0/0/36
dcalph031  allocated    up    71386  128950   36    35.00 36/0/0/36
dcalph032  allocated    up    19968  128950   36    36.00 36/0/0/36
dcalph033  allocated    up    20410  128950   36    36.16 36/0/0/36
dcalph034  allocated    up    37547  128950   36    36.00 36/0/0/36
dcalph037  allocated    up    17024  128950   36    36.00 36/0/0/36
dcalph039  allocated    up    22707  128950   36    36.00 36/0/0/36
dcalph040  allocated    up    31548  128950   36    36.06 36/0/0/36
dcalph042  allocated    up    66197  128950   36    36.01 36/0/0/36
dcalph043  allocated    up    12817  128950   36    36.00 36/0/0/36
dcalph044  allocated    up   100179  128950   36    36.00 36/0/0/36
dcalph045  allocated    up    38445  128950   36    15.00 36/0/0/36
dcalph046  allocated    up    42078  128950   36    35.97 36/0/0/36
dcalph047  allocated    up    49033  128950   36    36.01 36/0/0/36
dcalph048  allocated    up    33439  128950   36    35.98 36/0/0/36
dcalph049  allocated    up    34099  128950   36    36.00 36/0/0/36
dcalph051  allocated    up     1018  128950   36    36.00 36/0/0/36
dcalph052  allocated    up    74521  128950   36    36.01 36/0/0/36
dcalph054  allocated    up     8713  128950   36    36.00 36/0/0/36
dcalph055  allocated    up    95274  128950   36    36.04 36/0/0/36
dcalph057  allocated    up    90972  128950   36    35.96 36/0/0/36
dcalph058  allocated    up    84192  128950   36    36.04 36/0/0/36
dcalph059  allocated    up    76218  128950   36    36.00 36/0/0/36
dcalph061  allocated    up    82655  258222   36    26.77 36/0/0/36
dcalph062  allocated    up   182135  258222   36    32.01 36/0/0/36
dcalph066  allocated    up    99176  258222   36    35.99 36/0/0/36
dcalph067  allocated    up     5715  258222   36    33.12 36/0/0/36
dcalph068  allocated    up   230290  258222   36    36.24 36/0/0/36
dcalph069  allocated    up   217337  258222   36    33.06 36/0/0/36
dcalph070  allocated    up   216534  258222   36    36.00 36/0/0/36
dcalph071  allocated    up   171538  258222   36    36.00 36/0/0/36
dcalph072  allocated    up   112350  258222   36    36.00 36/0/0/36
dcalph076  allocated    up    48422   64386   16    16.00 16/0/0/16
dcalph077  allocated    up    47656   64386   16    13.00 16/0/0/16
dcalph078  allocated    up    24149   64386   16    16.00 16/0/0/16
dcalph079  allocated    up    39163   64386   16    16.00 16/0/0/16
dcalph080  allocated    up    52725   64386   16    16.01 16/0/0/16
dcalph081  allocated    up    26973   64386   16    16.00 16/0/0/16
dcalph082  allocated    up    38953   64386   16    16.00 16/0/0/16
dcalph083  allocated    up    29874   64386   16    16.00 16/0/0/16
dcalph084  allocated    up    28241   64386   16     5.01 16/0/0/16
dcalph085  allocated    up    51598   64386   16    15.00 16/0/0/16
dcalph086  allocated    up    15353   64386   16     3.35 16/0/0/16
dcalph087  allocated    up    36209   64386   16    15.98 16/0/0/16
dcalph088  allocated    up    17896   64386   16     1.22 16/0/0/16
dcalph089  allocated    up    17428   64386   16     2.15 16/0/0/16
dcalph090  allocated    up    43410   64386   16    11.98 16/0/0/16
dcalph117  allocated    up   215827  258219   44    38.48 44/0/0/44
dcalph016       idle    up    45296  128950   36     0.02 36/0/0/36
dcalph035       idle    up    66938  128950   36     0.05 36/0/0/36
dcalph050       idle    up    34762  128947   36    12.93 36/0/0/36
dcalph063       idle    up   226246  258222   36     0.11 36/0/0/36
dcalph064       idle    up   121211  258222   36     0.04 36/0/0/36
dcalph073       idle    up    87857  258222   36     0.00 36/0/0/36
dcalph091       idle    up    36245   64386   16     0.01 0/16/0/16
dcalph092       idle    up    12428   64386   16     0.00 0/16/0/16
dcalph093       idle    up    14087   64386   16     0.02 0/16/0/16
dcalph094       idle    up     7856   64386   16     0.00 0/16/0/16
dcalph095       idle    up     4322   64386   16     0.00 0/16/0/16
dcalph096       idle    up    12875   64386   16     0.03 0/16/0/16
dcalph097       idle    up    39054   64386   16     0.00 0/16/0/16
dcalph098       idle    up    57514   64386   16     0.00 0/16/0/16
dcalph099       idle    up    16953   64386   16     0.00 0/16/0/16
dcalph100       idle    up    14369   64386   16     0.00 0/16/0/16
dcalph101       idle    up    14639   64386   16     0.00 0/16/0/16
dcalph102       idle    up    28729   64386   16     0.00 0/16/0/16
dcalph103       idle    up    10854   64386   16     0.04 0/16/0/16
dcalph104       idle    up    13521   64386   16     0.00 0/16/0/16
dcalph105       idle    up    27261   64386   16     0.00 0/16/0/16
dcalph106       idle    up    13924   64386   16     0.00 0/16/0/16
-bash-4.1$ squeue 
JOBID  USER     ST PARTITION    NAME      COMMAND   SUBMIT_TIME  CPUS NODES NODELIST(REASON)                                  
294560 x068334  R  normal       CRTRS 2016./crtrs.BpJul  3  9:38 8    1     dcalph010                                         
294563 x068334  R  normal       CRTRS 2016./crtrs.S3Jul  3  9:55 10   1     dcalph012                                         
294556 x068334  R  normal       CRTRS 2016./crtrs.lfJul  3  9:20 10   1     dcalph036                                         
294561 x068334  R  normal       CRTRS 2016./crtrs.FUJul  3  9:40 8    1     dcalph045                                         
294555 x068334  R  normal       CRTRS 2016./crtrs.0tJul  3  9:17 10   1     dcalph067                                         
210549 x068334  R  interact     VNC       /user/x068Mar 10  8:51 1    1     dcalph075                                         
295968 e116933  R  license      cfdace 14./tmp/tmp.CJul  8 19:06 8    1     dcalph010                                         
295965 e116933  R  license      cfdace 14./tmp/tmp.rJul  8 17:31 8    1     dcalph010                                         
296754 e116933  R  license      cfdace 14./tmp/tmp.EJul 13 10:26 8    1     dcalph045                                         
295929 e116933  R  license      cfdace 14./tmp/tmp.fJul  7 22:53 8    1     dcalph067                                         
295835 e116933  R  license      cfdace 14./tmp/tmp.WJul  7 12:36 8    3     dcalph[067-069]                                   
296710 e116533  PD normal       Mechanical/dat/usr/eJul 13  9:38 12   1     (AssocGrpCpuLimit)                                
296729 e116533  PD normal       Mechanical/dat/usr/eJul 13  9:49 12   1     (AssocGrpCpuLimit)                                
296732 e116533  PD normal       Mechanical/dat/usr/eJul 13  9:52 12   1     (AssocGrpCpuLimit)                                
296738 e116533  PD normal       Mechanical/dat/usr/eJul 13 10:03 12   1     (AssocGrpCpuLimit)                                
296742 e116533  PD normal       Mechanical/dat/usr/eJul 13 10:07 12   1     (AssocGrpCpuLimit)                                
296747 e116533  PD normal       Mechanical/dat/usr/eJul 13 10:18 12   1     (AssocGrpCpuLimit)                                
296751 e116533  PD normal       Mechanical/dat/usr/eJul 13 10:22 12   1     (AssocGrpCpuLimit)                                
296695 e116533  R  normal       Mechanical/dat/usr/eJul 13  9:20 12   1     dcalph036                                         
296698 e116533  R  normal       Mechanical/dat/usr/eJul 13  9:23 12   1     dcalph045                                         
296707 e116533  R  normal       Mechanical/dat/usr/eJul 13  9:34 12   1     dcalph074                                         
296590 e116533  R  license      cfdace 14./tmp/tmp.8Jul 12 17:20 12   1     dcalph087                                         
296686 e116533  R  normal       Mechanical/dat/usr/eJul 13  9:05 12   1     dcalph088                                         
296687 e116533  R  normal       Mechanical/dat/usr/eJul 13  9:09 12   1     dcalph089                                         
296591 e116533  R  license      cfdace 14./tmp/tmp.EJul 12 17:20 12   1     dcalph090                                         
287676 e119858  R  interact     VNC       /user/e119Jun 14 10:56 1    1     dcalph075                                         
287662 e119858  R  interact     VNC       /user/e119Jun 14 10:48 1    1     dcalph075                                         
295026 x041729  R  license      fluent    /tmp/tmp.4Jul  5  0:19 32   2     dcalph[012-013]                                   
296078 x041729  R  license      fluent    /tmp/tmp.sJul 10  4:45 32   2     dcalph[067-068]                                   
296079 x041729  R  license      fluent    /tmp/tmp.tJul 10  4:46 32   2     dcalph[068-069]                                   
296220 x052225  R  open         VG_KALE   /tmp/tmp.TJul 11  4:49 6    1     dcalph038                                         
281533 e143971  R  interact     VNC       /user/e143May 24  1:30 1    1     dcalph075                                         
296076 x070437  PD normal       Ge_30_14C /dat/usr/xJul 10  1:49 15   1     (AssocGrpCpuLimit)                                
296074 x070437  R  normal       Ge_17_14C /dat/usr/xJul 10  1:49 15   1     dcalph028                                         
296438 x070437  R  normal       VG_split7_/tmp/tmp.dJul 12  7:59 6    1     dcalph036                                         
295760 x070437  R  normal       VG_split6_/tmp/tmp.jJul  7  4:55 6    1     dcalph045                                         
296693 x070437  R  normal       Ge_12_14C /dat/usr/xJul 13  9:17 15   2     dcalph[083-084]                                   
296075 x070437  R  normal       Ge_22_14C /dat/usr/xJul 10  1:49 15   1     dcalph085                                         
296085 x072161  R  license      cfdace 14./tmp/tmp.hJul 10  8:58 8    1     dcalph069                                         
292932 x072161  R  interact     VNC       /dat/usr/xJun 26 22:24 1    1     dcalph075                                         
296415 x072161  R  license      cfdace 14./tmp/tmp.qJul 12  3:10 4    1     dcalph087                                         
296200 e120711  R  license      cfdace 14./tmp/tmp.cJul 11  1:15 8    1     dcalph010                                         
296656 e120711  R  license      cfdace 14./tmp/tmp.wJul 13  2:52 6    1     dcalph013                                         
296272 e120711  R  license      cfdace 14./tmp/tmp.jJul 11  9:56 6    1     dcalph013                                         
296396 e120711  R  license      cfdace 14./tmp/tmp.BJul 11 22:22 6    1     dcalph028                                         
296274 e120711  R  license      cfdace 14./tmp/tmp.nJul 11  9:56 6    1     dcalph028                                         
289556 e111472  R  interact     VNC       /user/e111Jun 20 15:21 1    1     dcalph075                                         
240545 e121045  R  interact     VNC       /user/e121Mar 29 23:25 1    1     dcalph075                                         
296759 e154466  CF low          wrap      (null)    Jul 13 10:39 36   7     dcalph[036,084-086,088-090]                       
216154 e154466  R  interact     VNC       /user/e154Mar 15 10:06 1    1     dcalph075                                         
296118 e153547  R  normal       VASP 5.4.1/tmp/tmp.uJul 10 13:09 180  5     dcalph[001-003,029-030]                           
296135 e153547  R  normal       VASP 5.4.1/tmp/tmp.YJul 10 14:19 180  5     dcalph[004-007,014]                               
296115 e153547  R  normal       VASP 5.4.1/tmp/tmp.xJul 10 12:48 180  5     dcalph[018-022]                                   
296133 e153547  R  normal       VASP 5.4.1/tmp/tmp.aJul 10 14:03 180  5     dcalph[023-027]                                   
292019 e153547  R  normal       wrap      (null)    Jun 24  1:10 8    1     dcalph028                                         
296394 e153547  R  open         VASP 5.4.1/tmp/tmp.ZJul 11 22:19 36   1     dcalph052                                         
292119 e153547  R  open         wrap      (null)    Jun 24  1:21 8    1     dcalph084                                         
296402 e153547  R  open         VASP 5.4.1/tmp/tmp.EJul 11 23:16 36   1     dcalph115                                         
296734 e153732  PD normal       VASP 5.4.1/tmp/tmp.jJul 13  9:59 72   2     (QOSGrpCpuLimit)                                  
296404 e153732  R  normal       VASP 5.4.1/tmp/tmp.VJul 11 23:21 72   2     dcalph[008-009]                                   
296406 e153732  R  normal       VASP 5.4.1/tmp/tmp.cJul 11 23:43 108  3     dcalph[032-034]                                   
296654 e153732  R  open         VASP 5.4.1/tmp/tmp.lJul 13  2:21 72   2     dcalph[039-040]                                   
296650 e153732  R  open         VASP 5.4.1/tmp/tmp.SJul 13  1:36 72   2     dcalph[043-044]                                   
277363 e121419  R  open         VNC       /dat/usr/eMay  9  1:13 2    1     dcalph012                                         
296127 e121419  R  license      cfdace 15./tmp/tmp.yJul 10 13:37 1    1     dcalph013                                         
296409 e121419  R  license      cfdace 15./tmp/tmp.vJul 12  0:41 1    1     dcalph028                                         
296088 e121419  R  license      cfdace 15./tmp/tmp.TJul 10  9:06 1    1     dcalph049                                         
296090 e121419  R  license      cfdace 15./tmp/tmp.uJul 10  9:11 1    1     dcalph069                                         
296089 e121419  R  license      cfdace 15./tmp/tmp.9Jul 10  9:06 1    1     dcalph069                                         
295878 e154414  S  low          VASP 5.4.1/tmp/tmp.dJul  7 17:20 108  5     dcalph[001-002,017-018,022]                       
295889 e154414  S  low          VASP 5.4.1/tmp/tmp.MJul  7 17:30 108  3     dcalph[003,027,030]                               
294122 e154414  S  low          VASP 5.4.1/tmp/tmp.EJun 30  0:08 108  3     dcalph[004,031-032]                               
295881 e154414  S  low          VASP 5.4.1/tmp/tmp.wJul  7 17:21 108  3     dcalph[005-007]                                   
295882 e154414  S  low          VASP 5.4.1/tmp/tmp.JJul  7 17:22 108  3     dcalph[008-009,072]                               
295883 e154414  S  low          VASP 5.4.1/tmp/tmp.JJul  7 17:22 108  3     dcalph[011,071,073]                               
295893 e154414  S  low          VASP 5.4.1/tmp/tmp.MJul  7 17:31 108  3     dcalph[014,016,050]                               
295886 e154414  S  low          VASP 5.4.1/tmp/tmp.hJul  7 17:28 108  3     dcalph[020-021,023]                               
295888 e154414  S  low          VASP 5.4.1/tmp/tmp.2Jul  7 17:29 108  3     dcalph[024-025,033]                               
294123 e154414  S  low          VASP 5.4.1/tmp/tmp.sJun 30  0:21 108  3     dcalph[026,035,070]                               
295894 e154414  R  low          VASP 5.4.1/tmp/tmp.tJul  7 17:32 108  3     dcalph[046-048]                                   
295971 e154414  R  normal       VASP 5.4.1/tmp/tmp.xJul  8 20:06 108  3     dcalph[070-072]                                   
249905 e154414  R  interact     VNC       /dat/usr/eApr 10 15:07 1    1     dcalph075                                         
296757 e119661  R  license      cfdace 14./tmp/tmp.6Jul 13 10:37 1    1     dcalph017                                         
296758 e119661  R  license      cfdace 14./tmp/tmp.GJul 13 10:37 1    1     dcalph045                                         
209097 e119661  R  interact     VNC       /user/e119Mar  7 11:10 1    1     dcalph075                                         
288719 x045157  S  open         my_2D_P4_1/dat/usr/xJun 19  8:53 113  5     dcalph[061-065]                                   
295763 x045157  R  normal       my_2D_P4_1/dat/usr/xJul  7  6:53 113  8     dcalph[076-083]                                   
284045 e120594  R  license      cfdace 14./tmp/tmp.AMay 31 17:28 4    1     dcalph010                                         
290547 e120594  R  license      cfdace 14./tmp/tmp.XJun 22 15:59 8    3     dcalph[012-013,045]                               
296360 e120594  R  license      cfdace 14./tmp/tmp.wJul 11 14:58 8    1     dcalph013                                         
284602 e120594  R  license      cfdace 14./tmp/tmp.5Jun  2 14:04 4    1     dcalph036                                         
285760 e120594  R  license      cfdace 14./tmp/tmp.nJun  7 10:30 4    1     dcalph067                                         
284791 e120594  R  license      cfdace 14./tmp/tmp.hJun  2 18:14 4    1     dcalph069                                         
285766 e120594  R  license      cfdace 14./tmp/tmp.kJun  7 10:31 4    1     dcalph077                                         
296697 e156059  R  license      Mechanical/dat/usr/eJul 13  9:22 36   1     dcalph011                                         
210071 e156059  R  interact     VNC       /user/e156Mar  9 11:05 1    1     dcalph075                                         
296744 e156425  R  open         x892736566/dat/usr/eJul 13 10:07 8    1     dcalph041                                         
296743 e156425  R  open         x892736566/dat/usr/eJul 13 10:07 8    1     dcalph117                                         
294404 e119448  R  interact     VNC       /user/e119Jul  1 14:11 1    1     dcalph075                                         
294329 e119448  R  interact     VNC       /user/e119Jun 30 15:25 1    1     dcalph075                                         
296736 e157618  PD open         VASP 5.4.4/tmp/tmp.hJul 13 10:02 36   1     (Priority)                                        
296740 e157618  PD open         VASP 5.4.4/tmp/tmp.7Jul 13 10:05 36   1     (Priority)                                        
296746 e157618  PD open         VASP 5.4.4/tmp/tmp.bJul 13 10:10 36   1     (Priority)                                        
296752 e157618  PD open         VASP 5.4.4/tmp/tmp.uJul 13 10:22 36   1     (Priority)                                        
296709 e157618  PD normal       VASP 5.4.4/tmp/tmp.WJul 13  9:37 36   1     (QOSGrpCpuLimit)                                  
296700 e157618  R  normal       VASP 5.4.4/tmp/tmp.1Jul 13  9:30 36   1     dcalph015                                         
296706 e157618  R  normal       VNC       /user/e157Jul 13  9:33 1    1     dcalph017                                         
296131 e157618  R  normal       VASP 5.4.4/tmp/tmp.WJul 10 13:48 35   1     dcalph031                                         
296548 e157618  R  open         VASP 5.4.4/tmp/tmp.3Jul 12 15:35 36   1     dcalph037                                         
295954 e157618  R  open         VASP 5.4.4/tmp/tmp.GJul  8 16:00 35   1     dcalph049                                         
296688 e157618  R  open         VASP 5.4.4/tmp/tmp.QJul 13  9:09 36   1     dcalph054                                         
296724 e157618  R  open         VASP 5.4.4/tmp/tmp.JJul 13  9:46 36   1     dcalph055                                         
296730 e157618  R  open         VASP 5.4.4/tmp/tmp.WJul 13  9:50 36   1     dcalph057                                         
296731 e157618  R  open         VASP 5.4.4/tmp/tmp.TJul 13  9:52 36   1     dcalph059                                         
296722 e157618  R  open         VASP 5.4.4/tmp/tmp.pJul 13  9:45 36   1     dcalph066                                         
277024 e157653  R  interact     VNC       /user/e157May  7 19:54 1    1     dcalph075                                         
296719 x057829  R  open         g09       /tmp/tmp.kJul 13  9:43 8    1     dcalph041                                         
296673 x057829  R  open         g09       /tmp/tmp.nJul 13  8:10 8    1     dcalph074                                         
296670 x057829  R  open         g09       /tmp/tmp.0Jul 13  7:47 8    1     dcalph074                                         
296739 e158714  PD open         insertedsu/tmp/tmp.xJul 13 10:04 36   1     (Resources)                                       
296755 e158714  PD open         insertedsu/tmp/tmp.XJul 13 10:28 36   1     (Priority)                                        
296692 e158714  R  open         insertedsu/tmp/tmp.wJul 13  9:16 36   1     dcalph042                                         
296683 e158714  R  open         subs_x_sio/tmp/tmp.DJul 13  8:51 36   1     dcalph117                                         
209429 e116763  R  interact     VNC       /user/e116Mar  8 10:22 1    1     dcalph075                                         
-bash-4.1$
Comment 9 Dominik Bartkiewicz 2017-07-14 06:11:07 MDT
Hi,

In the squeue output I noticed that 296759 is in the configuring (CF) state.
Could you also attach slurmctld.log?
Before version 16.05, assoc_limit_continue was not the default behavior, so the main scheduler does not start a job on nodes from a higher-priority partition if any other job is pending on that higher-priority partition.

Dominik
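
The two observations in this comment (the state of one job, and whether other jobs are pending in a given partition) can be read out of a long squeue listing with a small filter. This is only a sketch: job_state and pending_in_partition are hypothetical helper names, and they assume the column order used in this ticket's listings (JOBID, USER, ST, PARTITION, ...), which is a site-specific format and not the squeue default.

```shell
# Print the ST column for job ID $1 from squeue-style text on stdin.
# Assumes ST is the third whitespace-separated column, as in this ticket.
job_state() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Count jobs in state PD for partition $1 from squeue-style text on stdin.
# Assumes PARTITION is the fourth column, as in this ticket.
pending_in_partition() {
    awk -v part="$1" '$3 == "PD" && $4 == part { n++ } END { print n + 0 }'
}
```

For example, piping a saved listing through job_state 296759 prints the state of that job. Independent of the column layout, squeue -h -j 296759 -o %t asks squeue for the compact state directly.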
Comment 10 Sergey Meirovich 2017-07-14 11:37:45 MDT
Created attachment 4924 [details]
slurmctld log requeted by Dominik Bartkiewicz in comment #9

(In reply to Dominik Bartkiewicz from comment #9)
> Hi,
> 
> on squeue output I notice that 296759 is in configuring state.
> Could you also attach slurmctld.log?

Here it goes.

> Before version 16.05, assoc_limit_continue wasn’t default behavior therefore
> main scheduler doesn't start a job on nodes from higher prio part if any
> different job is pending on that higher prio partition.
> 
> Dominik
Comment 11 Dominik Bartkiewicz 2017-07-17 10:02:17 MDT
Hi

According to slurmctld.log, this job was started by the backfill scheduler 16 seconds after it was submitted. This looks like normal Slurm behavior and not like something introduced by the patch from
https://bugs.schedmd.com/show_bug.cgi?id=3824

Dominik
Comment 12 Sergey Meirovich 2017-07-17 17:21:17 MDT
Dominik,

You are correct. In fact, I was trying to reproduce the behavior of job 296513, which was indeed pending for at least several minutes.
Comment 13 Dominik Bartkiewicz 2017-07-18 04:43:39 MDT
Hi

I can't recreate this on my test system.
Without more information it will be difficult to find the reason why 296513 was pending for so long. If it happens again, could you capture the squeue output and slurmctld.log with the SelectType debug flag enabled? An increased debug level would also be useful.

scontrol setdebugflag +SelectType
scontrol setdebugflag -SelectType

Dominik
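
The capture requested above could be scripted roughly as follows. This is a sketch under stated assumptions, not a SchedMD-provided procedure: run and capture_pending_job are hypothetical helper names, the output directory layout and the 60-second wait are arbitrary, the block uses the documented spelling scontrol setdebugflags, and it assumes the controller normally runs at debug level info (adjust the restore step if it does not). Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Collect scheduler diagnostics for one pending job.
# $1 = job ID, $2 = output directory (hypothetical layout).
capture_pending_job() {
    cap_jobid=$1
    cap_outdir=$2
    mkdir -p "$cap_outdir" || return 1

    # Turn on extra slurmctld logging.
    run scontrol setdebugflags +SelectType
    run scontrol setdebug debug2

    # Snapshot the queue and the job while the extra logging is active.
    run squeue > "$cap_outdir/squeue.txt"
    run scontrol show job "$cap_jobid" > "$cap_outdir/job_$cap_jobid.txt"

    # Give the scheduler time to run a few cycles so they land in the log.
    run sleep "${WAIT_SECS:-60}"

    # Restore logging (assumes the normal level is "info").
    run scontrol setdebug info
    run scontrol setdebugflags -SelectType
}
```

A dry run such as DRY_RUN=1 capture_pending_job 296513 /tmp/slurm-diag shows the sequence without touching the controller. The slurmctld.log itself still has to be collected from the controller host afterwards.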
Comment 14 Dominik Bartkiewicz 2017-08-02 04:26:36 MDT
Hi

Any news?

Dominik
Comment 15 Sergey Meirovich 2017-08-02 17:11:45 MDT
I could not reproduce that bug. Not sure what it was. Anyway, closing it.

Thank you!