| Summary: | draining of two nodes outside of some partition prevents submission to that partition | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Alineos Sites: | --- |
| Attachments: |
slurm.conf
slurmctld log
slurmctld log requested by Dominik Bartkiewicz in comment #9 |
||
|
Description
Sergey Meirovich
2017-07-12 16:52:30 MDT
Sorry, I accidentally submitted the case too early. Here is the problem:

-bash-4.1$ scontrol show partition=low
PartitionName=low
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=dcalph[001-036,045-050,067-090]
Priority=2 RootOnly=NO ReqResv=NO Shared=FORCE:1 PreemptMode=SUSPEND
State=UP TotalCPUs=2056 TotalNodes=66 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
-bash-4.1$ sinfo | grep drain
-bash-4.1$ scontrol update NodeName=dcalph116 reason="fan replacement" state=drain
-bash-4.1$ scontrol update NodeName=dcalph120 reason="fan replacement" state=drain
-bash-4.1$ sbatch -p low -n 36 --wrap="srun hostname"
Submitted batch job 296564
-bash-4.1$ squeue -j 296564
JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS NODES NODELIST(REASON)
296564 e154466 PD low wrap (null) Jul 12 15:50 36 1 (ReqNodeNotAvail, UnavailableNodes:dcalph[116,120]
-bash-4.1$

I understand that this version is unsupported, but this behaviour may have appeared after the patch for https://bugs.schedmd.com/show_bug.cgi?id=3824

Created attachment 4908 [details]
slurm.conf
Created attachment 4909 [details]
slurmctld log
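The pending reason in the description names dcalph[116,120], while the `Nodes=` list of partition `low` is dcalph[001-036,045-050,067-090]. A minimal sketch for checking that membership by hand follows. `expand_hostlist` and `in_partition` are hypothetical helper names, not Slurm commands, and the expansion assumes a single bracket group; on a real cluster, `scontrol show hostnames` performs the expansion properly.

```shell
#!/bin/sh
# expand_hostlist (hypothetical helper): expand a simple hostlist such as
# "dcalph[001-036,045-050]" into one hostname per line.
expand_hostlist() {
    printf '%s\n' "$1" | awk '{
        lb = index($0, "[")
        if (lb == 0) { print; next }            # plain hostname, no ranges
        prefix = substr($0, 1, lb - 1)
        body = substr($0, lb + 1)
        sub(/\].*$/, "", body)                  # drop the closing bracket
        n = split(body, parts, ",")
        for (p = 1; p <= n; p++) {
            m = split(parts[p], r, "-")
            lo = r[1]
            hi = (m > 1) ? r[2] : r[1]
            w = length(lo)
            fmt = "%s%0" w "d\n"                # keep the zero padding
            for (i = lo + 0; i <= hi + 0; i++) printf fmt, prefix, i
        }
    }'
}

# in_partition (hypothetical helper): succeed if host $1 is in hostlist $2.
in_partition() {
    expand_hostlist "$2" | grep -qxF -- "$1"
}

low_nodes='dcalph[001-036,045-050,067-090]'     # Nodes= value from the report
for node in $(expand_hostlist 'dcalph[116,120]'); do
    if in_partition "$node" "$low_nodes"; then
        echo "$node: in partition low"
    else
        echo "$node: not in partition low"
    fi
done
```

Both drained nodes fall outside the partition's node list, which is what makes the `UnavailableNodes` reason on a `low` job look surprising.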
I assume you're running a patched version of 15.08.12?

Yes. But I don't think anything in it touches scheduling.

Hi

Are you sure that 296564 should start immediately after submission? Could you send me the full output from squeue and sinfo?

Dominik

Hi,

Yes, I am sure it should start immediately. I reproduced the issue one more time:

==============================================================================
-bash-4.1$ scontrol update NodeName=dcalph116 reason="fan replacement" state=drain
-bash-4.1$ scontrol update NodeName=dcalph120 reason="fan replacement" state=drain
-bash-4.1$ sbatch -p low -n 36 --wrap="srun hostname"
Submitted batch job 296759
-bash-4.1$ squeue -j 296759
JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS NODES NODELIST(REASON)
296759 e154466 PD low wrap (null) Jul 13 10:39 36 1 (ReqNodeNotAvail, UnavailableNodes:dcalph[116,120]
-bash-4.1$ sinfo
HOSTNAMES STATE AVAIL FREE_MEM MEMORY CPUS CPU_LOAD CPUS(A/I/O/T)
dcalph116 draining up 240709 258219 44 22.00 22/0/22/44
dcalph120 draining up 247367 258219 44 18.00 18/0/26/44
dcalph036 mixed up 63165 128950 36 19.06 35/1/0/36
dcalph038 mixed up 103010 128950 36 29.97 30/6/0/36
dcalph041 mixed up 63722 128950 36 23.62 24/12/0/36
dcalph053 mixed up 100508 128950 36 30.00 30/6/0/36
dcalph056 mixed up 91362 128950 36 30.00 30/6/0/36
dcalph060 mixed up 90157 128950 36 12.00 12/24/0/36
dcalph065 mixed up 71400 258222 36 3.28 33/3/0/36
dcalph074 mixed up 238021 258222 36 16.85 28/8/0/36
dcalph075 mixed up 1552 64386 16 2.08 15/1/0/16
dcalph107 mixed up 38320 258219 44 17.00 17/27/0/44
dcalph108 mixed up 90785 258219 44 34.00 34/10/0/44
dcalph109 mixed up 20045 258219 44 11.03 11/33/0/44
dcalph110 mixed up 25382 258219 44 33.00 33/11/0/44
dcalph111 mixed up 61267 258219 44 11.01 11/33/0/44
dcalph112 mixed up 28699 258219 44 20.00 20/24/0/44
dcalph113 mixed up 157114 258219 44 27.00 27/17/0/44
dcalph114 mixed up 17801 258219 44 26.39 22/22/0/44
dcalph115 mixed up 108144 258219 44 36.00 36/8/0/44
dcalph118 mixed up 236600 258219 44 41.00 41/3/0/44
dcalph119 mixed up 143639 258219 44 28.00 28/16/0/44
dcalph001 allocated up 40950 128950 36 36.00 36/0/0/36
dcalph002 allocated up 22417 128950 36 36.02 36/0/0/36
dcalph003 allocated up 1648 128950 36 36.17 36/0/0/36
dcalph004 allocated up 13083 128950 36 36.01 36/0/0/36
dcalph005 allocated up 16612 128950 36 35.98 36/0/0/36
dcalph006 allocated up 9587 128950 36 36.00 36/0/0/36
dcalph007 allocated up 10562 128950 36 35.98 36/0/0/36
dcalph008 allocated up 32058 128950 36 36.00 36/0/0/36
dcalph009 allocated up 60309 258222 36 36.00 36/0/0/36
dcalph010 allocated up 88959 258222 36 32.92 36/0/0/36
dcalph011 allocated up 78177 258222 36 1.54 36/0/0/36
dcalph012 allocated up 171519 258222 36 34.02 36/0/0/36
dcalph013 allocated up 51616 128950 36 26.99 36/0/0/36
dcalph014 allocated up 40634 128950 36 36.00 36/0/0/36
dcalph015 allocated up 26290 128950 36 36.18 36/0/0/36
dcalph017 allocated up 46985 128950 36 0.53 36/0/0/36
dcalph018 allocated up 47868 128950 36 36.00 36/0/0/36
dcalph019 allocated up 48807 128950 36 36.15 36/0/0/36
dcalph020 allocated up 2159 128950 36 36.00 36/0/0/36
dcalph021 allocated up 14916 128950 36 36.00 36/0/0/36
dcalph022 allocated up 69988 128950 36 36.00 36/0/0/36
dcalph023 allocated up 39179 128950 36 36.18 36/0/0/36
dcalph024 allocated up 1326 128950 36 36.01 36/0/0/36
dcalph025 allocated up 1396 128950 36 36.00 36/0/0/36
dcalph026 allocated up 19818 128950 36 36.01 36/0/0/36
dcalph027 allocated up 49040 128950 36 36.00 36/0/0/36
dcalph028 allocated up 14279 128950 36 29.00 36/0/0/36
dcalph029 allocated up 99113 128950 36 36.00 36/0/0/36
dcalph030 allocated up 8574 128950 36 36.02 36/0/0/36
dcalph031 allocated up 71386 128950 36 35.00 36/0/0/36
dcalph032 allocated up 19968 128950 36 36.00 36/0/0/36
dcalph033 allocated up 20410 128950 36 36.16 36/0/0/36
dcalph034 allocated up 37547 128950 36 36.00 36/0/0/36
dcalph037 allocated up 17024 128950 36 36.00 36/0/0/36
dcalph039 allocated up 22707 128950 36 36.00 36/0/0/36
dcalph040 allocated up 31548 128950 36 36.06 36/0/0/36
dcalph042 allocated up 66197 128950 36 36.01 36/0/0/36
dcalph043 allocated up 12817 128950 36 36.00 36/0/0/36
dcalph044 allocated up 100179 128950 36 36.00 36/0/0/36
dcalph045 allocated up 38445 128950 36 15.00 36/0/0/36
dcalph046 allocated up 42078 128950 36 35.97 36/0/0/36
dcalph047 allocated up 49033 128950 36 36.01 36/0/0/36
dcalph048 allocated up 33439 128950 36 35.98 36/0/0/36
dcalph049 allocated up 34099 128950 36 36.00 36/0/0/36
dcalph051 allocated up 1018 128950 36 36.00 36/0/0/36
dcalph052 allocated up 74521 128950 36 36.01 36/0/0/36
dcalph054 allocated up 8713 128950 36 36.00 36/0/0/36
dcalph055 allocated up 95274 128950 36 36.04 36/0/0/36
dcalph057 allocated up 90972 128950 36 35.96 36/0/0/36
dcalph058 allocated up 84192 128950 36 36.04 36/0/0/36
dcalph059 allocated up 76218 128950 36 36.00 36/0/0/36
dcalph061 allocated up 82655 258222 36 26.77 36/0/0/36
dcalph062 allocated up 182135 258222 36 32.01 36/0/0/36
dcalph066 allocated up 99176 258222 36 35.99 36/0/0/36
dcalph067 allocated up 5715 258222 36 33.12 36/0/0/36
dcalph068 allocated up 230290 258222 36 36.24 36/0/0/36
dcalph069 allocated up 217337 258222 36 33.06 36/0/0/36
dcalph070 allocated up 216534 258222 36 36.00 36/0/0/36
dcalph071 allocated up 171538 258222 36 36.00 36/0/0/36
dcalph072 allocated up 112350 258222 36 36.00 36/0/0/36
dcalph076 allocated up 48422 64386 16 16.00 16/0/0/16
dcalph077 allocated up 47656 64386 16 13.00 16/0/0/16
dcalph078 allocated up 24149 64386 16 16.00 16/0/0/16
dcalph079 allocated up 39163 64386 16 16.00 16/0/0/16
dcalph080 allocated up 52725 64386 16 16.01 16/0/0/16
dcalph081 allocated up 26973 64386 16 16.00 16/0/0/16
dcalph082 allocated up 38953 64386 16 16.00 16/0/0/16
dcalph083 allocated up 29874 64386 16 16.00 16/0/0/16
dcalph084 allocated up 28241 64386 16 5.01 16/0/0/16
dcalph085 allocated up 51598 64386 16 15.00 16/0/0/16
dcalph086 allocated up 15353 64386 16 3.35 16/0/0/16
dcalph087 allocated up 36209 64386 16 15.98 16/0/0/16
dcalph088 allocated up 17896 64386 16 1.22 16/0/0/16
dcalph089 allocated up 17428 64386 16 2.15 16/0/0/16
dcalph090 allocated up 43410 64386 16 11.98 16/0/0/16
dcalph117 allocated up 215827 258219 44 38.48 44/0/0/44
dcalph016 idle up 45296 128950 36 0.02 36/0/0/36
dcalph035 idle up 66938 128950 36 0.05 36/0/0/36
dcalph050 idle up 34762 128947 36 12.93 36/0/0/36
dcalph063 idle up 226246 258222 36 0.11 36/0/0/36
dcalph064 idle up 121211 258222 36 0.04 36/0/0/36
dcalph073 idle up 87857 258222 36 0.00 36/0/0/36
dcalph091 idle up 36245 64386 16 0.01 0/16/0/16
dcalph092 idle up 12428 64386 16 0.00 0/16/0/16
dcalph093 idle up 14087 64386 16 0.02 0/16/0/16
dcalph094 idle up 7856 64386 16 0.00 0/16/0/16
dcalph095 idle up 4322 64386 16 0.00 0/16/0/16
dcalph096 idle up 12875 64386 16 0.03 0/16/0/16
dcalph097 idle up 39054 64386 16 0.00 0/16/0/16
dcalph098 idle up 57514 64386 16 0.00 0/16/0/16
dcalph099 idle up 16953 64386 16 0.00 0/16/0/16
dcalph100 idle up 14369 64386 16 0.00 0/16/0/16
dcalph101 idle up 14639 64386 16 0.00 0/16/0/16
dcalph102 idle up 28729 64386 16 0.00 0/16/0/16
dcalph103 idle up 10854 64386 16 0.04 0/16/0/16
dcalph104 idle up 13521 64386 16 0.00 0/16/0/16
dcalph105 idle up 27261 64386 16 0.00 0/16/0/16
dcalph106 idle up 13924 64386 16 0.00 0/16/0/16
-bash-4.1$ squeue
JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS NODES NODELIST(REASON)
294560 x068334 R normal CRTRS 2016./crtrs.BpJul 3 9:38 8 1 dcalph010
294563 x068334 R normal CRTRS 2016./crtrs.S3Jul 3 9:55 10 1 dcalph012
294556 x068334 R normal CRTRS 2016./crtrs.lfJul 3 9:20 10 1 dcalph036
294561 x068334 R normal CRTRS 2016./crtrs.FUJul 3 9:40 8 1 dcalph045
294555 x068334 R normal CRTRS 2016./crtrs.0tJul 3 9:17 10 1 dcalph067
210549 x068334 R interact VNC /user/x068Mar 10 8:51 1 1 dcalph075
295968 e116933 R license cfdace 14./tmp/tmp.CJul 8 19:06 8 1 dcalph010
295965 e116933 R license cfdace 14./tmp/tmp.rJul 8 17:31 8 1 dcalph010
296754 e116933 R license cfdace 14./tmp/tmp.EJul 13 10:26 8 1 dcalph045
295929 e116933 R license cfdace 14./tmp/tmp.fJul 7 22:53 8 1 dcalph067
295835 e116933 R license cfdace 14./tmp/tmp.WJul 7 12:36 8 3 dcalph[067-069]
296710 e116533 PD normal Mechanical/dat/usr/eJul 13 9:38 12 1 (AssocGrpCpuLimit)
296729 e116533 PD normal Mechanical/dat/usr/eJul 13 9:49 12 1 (AssocGrpCpuLimit)
296732 e116533 PD normal Mechanical/dat/usr/eJul 13 9:52 12 1 (AssocGrpCpuLimit)
296738 e116533 PD normal Mechanical/dat/usr/eJul 13 10:03 12 1 (AssocGrpCpuLimit)
296742 e116533 PD normal Mechanical/dat/usr/eJul 13 10:07 12 1 (AssocGrpCpuLimit)
296747 e116533 PD normal Mechanical/dat/usr/eJul 13 10:18 12 1 (AssocGrpCpuLimit)
296751 e116533 PD normal Mechanical/dat/usr/eJul 13 10:22 12 1 (AssocGrpCpuLimit)
296695 e116533 R normal Mechanical/dat/usr/eJul 13 9:20 12 1 dcalph036
296698 e116533 R normal Mechanical/dat/usr/eJul 13 9:23 12 1 dcalph045
296707 e116533 R normal Mechanical/dat/usr/eJul 13 9:34 12 1 dcalph074
296590 e116533 R license cfdace 14./tmp/tmp.8Jul 12 17:20 12 1 dcalph087
296686 e116533 R normal Mechanical/dat/usr/eJul 13 9:05 12 1 dcalph088
296687 e116533 R normal Mechanical/dat/usr/eJul 13 9:09 12 1 dcalph089
296591 e116533 R license cfdace 14./tmp/tmp.EJul 12 17:20 12 1 dcalph090
287676 e119858 R interact VNC /user/e119Jun 14 10:56 1 1 dcalph075
287662 e119858 R interact VNC /user/e119Jun 14 10:48 1 1 dcalph075
295026 x041729 R license fluent /tmp/tmp.4Jul 5 0:19 32 2 dcalph[012-013]
296078 x041729 R license fluent /tmp/tmp.sJul 10 4:45 32 2 dcalph[067-068]
296079 x041729 R license fluent /tmp/tmp.tJul 10 4:46 32 2 dcalph[068-069]
296220 x052225 R open VG_KALE /tmp/tmp.TJul 11 4:49 6 1 dcalph038
281533 e143971 R interact VNC /user/e143May 24 1:30 1 1 dcalph075
296076 x070437 PD normal Ge_30_14C /dat/usr/xJul 10 1:49 15 1 (AssocGrpCpuLimit)
296074 x070437 R normal Ge_17_14C /dat/usr/xJul 10 1:49 15 1 dcalph028
296438 x070437 R normal VG_split7_/tmp/tmp.dJul 12 7:59 6 1 dcalph036
295760 x070437 R normal VG_split6_/tmp/tmp.jJul 7 4:55 6 1 dcalph045
296693 x070437 R normal Ge_12_14C /dat/usr/xJul 13 9:17 15 2 dcalph[083-084]
296075 x070437 R normal Ge_22_14C /dat/usr/xJul 10 1:49 15 1 dcalph085
296085 x072161 R license cfdace 14./tmp/tmp.hJul 10 8:58 8 1 dcalph069
292932 x072161 R interact VNC /dat/usr/xJun 26 22:24 1 1 dcalph075
296415 x072161 R license cfdace 14./tmp/tmp.qJul 12 3:10 4 1 dcalph087
296200 e120711 R license cfdace 14./tmp/tmp.cJul 11 1:15 8 1 dcalph010
296656 e120711 R license cfdace 14./tmp/tmp.wJul 13 2:52 6 1 dcalph013
296272 e120711 R license cfdace 14./tmp/tmp.jJul 11 9:56 6 1 dcalph013
296396 e120711 R license cfdace 14./tmp/tmp.BJul 11 22:22 6 1 dcalph028
296274 e120711 R license cfdace 14./tmp/tmp.nJul 11 9:56 6 1 dcalph028
289556 e111472 R interact VNC /user/e111Jun 20 15:21 1 1 dcalph075
240545 e121045 R interact VNC /user/e121Mar 29 23:25 1 1 dcalph075
296759 e154466 CF low wrap (null) Jul 13 10:39 36 7 dcalph[036,084-086,088-090]
216154 e154466 R interact VNC /user/e154Mar 15 10:06 1 1 dcalph075
296118 e153547 R normal VASP 5.4.1/tmp/tmp.uJul 10 13:09 180 5 dcalph[001-003,029-030]
296135 e153547 R normal VASP 5.4.1/tmp/tmp.YJul 10 14:19 180 5 dcalph[004-007,014]
296115 e153547 R normal VASP 5.4.1/tmp/tmp.xJul 10 12:48 180 5 dcalph[018-022]
296133 e153547 R normal VASP 5.4.1/tmp/tmp.aJul 10 14:03 180 5 dcalph[023-027]
292019 e153547 R normal wrap (null) Jun 24 1:10 8 1 dcalph028
296394 e153547 R open VASP 5.4.1/tmp/tmp.ZJul 11 22:19 36 1 dcalph052
292119 e153547 R open wrap (null) Jun 24 1:21 8 1 dcalph084
296402 e153547 R open VASP 5.4.1/tmp/tmp.EJul 11 23:16 36 1 dcalph115
296734 e153732 PD normal VASP 5.4.1/tmp/tmp.jJul 13 9:59 72 2 (QOSGrpCpuLimit)
296404 e153732 R normal VASP 5.4.1/tmp/tmp.VJul 11 23:21 72 2 dcalph[008-009]
296406 e153732 R normal VASP 5.4.1/tmp/tmp.cJul 11 23:43 108 3 dcalph[032-034]
296654 e153732 R open VASP 5.4.1/tmp/tmp.lJul 13 2:21 72 2 dcalph[039-040]
296650 e153732 R open VASP 5.4.1/tmp/tmp.SJul 13 1:36 72 2 dcalph[043-044]
277363 e121419 R open VNC /dat/usr/eMay 9 1:13 2 1 dcalph012
296127 e121419 R license cfdace 15./tmp/tmp.yJul 10 13:37 1 1 dcalph013
296409 e121419 R license cfdace 15./tmp/tmp.vJul 12 0:41 1 1 dcalph028
296088 e121419 R license cfdace 15./tmp/tmp.TJul 10 9:06 1 1 dcalph049
296090 e121419 R license cfdace 15./tmp/tmp.uJul 10 9:11 1 1 dcalph069
296089 e121419 R license cfdace 15./tmp/tmp.9Jul 10 9:06 1 1 dcalph069
295878 e154414 S low VASP 5.4.1/tmp/tmp.dJul 7 17:20 108 5 dcalph[001-002,017-018,022]
295889 e154414 S low VASP 5.4.1/tmp/tmp.MJul 7 17:30 108 3 dcalph[003,027,030]
294122 e154414 S low VASP 5.4.1/tmp/tmp.EJun 30 0:08 108 3 dcalph[004,031-032]
295881 e154414 S low VASP 5.4.1/tmp/tmp.wJul 7 17:21 108 3 dcalph[005-007]
295882 e154414 S low VASP 5.4.1/tmp/tmp.JJul 7 17:22 108 3 dcalph[008-009,072]
295883 e154414 S low VASP 5.4.1/tmp/tmp.JJul 7 17:22 108 3 dcalph[011,071,073]
295893 e154414 S low VASP 5.4.1/tmp/tmp.MJul 7 17:31 108 3 dcalph[014,016,050]
295886 e154414 S low VASP 5.4.1/tmp/tmp.hJul 7 17:28 108 3 dcalph[020-021,023]
295888 e154414 S low VASP 5.4.1/tmp/tmp.2Jul 7 17:29 108 3 dcalph[024-025,033]
294123 e154414 S low VASP 5.4.1/tmp/tmp.sJun 30 0:21 108 3 dcalph[026,035,070]
295894 e154414 R low VASP 5.4.1/tmp/tmp.tJul 7 17:32 108 3 dcalph[046-048]
295971 e154414 R normal VASP 5.4.1/tmp/tmp.xJul 8 20:06 108 3 dcalph[070-072]
249905 e154414 R interact VNC /dat/usr/eApr 10 15:07 1 1 dcalph075
296757 e119661 R license cfdace 14./tmp/tmp.6Jul 13 10:37 1 1 dcalph017
296758 e119661 R license cfdace 14./tmp/tmp.GJul 13 10:37 1 1 dcalph045
209097 e119661 R interact VNC /user/e119Mar 7 11:10 1 1 dcalph075
288719 x045157 S open my_2D_P4_1/dat/usr/xJun 19 8:53 113 5 dcalph[061-065]
295763 x045157 R normal my_2D_P4_1/dat/usr/xJul 7 6:53 113 8 dcalph[076-083]
284045 e120594 R license cfdace 14./tmp/tmp.AMay 31 17:28 4 1 dcalph010
290547 e120594 R license cfdace 14./tmp/tmp.XJun 22 15:59 8 3 dcalph[012-013,045]
296360 e120594 R license cfdace 14./tmp/tmp.wJul 11 14:58 8 1 dcalph013
284602 e120594 R license cfdace 14./tmp/tmp.5Jun 2 14:04 4 1 dcalph036
285760 e120594 R license cfdace 14./tmp/tmp.nJun 7 10:30 4 1 dcalph067
284791 e120594 R license cfdace 14./tmp/tmp.hJun 2 18:14 4 1 dcalph069
285766 e120594 R license cfdace 14./tmp/tmp.kJun 7 10:31 4 1 dcalph077
296697 e156059 R license Mechanical/dat/usr/eJul 13 9:22 36 1 dcalph011
210071 e156059 R interact VNC /user/e156Mar 9 11:05 1 1 dcalph075
296744 e156425 R open x892736566/dat/usr/eJul 13 10:07 8 1 dcalph041
296743 e156425 R open x892736566/dat/usr/eJul 13 10:07 8 1 dcalph117
294404 e119448 R interact VNC /user/e119Jul 1 14:11 1 1 dcalph075
294329 e119448 R interact VNC /user/e119Jun 30 15:25 1 1 dcalph075
296736 e157618 PD open VASP 5.4.4/tmp/tmp.hJul 13 10:02 36 1 (Priority)
296740 e157618 PD open VASP 5.4.4/tmp/tmp.7Jul 13 10:05 36 1 (Priority)
296746 e157618 PD open VASP 5.4.4/tmp/tmp.bJul 13 10:10 36 1 (Priority)
296752 e157618 PD open VASP 5.4.4/tmp/tmp.uJul 13 10:22 36 1 (Priority)
296709 e157618 PD normal VASP 5.4.4/tmp/tmp.WJul 13 9:37 36 1 (QOSGrpCpuLimit)
296700 e157618 R normal VASP 5.4.4/tmp/tmp.1Jul 13 9:30 36 1 dcalph015
296706 e157618 R normal VNC /user/e157Jul 13 9:33 1 1 dcalph017
296131 e157618 R normal VASP 5.4.4/tmp/tmp.WJul 10 13:48 35 1 dcalph031
296548 e157618 R open VASP 5.4.4/tmp/tmp.3Jul 12 15:35 36 1 dcalph037
295954 e157618 R open VASP 5.4.4/tmp/tmp.GJul 8 16:00 35 1 dcalph049
296688 e157618 R open VASP 5.4.4/tmp/tmp.QJul 13 9:09 36 1 dcalph054
296724 e157618 R open VASP 5.4.4/tmp/tmp.JJul 13 9:46 36 1 dcalph055
296730 e157618 R open VASP 5.4.4/tmp/tmp.WJul 13 9:50 36 1 dcalph057
296731 e157618 R open VASP 5.4.4/tmp/tmp.TJul 13 9:52 36 1 dcalph059
296722 e157618 R open VASP 5.4.4/tmp/tmp.pJul 13 9:45 36 1 dcalph066
277024 e157653 R interact VNC /user/e157May 7 19:54 1 1 dcalph075
296719 x057829 R open g09 /tmp/tmp.kJul 13 9:43 8 1 dcalph041
296673 x057829 R open g09 /tmp/tmp.nJul 13 8:10 8 1 dcalph074
296670 x057829 R open g09 /tmp/tmp.0Jul 13 7:47 8 1 dcalph074
296739 e158714 PD open insertedsu/tmp/tmp.xJul 13 10:04 36 1 (Resources)
296755 e158714 PD open insertedsu/tmp/tmp.XJul 13 10:28 36 1 (Priority)
296692 e158714 R open insertedsu/tmp/tmp.wJul 13 9:16 36 1 dcalph042
296683 e158714 R open subs_x_sio/tmp/tmp.DJul 13 8:51 36 1 dcalph117
209429 e116763 R interact VNC /user/e116Mar 8 10:22 1 1 dcalph075
-bash-4.1$

Hi,

In the squeue output I notice that 296759 is in the configuring (CF) state. Could you also attach slurmctld.log?

Before version 16.05, assoc_limit_continue was not the default behavior, so the main scheduler does not start a job on nodes from a higher-priority partition if any other job is pending in that higher-priority partition.

Dominik

Created attachment 4924 [details]
slurmctld log requested by Dominik Bartkiewicz in comment #9

(In reply to Dominik Bartkiewicz from comment #9)
> Hi,
>
> on squeue output I notice that 296759 is in configuring state.
> Could you also attach slurmctld.log?

Here it goes.

> Before version 16.05, assoc_limit_continue wasn’t default behavior therefore
> main scheduler doesn't start a job on nodes from higher prio part if any
> different job is pending on that higher prio partition.
>
> Dominik

Hi

According to slurmctld.log, this job was started by the backfill scheduler 16 seconds after submission. This looks like normal Slurm behavior, not something introduced by the patch from https://bugs.schedmd.com/show_bug.cgi?id=3824

Dominik

Dominik,

You are correct. In fact, I was trying to reproduce the behavior of job #296513, which did stay pending for at least several minutes.

Hi

I can't recreate this on my test system. Without more information it will be difficult to find out why 296513 was pending for so long. If it happens again, could you capture the squeue output and slurmctld.log with the SelectType debug flag enabled? An increased debug level would also be useful.

scontrol setdebugflag +SelectType
scontrol setdebugflag -SelectType

Dominik

Hi

Any news?

Dominik

I could not reproduce that bug. Not sure what it was. Anyway, closing it. Thank you! |
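
The data Dominik asked for (squeue output plus slurmctld.log with the SelectType debug flag and a higher debug level) could be collected with something like the sketch below. It is dry-run by default and only prints the commands. The function names are hypothetical, the `debug2` level and the `info` level restored afterwards are assumptions to adjust per site, and the subcommand is written `setdebugflags` as documented in the scontrol man page.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set RUN=1 on a host with Slurm admin rights to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Turn on extra slurmctld logging while the job is still pending.
debug_on() {
    run scontrol setdebugflags +SelectType
    run scontrol setdebug debug2    # assumed level; pick what suits the site
}

# Record queue and node state for one job (job id passed as $1).
snapshot() {
    run squeue -j "$1"
    run scontrol show job "$1"
    run sinfo
}

# Restore normal logging once enough log has been collected.
debug_off() {
    run scontrol setdebugflags -SelectType
    run scontrol setdebug info      # assumed original level
}

debug_on
snapshot 296513
debug_off
```

In practice the flag would stay enabled across a few scheduling passes before `debug_off` is called, so that slurmctld.log actually records why the job is held.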