Ticket 6814

Summary: reason=AssocMaxJobsLimit not correct
Product: Slurm
Reporter: Daniel Grimwood <daniel.grimwood>
Component: Scheduling
Assignee: Director of Support <support>
Status: RESOLVED FIXED
QA Contact: Brian Christiansen <brian>
Severity: 4 - Minor Issue
Priority: ---
CC: arc-slurm-admins, phils, tim
Version: 18.08.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8012
https://bugs.schedmd.com/show_bug.cgi?id=12202
Site: Pawsey
Version Fixed: 19.05.5

Description Daniel Grimwood 2019-04-08 23:51:23 MDT
Hi,

Just looked at this with Tim while he is here; he asked for this to be put in a ticket...

The squeue reason is listed as AssocMaxJobsLimit, but it should probably just be Priority.

With regards,
Daniel.


reaper@magnus-2:~> squeue -u bjin -r
JOBID            USER     ACCOUNT     PARTITION            NAME EXEC_HOST ST     REASON       START_TIME         END_TIME  TIME_LEFT NODES   PRIORITY
3487484_70       bjin     pawsey0205  workq      LQE_Re60_Iter0  nid01244  R       None 2019-04-09T02:32 2019-04-10T02:30   12:51:49     1       5149
3487484_69       bjin     pawsey0205  workq      LQE_Re60_Iter0  nid01328  R       None 2019-04-09T02:18 2019-04-10T02:16   12:37:25     1       5148
3493679_72       bjin     pawsey0205  workq      LQE_Re60_Iter1       n/a PD   Priority              N/A              N/A   23:58:00     1       5120
3493679_73       bjin     pawsey0205  workq      LQE_Re60_Iter1       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5120
3493679_74       bjin     pawsey0205  workq      LQE_Re60_Iter1       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5120
3493679_75       bjin     pawsey0205  workq      LQE_Re60_Iter1       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5120
3493679_76       bjin     pawsey0205  workq      LQE_Re60_Iter1       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5115
3493679_77       bjin     pawsey0205  workq      LQE_Re60_Iter1       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5115
3494357_60       bjin     pawsey0205  workq      LQE_Re110_Iter       n/a PD   Priority              N/A              N/A   23:58:00     1       5115
3494357_62       bjin     pawsey0205  workq      LQE_Re110_Iter       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5115
3494357_63       bjin     pawsey0205  workq      LQE_Re110_Iter       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5108
3494357_65       bjin     pawsey0205  workq      LQE_Re110_Iter       n/a PD AssocMaxJo              N/A              N/A   23:58:00     1       5108
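
(For reference, the REASON column above is truncated to 10 characters by the squeue format in use, hence "AssocMaxJo". An unbounded %r field prints the full reason; something like the following should work, though it's an illustrative format string rather than the exact command we ran:

> squeue -u bjin -r --format="%18i %T %r"

where %i is the job ID, %T the job state in extended form, and %r the pending reason.)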


reaper@magnus-2:~> scontrol show job 3493679_73
JobId=3493681 ArrayJobId=3493679 ArrayTaskId=73 JobName=LQE_Re60_Iter1
   UserId=bjin(24886) GroupId=bjin(24886) MCS_label=N/A
   Priority=5120 Nice=0 Account=pawsey0205 QOS=normal
   JobState=PENDING Reason=AssocMaxJobsLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:58:00 TimeMin=N/A
   SubmitTime=2019-04-09T09:23:17 EligibleTime=2019-04-09T09:23:17
   AccrueTime=2019-04-09T09:23:17
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-09T13:38:43
   Partition=workq AllocNode:Sid=magnus-1:19079
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=1,mem=60000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=60000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/pawsey0205/bjin/2019/LQE/Re060/task1/test.slurm
   WorkDir=/scratch/pawsey0205/bjin/2019/LQE/Re060/task1
   StdErr=/scratch/pawsey0205/bjin/2019/LQE/Re060/task1/LQE_Re60_Iter1-73.err
   StdIn=/dev/null
   StdOut=/scratch/pawsey0205/bjin/2019/LQE/Re060/task1/LQE_Re60_Iter1-73.out
   Power=

Snippets of scontrol show assoc (the value in parentheses after each limit is the current usage counted against it; N means no limit is set at that level)...

ClusterName=magnus Account=root UserName= Partition= ID=1
    SharesRaw/Norm/Level/Factor=1/1.00/1/0.00
    UsageRaw/Norm/Efctv=44437177776.00/1.00/1.00
    ParentAccount= Lft=1 DefAssoc=No
    GrpJobs=N(366) GrpJobsAccrue=N(358)
    GrpSubmitJobs=N(1383) GrpWall=N(2801042.05)
    GrpTRES=cpu=N(70656),mem=N(88320000),energy=N(0),node=N(1472),billing=N(70656),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESMins=cpu=N(740619629),mem=N(922772679827),energy=N(0),node=N(15429575),billing=N(740619629),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESRunMins=cpu=N(49269803),mem=N(61587254000),energy=N(0),node=N(1026454),billing=N(49269803),fs/disk=N(0),vmem=N(0),pages=N(0)
    MaxJobs=32(366) MaxJobsAccrue= MaxSubmitJobs=128(1383) MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=

ClusterName=magnus Account=pawsey0205 UserName= Partition= ID=6667
    SharesRaw/Norm/Level/Factor=125000/0.00/71140239/0.00
    UsageRaw/Norm/Efctv=147909552.00/0.00/0.00
    ParentAccount=root(1) Lft=3580 DefAssoc=No
    GrpJobs=N(2) GrpJobsAccrue=N(16)
    GrpSubmitJobs=N(12) GrpWall=N(51357.48)
    GrpTRES=cpu=N(96),mem=N(120000),energy=N(0),node=N(2),billing=N(96),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESMins=cpu=N(2465159),mem=N(3081449000),energy=N(0),node=N(51357),billing=N(2465159),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESRunMins=cpu=N(73977),mem=N(92472000),energy=N(0),node=N(1541),billing=N(73977),fs/disk=N(0),vmem=N(0),pages=N(0)
    MaxJobs=32(2) MaxJobsAccrue= MaxSubmitJobs=128(12) MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=

ClusterName=magnus Account=pawsey0205 UserName=bjin(24886) Partition=workq ID=7401
    SharesRaw/Norm/Level/Factor=2147483647/0.00/0/0.00
    UsageRaw/Norm/Efctv=147909552.00/0.00/1.00
    ParentAccount= Lft=3587 DefAssoc=Yes
    GrpJobs=N(2) GrpJobsAccrue=N(16)
    GrpSubmitJobs=N(12) GrpWall=N(51357.48)
    GrpTRES=cpu=N(96),mem=N(120000),energy=N(0),node=N(2),billing=N(96),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESMins=cpu=N(2465159),mem=N(3081449000),energy=N(0),node=N(51357),billing=N(2465159),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESRunMins=cpu=N(73977),mem=N(92472000),energy=N(0),node=N(1541),billing=N(73977),fs/disk=N(0),vmem=N(0),pages=N(0)
    MaxJobs=32(2) MaxJobsAccrue=16(16) MaxSubmitJobs=512(12) MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=
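
(The snippets above were cut from the full dump by hand; if it's useful, scontrol can filter the association records directly, along the lines of:

> scontrol show assoc_mgr users=bjin accounts=pawsey0205 flags=assoc

per the scontrol man page, though I haven't double-checked that exact invocation on 18.08.)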
Comment 1 Michael Hinton 2019-04-09 17:28:28 MDT
Hi Daniel,

Let me see if I understand correctly: you and Tim expect that only the highest-priority pending job should show `AssocMaxJo`, while all lesser-priority pending jobs should show `Priority`, right?

Could you attach your slurm.conf and if possible indicate the job arguments necessary to reproduce this?

Thanks,
Michael
Comment 3 Daniel Grimwood 2019-04-09 19:30:22 MDT
Hi Michael,

No, none of them should be AssocMaxJobsLimit. The association and its parents all have MaxJobs=32, with only 2 jobs running.
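
(The limits are quick to confirm from sacctmgr; a sketch, assuming the field names from the man page:

> sacctmgr show assoc where user=bjin format=Cluster,Account,User,MaxJobs,MaxSubmit
> sacctmgr show assoc where account=pawsey0205 format=Cluster,Account,User,MaxJobs,MaxSubmit

which on this system should show MaxJobs=32 at both levels.)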

With regards,
Daniel.
Comment 5 Michael Hinton 2019-04-11 13:58:37 MDT
Hi Daniel,

We are still trying to reproduce this. Could you show the output of the following commands?

    sacctmgr show assoc
    sacctmgr show qos

Was the array job requeued? Is there anything else notable about the job?

Thanks,
Michael
Comment 6 Michael Hinton 2019-04-11 14:04:31 MDT
I should say: were any of the array jobs requeued, and is there anything else notable about any of them?
Comment 7 Daniel Grimwood 2019-04-11 21:55:47 MDT
Hi Michael,

No, the jobs haven't been requeued.
> sacct --start=2019-04-01 -j 3493679_73,3493679_74,3494357_62 --duplicates
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3493679_73   LQE_Re60_+      workq pawsey0205         48  COMPLETED      0:0
3493679_73.+      batch            pawsey0205         24  COMPLETED      0:0
3493679_73.+     extern            pawsey0205         48  COMPLETED      0:0
3493679_73.0     python            pawsey0205         24  COMPLETED      0:0
3493679_73.1     python            pawsey0205         24  COMPLETED      0:0
3493679_74   LQE_Re60_+      workq pawsey0205         48  COMPLETED      0:0
3493679_74.+      batch            pawsey0205         24  COMPLETED      0:0
3493679_74.+     extern            pawsey0205         48  COMPLETED      0:0
3493679_74.0     python            pawsey0205         24  COMPLETED      0:0
3493679_74.1     python            pawsey0205         24  COMPLETED      0:0
3494357_62   LQE_Re110+      workq pawsey0205         48  COMPLETED      0:0
3494357_62.+      batch            pawsey0205         24  COMPLETED      0:0
3494357_62.+     extern            pawsey0205         48  COMPLETED      0:0

The user is being a bit specific about the array indices in the jobscript (--array=72-77:1 requests task IDs 72 through 77 with an explicit step of 1, equivalent to --array=72-77); apart from that it's normal...
#!/bin/bash --login
#SBATCH --array=72-77:1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=23:58:00
#SBATCH --account=pawsey0205
#SBATCH --job-name=LQE_Re60_Iter1
#SBATCH --error=%x-%a.err
#SBATCH --output=%x-%a.out
#SBATCH --export=NONE

> sacctmgr show qos
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
    normal      10000   00:00:00                cluster                                                        1.000000                                                                                                                                                                                                                cpu=1
disciplin+          0   00:00:00                cluster                                                        1.000000                                                                                                                                                      0                                                         cpu=1
   longqos          0   00:00:00                cluster   PartitionTimeLimit,RequiresReservation               1.000000                                                                                                                    12-00:00:00                                                                                 cpu=1
   benthos       3000   00:00:00                cluster                                                        1.000000                                                                                                                                                                                                                cpu=1
   askaprt      20000   00:00:00                cluster                                                        1.000000                                                                                                 
pawseyove+      20000   00:00:00                cluster                        PartitionMaxNodes               1.000000                                                                                                 
  prohibit          0   00:00:00                cluster                                                        1.000000                                                                                                                                                      0
   highqos      20000   00:00:00                cluster                                                        1.000000                                                                                                 
    exempt      10000   00:00:00                cluster                                                        1.000000                                                                                                 

I am not keen on attaching the whole associations output; is there anything in particular you would like that isn't already in the ticket?

With regards,
Daniel.
Comment 10 Michael Hinton 2019-04-12 14:42:33 MDT
Hi Daniel,

So it appears that when an array job is broken out into individual job records, only the reason field of the first job of the array (i.e. the meta job) gets updated in a proper and timely manner.

Did you at any point have a situation where the limit _was_ hit beforehand? I was able to reproduce what you saw by temporarily setting a MaxJobs limit low, and then increasing it again. The meta jobs properly reacted to the changes, while the non-meta jobs still showed the old reason of AssocMaxJobsLimit from when I set the limit low.
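
For the record, the reproduction was roughly the following (a sketch; the account name and limit values are illustrative, not taken from your site):

# Set MaxJobs below the number of running jobs so new tasks hit AssocMaxJobsLimit
> sacctmgr -i modify account name=test_acct set MaxJobs=1
# Submit an array; tasks beyond the limit go pending with Reason=AssocMaxJobsLimit
> sbatch --array=0-5 --wrap="sleep 600"
# Raise the limit again: the meta job's reason updates, but the broken-out tasks keep the stale reason
> sacctmgr -i modify account name=test_acct set MaxJobs=32
> squeue -r --format="%18i %T %r"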

I haven't seen anything to indicate that this is anything more than a display issue that e.g. could affect scheduling.

I'll continue to look into it deeper to find the underlying cause.

Thanks,
-Michael
Comment 11 Michael Hinton 2019-04-12 14:49:16 MDT
(In reply to Michael Hinton from comment #10)
> I haven't seen anything to indicate that this is anything more than a
> display issue that e.g. could affect scheduling.
Sorry, the wording was a little ambiguous. This issue shouldn't affect scheduling. It's only a display issue.
Comment 12 Daniel Grimwood 2019-04-14 22:01:08 MDT
Thanks, Michael. Yes, we're confident it's a display issue only.

We have not changed MaxJobs recently. The jobs were submitted a week after we upgraded from Slurm 17.11.9 to 18.08.6; during the upgrade we increased MaxSubmitJobs and introduced MaxJobsAccrue, but did not change MaxJobs.
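
(For completeness, the upgrade-time limit changes would have been along the lines of the sketch below; I'm reconstructing the command from the association dump above rather than from our change log:

> sacctmgr -i modify user name=bjin account=pawsey0205 set MaxSubmitJobs=512 MaxJobsAccrue=16

with MaxJobs left untouched.)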

The only oddity is in how the user is submitting the array elements. I've sorted the output below...
> sacct --start 2019-04-09 --end 2019-04-10 --format=jobid,state,submit -A pawsey0205 -a -X
       JobID      State              Submit
------------ ---------- -------------------
3487484_64    COMPLETED 2019-04-08T08:25:31
3487484_65    COMPLETED 2019-04-08T08:25:31
3487484_66    COMPLETED 2019-04-08T08:25:31
3487484_67    COMPLETED 2019-04-08T08:25:31
3487484_68    COMPLETED 2019-04-08T08:25:31
3487484_69    COMPLETED 2019-04-08T08:25:31
3487484_70    COMPLETED 2019-04-08T08:25:31
3487484_71    COMPLETED 2019-04-08T08:25:31
3490247_72   CANCELLED+ 2019-04-08T22:07:20
3490247_73   CANCELLED+ 2019-04-08T22:07:20
3490247_74   CANCELLED+ 2019-04-08T22:07:20
3490247_75   CANCELLED+ 2019-04-08T22:07:20
3490247_76   CANCELLED+ 2019-04-08T22:07:20
3490247_77   CANCELLED+ 2019-04-08T22:07:20
3493642_60   CANCELLED+ 2019-04-09T09:16:06
3493642_62   CANCELLED+ 2019-04-09T09:16:06
3493642_63   CANCELLED+ 2019-04-09T09:16:06
3493642_65   CANCELLED+ 2019-04-09T09:16:06
3493679_72    COMPLETED 2019-04-09T09:23:17
3493679_73    COMPLETED 2019-04-09T09:23:17
3493679_74    COMPLETED 2019-04-09T09:23:17
3493679_75    COMPLETED 2019-04-09T09:23:17
3493679_76    COMPLETED 2019-04-09T09:23:17
3493679_77    COMPLETED 2019-04-09T09:23:17
3494357_60    COMPLETED 2019-04-09T11:13:50
3494357_62    COMPLETED 2019-04-09T11:13:50
3494357_63    COMPLETED 2019-04-09T11:13:50
3494357_65    COMPLETED 2019-04-09T11:13:50
3495628_64    COMPLETED 2019-04-09T15:10:05
3495628_65    COMPLETED 2019-04-09T15:10:05
3495628_66    COMPLETED 2019-04-09T15:10:05
3495628_67    COMPLETED 2019-04-09T15:10:05
3495628_68    COMPLETED 2019-04-09T15:10:05
3495628_69    COMPLETED 2019-04-09T15:10:05
3495628_70    COMPLETED 2019-04-09T15:10:05
3495628_71    COMPLETED 2019-04-09T15:10:05

With regards,
Daniel.
Comment 13 Michael Hinton 2019-04-17 10:25:50 MDT
Hey Daniel,

So there are two separate issues here:

1) Why does the first job in the job array correctly show `Priority` but subsequent broken-out jobs don't?
2) Why was the reason ever set to AssocMaxJobsLimit in the first place?

For 1), the expected behavior is that the reasons for all the jobs in the job array should match (at least in the vast majority of cases). I've identified what's causing this, and expect to get a patch out shortly. I will keep you updated.
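
(Once the patch is in, an easy sanity check is that the reason agrees across every task of a pending array, e.g. with a hypothetical <array_job_id>:

> squeue -r -j <array_job_id> --format="%18i %T %r"

where all rows should report the same reason.)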

For 2), based on the information you've given, it's hard to say exactly what caused it. But the fix for 1) would effectively hide it. Since it's not really affecting anything other than display, it sounds like it's not at the center of this ticket.

Thanks!
Michael
Comment 16 Michael Hinton 2019-08-13 10:12:38 MDT
Hey Daniel,

I have a patch for issue 1) waiting for review. While we wait, would you like me to share it with you so you can try it out?

At this stage, the patch will likely go into 19.05 or 20.02 instead of 18.08.

Thanks,
-Michael
Comment 20 Daniel Grimwood 2019-08-16 01:10:38 MDT
Thanks Michael.

We'd prefer to wait for it to hit a release rather than apply the patch. No worries about it not going into 18.08.

With regards,
Daniel.
Comment 25 Michael Hinton 2019-12-10 09:30:09 MST
Hi Daniel,

The wrong array task reason bug has now been fixed in commit https://github.com/SchedMD/slurm/commit/ee3d4715f0071725a4b3ea260889427512778c08.

As for the completely random array task reason showing up in the first place: there are a few other customers who have experienced that issue as well. That is now being tracked in bug 8012, so if you see it again, please continue the discussion there. However, the commit above would probably mask that bug.
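
(To check whether a given release contains that commit, git can list the tags that include it, e.g. from a slurm checkout:

> git tag --contains ee3d4715f0071725a4b3ea260889427512778c08

which should list the 19.05.5 and later tags, per the Version Fixed field above.)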

Thanks,
Michael
Comment 26 Douglas Wightman 2019-12-18 09:31:30 MST
*** Ticket 8012 has been marked as a duplicate of this ticket. ***
Comment 27 Broderick Gardner 2019-12-19 09:40:51 MST
*** Ticket 7966 has been marked as a duplicate of this ticket. ***