| Summary: | fairtree breaks sprio | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Wichser <bill> | 
| Component: | Other | Assignee: | David Bigagli <david> | 
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian, da | 
| Version: | 14.11.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Princeton (PICSciE) | Slinky Site: | --- | 
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- | 
| Attachments: | slurm.conf scontrol-show-config | ||
| Will you send your slurm.conf?

Attaching to case...Bill

Created attachment 1650
slurm.conf

I haven't been able to reproduce what you are seeing.
ex.
brian@compy:~/slurm/14.11/compy$ sprio
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
            184       7681         15       6667       1000          0
            185       7681         15       6667       1000          0
            186       7681         15       6667       1000          0
            187       7681         15       6667       1000          0
            188       7681         15       6667       1000          0
            190       2673          7       1667       1000          0
            192       2673          7       1667       1000          0
            195       2673          7       1667       1000          0
            197       2668          1       1667       1000          0
            198       2668          1       1667       1000          0
            199       2668          1       1667       1000          0
            200       2668          1       1667       1000          0
            201       2668          1       1667       1000          0
brian@compy:~/slurm/14.11/compy$ scontrol show job 184 | grep Prio
   Priority=7681 Nice=0 Account=test2 QOS=normal
Is sprio aliased to add default parameters by chance? 
Is the slurm.conf the same on both the client and the slurmctld side? 
sprio actually reads in the weights from the slurm.conf and uses them when normalizing the output (e.g. sprio -n). I don't think this is your case, since you would then be seeing floating point numbers.
Will you also send the output of "scontrol show config"?
FYI, I noticed that you have NO_CONF_HASH set. We highly recommend not setting that, as it can mask other issues.

sprio calls go straight to the binary; nothing is wrapped around it.
slurm.conf is handled by puppet here and is the same across all machines.
Weights are being read:
[root@della4 siemann]# sprio -w
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
        Weights                  1000      10000      10000      10000
[root@della4 siemann]# sprio -n | head -10
          JOBID PRIORITY   AGE        FAIRSHARE  JOBSIZE    QOS       
        2802880 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2802881 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2802882 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2802883 0.00000000 0.0010000  0.0000661  0.0000004  0.0000600 
        2802884 0.00000000 0.0010000  0.0000661  0.0000004  0.0000600 
        2811756 0.00000000 0.0010000  0.0000661  0.0000004  0.0000600 
        2811757 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2811758 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2811759 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
[root@della4 siemann]# scontrol show job 2802880
JobId=2802880 JobName=sta1_eco_chip_tinyoram_stash
   UserId=wentzlaf(98836) GroupId=ee(30012)
   Priority=12946 Nice=0 Account=ee QOS=short
This was after a reconfig where I removed the NO_CONF_HASH.
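The numbers above can be sanity-checked by recomputing the priority by hand. The sketch below assumes the multifactor plugin's usual weighted sum (each weight times its normalized factor, with nice and any other terms ignored) and simply hard-codes the values from the `sprio -w` and `sprio -n` output for job 2802880. The function name `weighted_priority` is made up for illustration; nothing here calls Slurm.

```python
# Weights from `sprio -w` and normalized factors from `sprio -n` for job
# 2802880, copied from the output quoted in this report.
WEIGHTS = {"AGE": 1000, "FAIRSHARE": 10000, "JOBSIZE": 10000, "QOS": 10000}
FACTORS = {
    "AGE": 0.0010000,
    "FAIRSHARE": 0.0000661,
    "JOBSIZE": 0.0000003,
    "QOS": 0.0000600,
}
SCONTROL_PRIORITY = 12946  # Priority= from `scontrol show job 2802880`


def weighted_priority(weights, factors):
    """Sum weight * factor over all factors (assumed formula, nice ignored)."""
    return sum(weights[name] * factors[name] for name in weights)


if __name__ == "__main__":
    total = weighted_priority(WEIGHTS, FACTORS)
    print(f"recomputed from sprio: {total:.3f}")
    print(f"scontrol reports:      {SCONTROL_PRIORITY}")
```

With these inputs the recomputed value comes out at roughly 2.26, nowhere near the 12946 that scontrol reports, which is the same mismatch visible in the raw output.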
I'll attach the show config.

Created attachment 1652
scontrol-show-config

Will you set your SlurmctldDebug to info and let it run for a little bit and then send the logs? Please leave the Priority DebugFlag set. Will you also run:
sshare -l
sprio -l
scontrol show jobs

FYI. Bug 1469 is experiencing the same issue.

And the output of "sacctmgr show assoc tree" as well please.

Brian, I can probably give you access to the machine. Would that help?
Bill
On 2/19/2015 5:14 PM, bugs@schedmd.com wrote:
> Comment # 8 <http://bugs.schedmd.com/show_bug.cgi?id=1454#c8> on bug 1454 <http://bugs.schedmd.com/show_bug.cgi?id=1454> from Brian Christiansen <mailto:brian@schedmd.com>
>
> And the output of "sacctmgr show assoc tree" as well please.

Ya, we can try that. Will I be able to get logs if I need to? I can just ask you if needed. What do you need from me?

Strangely enough, today all is well on this della cluster. The tiger cluster continues to display the odd behavior though. I may need to update this case with new values for cluster #2.

And a restart there of the slurmctld has fixed this one as well. Strange, since I did this before and things did not self-correct. I'm going to sit on this over the weekend (but log a different problem in the meantime!)
Thanks

Okay, let's close this one now. No idea why it suddenly works after that last restart of the daemons. I did do this already when we first noticed the issue. So let's assume that this is some user-caused issue and put it away!
Thanks...Bill

OK. Let us know if it pops up again.

Hey Bill,
Just letting you know that several fixes for the priority problem went into 14.11.6.
https://github.com/SchedMD/slurm/commit/b2be6159517b197188b474a9477698f6f5edf480
https://github.com/SchedMD/slurm/commit/f61475e8005bb9d834f8e227b0dc4282cc1f0aee
Thanks,
Brian |
A user has pointed out that, since we moved to the fairtree scheduler, sprio displays useless info. Is this expected?
[bill@tiger2 ~]$ sprio | head -10
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
         517845          1          1          0          0          0          1
         521553          1          1          0          0          0          0
         521804          1          1          0          0          0          0
         551462          2          1          0          0          0          1
         551501          2          1          0          0          0          1
         551540          2          1          0          0          0          1
         551579          2          1          0          0          0          1
         551617          2          1          0          0          0          1
         551618          2          1          0          0          0          1
[bill@tiger2 ~]$ scontrol show job 517845
JobId=517845 JobName=9376.myscript
   UserId=boxiao(88155) GroupId=chan(30078)
   Priority=1858 Nice=0 Account=chem QOS=tiger-medium
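A quick way to spot affected jobs is to compare the PRIORITY column from `sprio` with the `Priority=` field from `scontrol show job`. The following is only a sketch that works on captured text (the sample strings are taken from the output quoted in this report); the helper names `sprio_priorities` and `scontrol_priority` are hypothetical, and the script does not call Slurm itself.

```python
import re

# Captured output, copied from the report (first two sprio rows only).
SPRIO_TEXT = """\
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
         517845          1          1          0          0          0          1
         521553          1          1          0          0          0          0
"""
SCONTROL_TEXT = """\
JobId=517845 JobName=9376.myscript
   UserId=boxiao(88155) GroupId=chan(30078)
   Priority=1858 Nice=0 Account=chem QOS=tiger-medium
"""


def sprio_priorities(text):
    """Map job id -> PRIORITY column from default `sprio` output."""
    lines = text.strip().splitlines()
    col = lines[0].split().index("PRIORITY")
    result = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip non-job rows such as the "Weights" line printed by `sprio -w`.
        if fields and fields[0].isdigit():
            result[int(fields[0])] = int(fields[col])
    return result


def scontrol_priority(text):
    """Return (job id, Priority) parsed from `scontrol show job` output."""
    job = int(re.search(r"\bJobId=(\d+)", text).group(1))
    prio = int(re.search(r"\bPriority=(\d+)", text).group(1))
    return job, prio


if __name__ == "__main__":
    job, ctl_prio = scontrol_priority(SCONTROL_TEXT)
    sprio_prio = sprio_priorities(SPRIO_TEXT)[job]
    if sprio_prio != ctl_prio:
        print(f"job {job}: sprio says {sprio_prio}, scontrol says {ctl_prio}")
```

For job 517845 this flags the disagreement between the value of 1 shown by sprio and the 1858 reported by scontrol.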