Ticket 1454

Summary:	fairtree breaks sprio
Product:	Slurm	Reporter:	Bill Wichser <bill>
Component:	Other	Assignee:	David Bigagli <david>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	brian, da
Version:	14.11.3
Hardware:	Linux
OS:	Linux
Site:	Princeton (PICSciE)	Slinky Site:	---
Attachments:	slurm.conf scontrol-show-config

Description Bill Wichser 2015-02-12 04:52:12 MST

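A user has pointed out that since we moved to the fairtree scheduler, sprio displays useless info: the priorities it reports do not match what scontrol shows (for example, job 517845 below has PRIORITY 1 in sprio but Priority=1858 in scontrol).  Is this expected?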

[bill@tiger2 ~]$ sprio | head -10
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
         517845          1          1          0          0          0          1
         521553          1          1          0          0          0          0
         521804          1          1          0          0          0          0
         551462          2          1          0          0          0          1
         551501          2          1          0          0          0          1
         551540          2          1          0          0          0          1
         551579          2          1          0          0          0          1
         551617          2          1          0          0          0          1
         551618          2          1          0          0          0          1
[bill@tiger2 ~]$ scontrol show job 517845
JobId=517845 JobName=9376.myscript
   UserId=boxiao(88155) GroupId=chan(30078)
   Priority=1858 Nice=0 Account=chem QOS=tiger-medium
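
One way to quantify the mismatch above is to compare sprio's PRIORITY column with the Priority= field that scontrol reports for the same job. The following is a minimal Python sketch, assuming the column layout shown above. The function names are hypothetical, and the command output is pasted in as text rather than fetched from Slurm:

```python
import re

def parse_sprio(text):
    """Map job ID -> PRIORITY from plain `sprio` output (header row skipped)."""
    priorities = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; the header row does not.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            priorities[int(fields[0])] = int(fields[1])
    return priorities

def parse_scontrol_priority(text):
    """Return the Priority= value from `scontrol show job <id>` output, or None."""
    match = re.search(r"\bPriority=(\d+)", text)
    return int(match.group(1)) if match else None

def mismatch(job_id, sprio_text, scontrol_text):
    """Return (sprio_priority, scontrol_priority) if they differ, else None."""
    shown = parse_sprio(sprio_text).get(job_id)
    actual = parse_scontrol_priority(scontrol_text)
    return (shown, actual) if shown != actual else None

# Sample text copied from the transcript in this ticket.
sprio_out = """\
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
         517845          1          1          0          0          0          1
         521553          1          1          0          0          0          0
"""
scontrol_out = """\
JobId=517845 JobName=9376.myscript
   UserId=boxiao(88155) GroupId=chan(30078)
   Priority=1858 Nice=0 Account=chem QOS=tiger-medium
"""
print(mismatch(517845, sprio_out, scontrol_out))  # (1, 1858)
```

Feeding it live output (for instance captured with subprocess) would let the same check run over every pending job, but that part is left out here because it depends on the site setup.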

Comment 1 Brian Christiansen 2015-02-17 09:21:53 MST

Will you send your slurm.conf?

Comment 2 Bill Wichser 2015-02-17 23:10:44 MST

Attaching to case...Bill

Comment 3 Bill Wichser 2015-02-17 23:11:40 MST

Created attachment 1650 [details]
slurm.conf

Comment 4 Brian Christiansen 2015-02-18 08:42:58 MST

I haven't been able to reproduce what you are seeing.

ex.
brian@compy:~/slurm/14.11/compy$ sprio
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
            184       7681         15       6667       1000          0
            185       7681         15       6667       1000          0
            186       7681         15       6667       1000          0
            187       7681         15       6667       1000          0
            188       7681         15       6667       1000          0
            190       2673          7       1667       1000          0
            192       2673          7       1667       1000          0
            195       2673          7       1667       1000          0
            197       2668          1       1667       1000          0
            198       2668          1       1667       1000          0
            199       2668          1       1667       1000          0
            200       2668          1       1667       1000          0
            201       2668          1       1667       1000          0
brian@compy:~/slurm/14.11/compy$ scontrol show job 184 | grep Prio
   Priority=7681 Nice=0 Account=test2 QOS=normal

Is sprio aliased to add default parameters by chance? 
Is the slurm.conf the same on both the client and the slurmctld side? 
sprio actually reads the weights from slurm.conf and uses them when normalizing the output (e.g. sprio -n). I don't think that is what is happening here, since you would then be seeing floating point numbers.

Will you also send the output of "scontrol show config".

FYI, I noticed that you have the NO_CONF_HASH debug flag set. We highly recommend not setting it, as it can mask other issues.

Comment 5 Bill Wichser 2015-02-18 23:25:19 MST

sprio calls go straight to the binary; nothing is wrapped around them.
slurm.conf is handled by puppet here and is the same across all machines.

Weights are being read:
[root@della4 siemann]# sprio -w
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
        Weights                  1000      10000      10000      10000

[root@della4 siemann]# sprio -n | head -10
          JOBID PRIORITY   AGE        FAIRSHARE  JOBSIZE    QOS       
        2802880 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2802881 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2802882 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2802883 0.00000000 0.0010000  0.0000661  0.0000004  0.0000600 
        2802884 0.00000000 0.0010000  0.0000661  0.0000004  0.0000600 
        2811756 0.00000000 0.0010000  0.0000661  0.0000004  0.0000600 
        2811757 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2811758 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 
        2811759 0.00000000 0.0010000  0.0000661  0.0000003  0.0000600 



[root@della4 siemann]# scontrol show job 2802880
JobId=2802880 JobName=sta1_eco_chip_tinyoram_stash
   UserId=wentzlaf(98836) GroupId=ee(30012)
   Priority=12946 Nice=0 Account=ee QOS=short


This was after a reconfig where I removed the NO_CONF_HASH.

I'll attach the show config.
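
As a rough cross-check of the della numbers above: if each weighted factor is taken to be the configured weight times the normalized value that sprio -n prints (an assumption about how sprio relates the two, not something confirmed in this ticket), the factors for job 2802880 add up to only a couple of points, nowhere near the Priority=12946 that scontrol reports. A small Python sketch of that arithmetic, with illustrative names:

```python
# Weights as printed by `sprio -w` on della.
weights = {"AGE": 1000, "FAIRSHARE": 10000, "JOBSIZE": 10000, "QOS": 10000}

# Normalized factors for job 2802880 as printed by `sprio -n`.
normalized = {"AGE": 0.0010000, "FAIRSHARE": 0.0000661,
              "JOBSIZE": 0.0000003, "QOS": 0.0000600}

def weighted_factors(weights, normalized):
    """Assumed relation: weighted factor = weight * normalized factor."""
    return {name: weights[name] * normalized[name] for name in weights}

contributions = weighted_factors(weights, normalized)
total = sum(contributions.values())
scontrol_priority = 12946  # Priority= from `scontrol show job 2802880`

for name, value in contributions.items():
    print(f"{name:<10}{value:8.3f}")
print(f"{'TOTAL':<10}{total:8.3f}  (scontrol reports {scontrol_priority})")
```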

Comment 6 Bill Wichser 2015-02-18 23:26:02 MST

Created attachment 1652 [details]
scontrol-show-config

Comment 7 Brian Christiansen 2015-02-19 08:03:05 MST

Will you set your SlurmctldDebug to info, let it run for a little while, and then send the logs? Please leave the Priority DebugFlag set.

Will you also run:
sshare -l
sprio -l
scontrol show jobs
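
For reference, the logging settings requested above would look roughly like this in slurm.conf. This is only a sketch: any other DebugFlags values already configured at the site would stay in the same comma-separated list, and slurmctld has to re-read the file before the change takes effect.

```
SlurmctldDebug=info
DebugFlags=Priority
```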

FYI. Bug 1469 is experiencing the same issue.

Comment 8 Brian Christiansen 2015-02-19 08:14:01 MST

And the output of "sacctmgr show assoc tree" as well please.

Comment 9 Bill Wichser 2015-02-19 11:51:08 MST

Brian,

I can probably give you access to the machine.  Would that help?

Bill


Comment 10 Brian Christiansen 2015-02-19 13:26:16 MST

Ya, we can try that. Will I be able to get logs if I need to? I can just ask you if needed. What do you need from me?

Comment 11 Bill Wichser 2015-02-20 00:51:38 MST

Strangely enough, today all is well on this della cluster.  The tiger cluster continues to display the odd behavior though.  I may need to update this case with new values for cluster #2.

Comment 12 Bill Wichser 2015-02-20 00:53:55 MST

And a restart of slurmctld there has fixed this one as well.  Strange, since I did this before and things did not self-correct.  I'm going to sit on this over the weekend (but log a different problem in the meantime!)

Thanks

Comment 13 Bill Wichser 2015-02-23 00:01:48 MST

Okay, let's close this one now.  No idea why it suddenly works after that last restart of the daemons, since I had already restarted them when we first noticed the issue.  So let's assume that this is some user-caused issue and put it away!

Thanks...Bill

Comment 14 Brian Christiansen 2015-02-23 02:35:20 MST

OK. Let us know if it pops up again.

Comment 15 Brian Christiansen 2015-04-01 06:44:26 MDT

Hey Bill,

Just letting you know that several fixes for the priority problem went into 14.11.6.

https://github.com/SchedMD/slurm/commit/b2be6159517b197188b474a9477698f6f5edf480
https://github.com/SchedMD/slurm/commit/f61475e8005bb9d834f8e227b0dc4282cc1f0aee

Thanks,
Brian