Summary: | fairtree breaks sprio | ||
---|---|---|---|
Product: | Slurm | Reporter: | Bill Wichser <bill> |
Component: | Other | Assignee: | David Bigagli <david> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | brian, da |
Version: | 14.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Princeton (PICSciE) | ||
Attachments: |
slurm.conf
scontrol-show-config |
Description
Bill Wichser
2015-02-12 04:52:12 MST
Will you send your slurm.conf?

Attaching to case...Bill

Created attachment 1650: slurm.conf
I haven't been able to reproduce what you are seeing. ex.

    brian@compy:~/slurm/14.11/compy$ sprio
      JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
        184       7681         15       6667       1000          0
        185       7681         15       6667       1000          0
        186       7681         15       6667       1000          0
        187       7681         15       6667       1000          0
        188       7681         15       6667       1000          0
        190       2673          7       1667       1000          0
        192       2673          7       1667       1000          0
        195       2673          7       1667       1000          0
        197       2668          1       1667       1000          0
        198       2668          1       1667       1000          0
        199       2668          1       1667       1000          0
        200       2668          1       1667       1000          0
        201       2668          1       1667       1000          0
    brian@compy:~/slurm/14.11/compy$ scontrol show job 184 | grep Prio
       Priority=7681 Nice=0 Account=test2 QOS=normal

Is sprio aliased to add default parameters by chance?

Is the slurm.conf the same on both the client and the slurmctld side? sprio actually reads in the weights from the slurm.conf and uses them when normalizing the output (ex. sprio -n). I don't think this is your case since you would be seeing floating point numbers.

Will you also send the output of "scontrol show config"?

FYI, I noticed that you have NO_CONF_HASH set. We highly recommend not setting that, as it can mask other issues.

sprio calls are to the binary and have nothing wrapped around it. slurm.conf is handled by puppet here and is the same across all machines.
Weights are being read:

    [root@della4 siemann]# sprio -w
      JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
      Weights                1000      10000      10000      10000

    [root@della4 siemann]# sprio -n | head -10
      JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
    2802880 0.00000000  0.0010000  0.0000661  0.0000003  0.0000600
    2802881 0.00000000  0.0010000  0.0000661  0.0000003  0.0000600
    2802882 0.00000000  0.0010000  0.0000661  0.0000003  0.0000600
    2802883 0.00000000  0.0010000  0.0000661  0.0000004  0.0000600
    2802884 0.00000000  0.0010000  0.0000661  0.0000004  0.0000600
    2811756 0.00000000  0.0010000  0.0000661  0.0000004  0.0000600
    2811757 0.00000000  0.0010000  0.0000661  0.0000003  0.0000600
    2811758 0.00000000  0.0010000  0.0000661  0.0000003  0.0000600
    2811759 0.00000000  0.0010000  0.0000661  0.0000003  0.0000600

    [root@della4 siemann]# scontrol show job 2802880
    JobId=2802880 JobName=sta1_eco_chip_tinyoram_stash
       UserId=wentzlaf(98836) GroupId=ee(30012)
       Priority=12946 Nice=0 Account=ee QOS=short

This was after a reconfig where I removed the NO_CONF_HASH. I'll attach the show config.

Created attachment 1652: scontrol-show-config
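To make the mismatch in the output above concrete, here is a minimal Python sketch that recombines the normalized factors from `sprio -n` with the weights from `sprio -w`. It assumes the four weights printed in this thread line up with the AGE, FAIRSHARE, JOBSIZE and QOS columns, and it uses a simplified form of the multifactor weighted sum that leaves out any partition or other components not shown in this thread. The function name `weighted_priority` is hypothetical, not part of Slurm.

```python
# Weights as printed by `sprio -w` in this thread.
# Assumption: the four values map to AGE, FAIRSHARE, JOBSIZE, QOS in that order.
WEIGHTS = {"age": 1000, "fairshare": 10000, "jobsize": 10000, "qos": 10000}

# Normalized factors for job 2802880 as printed by `sprio -n` in this thread.
FACTORS = {
    "age": 0.0010000,
    "fairshare": 0.0000661,
    "jobsize": 0.0000003,
    "qos": 0.0000600,
}


def weighted_priority(weights, factors, nice=0):
    """Sum weight * factor over the listed components, minus the nice value.

    Simplified sketch: components not shown in the thread are ignored.
    """
    return sum(weights[name] * factors[name] for name in weights) - nice


recomputed = weighted_priority(WEIGHTS, FACTORS)
reported = 12946  # Priority= from `scontrol show job 2802880` in this thread

print(f"recomputed from sprio -n and sprio -w: {recomputed:.3f}")
print(f"reported by scontrol show job:         {reported}")
```

Under these assumptions the recomputed value is only a few units, far from the 12946 that scontrol reports, which is consistent with sprio and slurmctld disagreeing about the factors rather than about the weights.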
Will you set your SlurmctldDebug to info, let it run for a little while, and then send the logs? Please leave the Priority DebugFlag set. Will you also run:

    sshare -l
    sprio -l
    scontrol show jobs

FYI, Bug 1469 is experiencing the same issue.

And the output of "sacctmgr show assoc tree" as well, please.

Brian, I can probably give you access to the machine. Would that help?

Bill

On 2/19/2015 5:14 PM, bugs@schedmd.com wrote:
> Comment #8 <http://bugs.schedmd.com/show_bug.cgi?id=1454#c8> on bug 1454 <http://bugs.schedmd.com/show_bug.cgi?id=1454> from Brian Christiansen <mailto:brian@schedmd.com>
>
> And the output of "sacctmgr show assoc tree" as well please.

Ya, we can try that. Will I be able to get logs if I need to? I can just ask you if needed. What do you need from me?

Strangely enough, today all is well on this della cluster. The tiger cluster continues to display the odd behavior, though. I may need to update this case with new values for cluster #2.

And a restart of slurmctld there has fixed this one as well. Strange, since I did this before and things did not self-correct. I'm going to sit on this over the weekend (but log a different problem in the meantime!)

Thanks

Okay, let's close this one now. No idea why it suddenly works after that last restart of the daemons. I did do this already when we first noticed the issue. So let's assume that this is some user-caused issue and put it away!

Thanks...Bill

OK. Let us know if it pops up again.

Hey Bill,

Just letting you know that several fixes for the priority problem went into 14.11.6.

https://github.com/SchedMD/slurm/commit/b2be6159517b197188b474a9477698f6f5edf480
https://github.com/SchedMD/slurm/commit/f61475e8005bb9d834f8e227b0dc4282cc1f0aee

Thanks,
Brian
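If the problem does come back, one way to spot it is to compare the PRIORITY column of plain `sprio` output against the `Priority=` field of `scontrol show job`. The Python sketch below does that on captured text. The helper names (`parse_sprio`, `parse_scontrol_priority`, `find_mismatches`) are hypothetical, and the parsing assumes the column layout seen in this thread. The sample input is the output posted earlier for job 184.

```python
import re


def parse_sprio(text):
    """Return {job_id: priority} from plain `sprio` output.

    Assumes JOBID is the first column and PRIORITY the second,
    as in the output shown in this thread.
    """
    priorities = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job id; header and prompts do not.
        if len(fields) >= 2 and fields[0].isdigit():
            priorities[int(fields[0])] = float(fields[1])
    return priorities


def parse_scontrol_priority(text):
    """Return the integer after 'Priority=' in scontrol output, or None."""
    match = re.search(r"\bPriority=(\d+)", text)
    return int(match.group(1)) if match else None


def find_mismatches(sprio_text, scontrol_texts):
    """Compare sprio against scontrol for each job id in scontrol_texts.

    scontrol_texts maps job id -> captured `scontrol show job <id>` output.
    Returns {job_id: (sprio_priority, scontrol_priority)} for disagreements.
    """
    from_sprio = parse_sprio(sprio_text)
    bad = {}
    for job_id, text in scontrol_texts.items():
        from_scontrol = parse_scontrol_priority(text)
        if job_id not in from_sprio or from_scontrol is None:
            continue  # nothing to compare for this job
        if from_sprio[job_id] != from_scontrol:
            bad[job_id] = (from_sprio[job_id], from_scontrol)
    return bad


# Sample taken from the output posted in this thread.
SPRIO_SAMPLE = """\
  JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
    184       7681         15       6667       1000          0
    190       2673          7       1667       1000          0
"""
SCONTROL_SAMPLE = {184: "   Priority=7681 Nice=0 Account=test2 QOS=normal"}

print(find_mismatches(SPRIO_SAMPLE, SCONTROL_SAMPLE))
```

On a live system the two text inputs would come from running the commands and capturing their output. That step is left out here so the sketch stays runnable without a Slurm installation.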