Ticket 152

Summary: Invalid age priority factor in slurm
Product: Slurm Reporter: Bill Brophy <bill.brophy>
Component: Scheduling Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 1 - System not usable    
Priority: --- CC: da
Version: 2.3.x   
Hardware: Linux   
OS: Linux   
Site: CEA Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Bill Brophy 2012-10-25 09:54:48 MDT
Sometimes Slurm jobs receive the maximum age priority factor just a few minutes, or even seconds, after they are submitted,
even though we have configured a maximum age of 7 days (PriorityMaxAge=7-0, PriorityWeightAge=30000).
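
To make the expectation concrete, here is a minimal sketch of the age contribution we would expect with these settings. It assumes the age factor ramps linearly from 0 to 1 over PriorityMaxAge and is then capped; the function name and the linear model are illustrative assumptions, not code taken from Slurm.

```python
# Sketch only: assumes a linear ramp capped at 1.0 (not copied from Slurm).
SECONDS_PER_DAY = 86400


def expected_age_priority(age_seconds, max_age_seconds, weight_age):
    """Return the weighted age contribution for a job that has waited age_seconds."""
    factor = min(age_seconds / max_age_seconds, 1.0)
    return factor * weight_age


max_age = 7 * SECONDS_PER_DAY  # PriorityMaxAge=7-0
weight = 30000                 # PriorityWeightAge=30000

after_25s = expected_age_priority(25, max_age, weight)
after_7d = expected_age_priority(max_age, max_age, weight)

print(f"after 25 s:   {after_25s:.2f}")
print(f"after 7 days: {after_7d:.2f}")
```

Under that assumption, a job that has waited only a few seconds should get an age contribution close to 1, and reach the full 30000 only after 7 days.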

For example, in this case job 156696 gets the maximum age contribution of 30000 roughly 25 seconds after it was submitted:

[2012-10-23T14:54:23] Fairshare priority of job 156696 for user ***** in acct ***** is 2**(-0.048919/0.006359) = 0.004832
[2012-10-23T14:54:23] Job 156696 priority: 0.00 + 483.18 + 0.00 + 0.00 + 200000.00 - 0 = 200483.18
[2012-10-23T14:54:23] _slurm_rpc_submit_batch_job JobId=156696 usec=1049
[2012-10-23T14:54:49] Fairshare priority of job 156696 for user ***** in acct ***** is 2**(-0.048909/0.006359) = 0.004837
[2012-10-23T14:54:49] Job 156696 priority: 30000.00 + 483.68 + 0.00 + 0.00 + 200000.00 - 0 = 230483.68
[2012-10-23T14:54:56] sched: Allocate JobId=156696 NodeList=airain[1135,1139] #CPUs=32
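
As a sanity check on the log excerpt, the following sketch recomputes the totals and the fairshare factors from the printed values. The component order (age, fairshare, job size, partition, QOS) is an assumption based on the first term being the one that jumps to 30000; the nice value is 0 in both lines and is left out.

```python
# Values copied from the two "Job 156696 priority" log lines.
# Assumed component order: age, fairshare, job size, partition, QOS.
at_submit = [0.00, 483.18, 0.00, 0.00, 200000.00]
later = [30000.00, 483.68, 0.00, 0.00, 200000.00]

total_submit = sum(at_submit)
total_later = sum(later)

# Fairshare factors recomputed from the printed expression 2**(-a/b).
# The inputs are rounded in the log, so the last digit may differ slightly.
fs_submit = 2 ** (-0.048919 / 0.006359)
fs_later = 2 ** (-0.048909 / 0.006359)

# How much of the priority change comes from the first (age) component.
age_jump = later[0] - at_submit[0]

print(f"totals: {total_submit:.2f} -> {total_later:.2f}")
print(f"age component jump: {age_jump:.2f}")
print(f"fairshare factors: {fs_submit:.6f}, {fs_later:.6f}")
```

The totals and fairshare values agree with the log to within rounding, so almost all of the priority increase between the two lines comes from the age component alone.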

This issue has a severe impact, as it can make scheduling completely unfair between our users.

(I am not very familiar with the fairshare logic, so I am not sure whether this is a real problem or just a misunderstanding of how the factors are applied.)

Best Regards,
Bill
Comment 1 Danny Auble 2012-10-25 09:59:34 MDT
This was fixed in 2.4.3. The fix is commit 04bbc0a16c329c82d7913829e223f7cb82535564, which should apply to CEA's version of Slurm.