Ticket 17379

Summary: Inconsistent Priority values as reported by sprio/squeue (believed incorrect) vs sacct (correct value?)
Product: Slurm
Reporter: Robert Derrick <robd>
Component: User Commands
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: heasterday, kauffman, mcoyne, rcwhite, sts
Version: 23.02.1   
Hardware: Linux   
OS: Linux   
Site: LANL
Linux Distro: Other
Machine Name: Snow
Attachments: slurm.conf from Snow (test case system)

Description Robert Derrick 2023-08-08 15:48:58 MDT
sn-rfe1: % nprio 6375178 ; squeue -O prioritylong -j 6375178 ; sjob 6375178 ; scontrol show job 6375178 -o | sed 's/^.*Priority=//' | sed 's/ Nice=.* EligibleTime=//' | sed 's/ .*$//'

From sprio:
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS USER
        Weights                 14400      48000      14400      86400

        6375178      31757  0.0890878  0.1358919  0.0268817  0.2727273 laroche
        6375178      31757       1283       6523        387      23564 laroche

From squeue:
PRIORITY
31757

From sacct:
JobID          Priority    JobName               Start    Elapsed      State      NCPUS
------------ ---------- ---------- ------------------- ---------- ---------- ----------
6375178           30770    VTf00G0             Unknown   00:00:00    PENDING         

From scontrol show job:
Priority:31757
EligibleTime:2023-08-05T19:11:46


We believe that sprio and squeue include the age factor in the Priority value they present, while sacct reports a different, usually lower value; the longer the job has been eligible, the greater the difference seems to be.

We expect all of the tools to report a consistent value, and this is important when we are trying to actually present to users the true Priority value of their pending job.
Comment 1 Carlos Tripiana Montes 2023-08-09 04:39:43 MDT
Hi Robert,

Would you mind sharing the slurm.conf?

Also, I suspect the priority value from "sacct" never changes while the job is PENDING. Am I correct? Meanwhile, the priority from sprio/squeue increases with time.

Thanks,
Carlos.
Comment 2 Robert Derrick 2023-08-09 08:40:33 MDT
Created attachment 31657 [details]
slurm.conf from Snow (test case system)

From Carlos:
> Also, I suspect the priority value from "sacct" never changes while the job is PENDING.
> I am correct? Meanwhile, the priority from sprio/squeue increases with time.

It is my belief when the job becomes Eligible, the Age Factor kicks in for sprio/squeue, and the Priority as reported by these two utilities starts to increase.

While the job is State/Reason = PENDING/!PRIORITY, the sacct Priority does not change due to AGE (although a fluctuating FAIRSHARE factor can cause some variance). Then, when the job moves to PENDING/PRIORITY, the Age Factor appears to start accumulating Priority based on the time in that State/Reason.

At that point, both Priorities increase monotonically and in step, although the difference between the two will remain constant.

Some of the above is conjecture, since I do not know of any way to query the details of sacct's Priority calculations, but it seems to be borne out by what we see.

Regardless of what the final determination or action is, what I would most like to know is which of the two Priority calculations is being used to actually schedule the job -- the one that includes the AGE factor from Eligibility, or the one that starts at PENDING/PRIORITY.
Comment 3 Carlos Tripiana Montes 2023-08-09 10:38:13 MDT
I want to check whether this mismatch in how we present the priority to users is a bug or not. From my understanding of the code, and following what is stated in our docs at [1], the right value includes AGE modifications. See the formula:

Job_priority =
	site_factor +
	(PriorityWeightAge) * (age_factor) +
	(PriorityWeightAssoc) * (assoc_factor) +
	(PriorityWeightFairshare) * (fair-share_factor) +
	(PriorityWeightJobSize) * (job_size_factor) +
	(PriorityWeightPartition) * (partition_factor) +
	(PriorityWeightQOS) * (QOS_factor) +
	SUM(TRES_weight_cpu * TRES_factor_cpu,
	    TRES_weight_<type> * TRES_factor_<type>,
	    ...)
	- nice_factor

[1] https://slurm.schedmd.com/priority_multifactor.html#general
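
As a rough cross-check of the formula against the sprio output in the Description, the sketch below multiplies each reported weight by its normalized factor. It assumes the terms that sprio did not list for this job (site, assoc, partition, TRES, nice) contribute zero, and it rounds each product to the nearest integer only to mirror how sprio displays them; neither assumption is a statement about Slurm's internal arithmetic.

```python
# Weights and normalized factors copied from the sprio output in the Description.
WEIGHTS = {"age": 14400, "fairshare": 48000, "jobsize": 14400, "qos": 86400}
FACTORS = {"age": 0.0890878, "fairshare": 0.1358919,
           "jobsize": 0.0268817, "qos": 0.2727273}


def weighted_terms(weights, factors):
    """Return each weight * factor product, rounded to the nearest integer.

    Rounding per term is an assumption made to match sprio's display.
    """
    return {name: round(weights[name] * factors[name]) for name in weights}


terms = weighted_terms(WEIGHTS, FACTORS)
total = sum(terms.values())

print(terms)  # {'age': 1283, 'fairshare': 6523, 'jobsize': 387, 'qos': 23564}
print(total)  # 31757
```

The per-term values match the second sprio row, and their sum matches the 31757 reported by sprio, squeue and scontrol. The sacct value in the Description, 30770, is 987 points lower than this total.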
Comment 4 Carlos Tripiana Montes 2023-08-16 04:40:05 MDT
That's weird, to say the least.

I've checked on my end, both by reading the source and by running tests: once the job is created (submitted), the database records its initial priority and nothing more.

While the job waits in the pending state, no further updates to the priority are made in the database by default. Even if you send a direct "scontrol update job=ID priority=new_value", it doesn't get updated. It doesn't get updated when the job enters the running state, nor after the job ends.

I am still investigating this issue.

Regards,
Carlos.
Comment 5 Carlos Tripiana Montes 2023-09-12 06:16:45 MDT
Hey Robert,

What I'm going to say may sound a bit dumb, but it isn't really :).

If you search for the word "priority" in these source folders:

src/plugins/accounting_storage
src/slurmdbd

and you only count occurrences related to jobs, you will find the proof that supports my findings in Comment 4: the priority is stored in the database only at job creation time.

So, all in all, right now I am unsure how your statement:

"At that point, both Priorities increase monotonically and in step, although the difference between the two will remain constant."

fits with what I see written in the source, which is consistent with my cluster's behaviour. Are you modifying the jobs' priority in the database directly? sacctmgr doesn't allow modifying that value, by the way.

At the end of the day, storing the initial priority is a way to keep track of how the accounting affects the job's priority *before its lifetime and the cluster state (everybody else using the cluster) affect this value*. If you think about it, only by storing the initial value can you correlate how the accounting affects the job; if you add the cluster state, you lose track of this information.

Regards,
Carlos.
Comment 6 S Senator 2023-09-12 09:17:08 MDT
> Are you modifying the database jobs' priority directly?
No, the only time we access the database directly is for heartbeat SQL *queries*.
Comment 7 Carlos Tripiana Montes 2023-10-06 03:00:51 MDT
Hi,

Have you been able to spot any new details on this issue?

We are running out of ideas on our side. After reading the source code, it seems that the priority in Slurm accounting is recorded only once, as the initial priority when the job was created. So what is recorded is Slurm's priority unaffected by the state of the cluster: that is to say, not affected by fairshare, age, or other jobs in the queue.

That makes sense, since this is accounting. Storing the perturbed priority of a job during its lifespan would lose the information about how the accounting affected those jobs. The correlation would be lost.
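
To illustrate why a value stored once drifts away from the live value, here is a minimal sketch of the age term alone. It assumes the age factor grows linearly with eligible time and is capped at 1.0 once PriorityMaxAge is reached. The 14-day PriorityMaxAge is hypothetical (the real value is in the attached slurm.conf, which is not reproduced here), and the other terms are held fixed at the sprio values from the Description purely for illustration.

```python
DAY = 24 * 3600


def age_points(eligible_seconds, weight_age, max_age_seconds):
    """Age term: weight * (eligible time / max age), with the factor capped at 1.0."""
    factor = min(eligible_seconds / max_age_seconds, 1.0)
    return round(weight_age * factor)


WEIGHT_AGE = 14400    # PriorityWeightAge, from the sprio Weights row
MAX_AGE = 14 * DAY    # hypothetical PriorityMaxAge, not taken from the real slurm.conf
OTHER_TERMS = 30474   # fairshare + jobsize + qos from sprio, held fixed here

# Snapshot taken once, before any age has accrued (stands in for the stored value).
stored = OTHER_TERMS + age_points(0, WEIGHT_AGE, MAX_AGE)

for days in (0, 1, 7, 14, 20):
    live = OTHER_TERMS + age_points(days * DAY, WEIGHT_AGE, MAX_AGE)
    print(days, live, live - stored)
```

Under these assumptions the gap between the live and stored values grows with eligible time until the age factor saturates, and then stays constant.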

I want to clarify one thing here: Your initial issue was reported as something like "I can't understand why the priority in the accounting doesn't match the live priority the slurmctld is reporting when I query it via scontrol".

This difference between priorities has already been explained, so from that perspective the question has been answered. It remains unclear who is altering the priority in the accounting, and why. To my understanding that should not be happening; to be precise, it seems to happen from outside Slurm at some point.

Regards,
Carlos.
Comment 8 Carlos Tripiana Montes 2023-10-18 06:08:57 MDT
Closing as resolved/infogiven for now.

Please, reopen if needed.