Ticket 12796

Summary: slurmrestd association/qos grp/max
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: slurmrestdAssignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: barlowat
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: ORNL-OLCF Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 21.08.6,22.05pre1
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Ticket Depends on:    
Ticket Blocks: 12753    
Attachments: patch for 21.08
patch for 21.08 (v2)

Description Matt Ezell 2021-11-01 13:04:48 MDT
ORNL is developing a tool to sync our account system with slurmdbd using slurmrestd.  There seems to be some inconsistency with "group" and "max" limits on associations and qos.

For associations, parse.c has:
_add_parse(UINT32, grp_jobs, "max/jobs/per/count")
_add_parse(UINT32, grp_jobs_accrue, "max/jobs/per/accruing")
_add_parse(UINT32, grp_submit_jobs, "max/jobs/per/submitted")
_add_parse(TRES_LIST, grp_tres, "max/tres/total")
_add_parse(UINT32, grp_wall, "max/per/account/wall_clock")

This appears to set the "grp" limits into the "max" URL. I don't think the actual max values are accessible. There is no grp url present, though I believe there likely should be.

For qos, parse.c has:
_add_parse(UINT32, grp_jobs_accrue, "limits/max/active_jobs/accruing")
_add_parse(UINT32, grp_jobs, "limits/max/active_jobs/count")
_add_parse(TRES_LIST, grp_tres, "limits/max/tres/total")
_add_parse(TRES_LIST, grp_tres_run_mins, "limits/max/tres/minutes/per/qos")

Again, I would think these should be under "grp" to be consistent with naming elsewhere.

It appears that the association's endpoint docs in Slurm version 20.11.8 are missing key pairs for “count”, “accruing”, and “submitted”, so if this is the correct way to access these values I think a documentation update is also in order.
Comment 1 Nate Rini 2021-11-01 13:36:30 MDT
(In reply to Matt Ezell from comment #0)
> ORNL is developing a tool to sync our account system with slurmdbd using
> slurmrestd.  There seems to be some inconsistency with "group" and "max"
> limits on associations and qos.
> 
> For associations, parse.c has:
> _add_parse(UINT32, grp_jobs, "max/jobs/per/count")
> _add_parse(UINT32, grp_jobs_accrue, "max/jobs/per/accruing")
> _add_parse(UINT32, grp_submit_jobs, "max/jobs/per/submitted")
> _add_parse(TRES_LIST, grp_tres, "max/tres/total")
> _add_parse(UINT32, grp_wall, "max/per/account/wall_clock")
> 
> This appears to set the "grp" limits into the "max" URL. I don't think the
> actual max values are accessible. There is no grp url present, though I
> believe there likely should be.
> 
> For qos, parse.c has:
> _add_parse(UINT32, grp_jobs_accrue, "limits/max/active_jobs/accruing")
> _add_parse(UINT32, grp_jobs, "limits/max/active_jobs/count")
> _add_parse(TRES_LIST, grp_tres, "limits/max/tres/total")
> _add_parse(TRES_LIST, grp_tres_run_mins, "limits/max/tres/minutes/per/qos")
> 
> Again, I would think these should be under "grp" to be consistent with
> naming elsewhere.

The hierarchy was set somewhat arbitrary based on what I thought logically break out the easiest for users. Can you give an example of your current output?

Are you suggesting:
> _add_parse(UINT32, grp_jobs, "max/jobs/per/count")
to
> > _add_parse(UINT32, grp_jobs, "limits/max/jobs/per/count")
instead?
 
> It appears that the association's endpoint docs in Slurm version 20.11.8 are
> missing key pairs for “count”, “accruing”, and “submitted”, so if this is
> the correct way to access these values I think a documentation update is
> also in order.

I suspect this is just a naming issue as slurmdb_assoc_usage_t is in the 'usage' branch of 'dbv0.0.38_association' object:
> accruing -> accrue_job_count
> submitted -> job_count
> count -> active_jobs
Comment 2 Matt Ezell 2021-11-01 13:43:52 MDT
> Are you suggesting
No, slurmdb_assoc_rec_t has the following fields:
grp_jobs
grp_jobs_accrue
grp_submit_jobs
grp_tres
grp_wall
max_jobs
max_jobs_accrue
max_submit_jobs
max_wall_pj

"grp" and "max" are different. I'm suggesting:
> _add_parse(UINT32, grp_jobs, "max/jobs/per/count")
to
> _add_parse(UINT32, grp_jobs, "grp/jobs/per/count")
> _add_parse(UINT32, max_jobs, "max/jobs/per/count")

we need to be able to control both independently and correctly.
Comment 6 Nate Rini 2021-11-01 14:06:17 MDT
(In reply to Matt Ezell from comment #2)
> we need to be able to control both independently and correctly.

I'm going to work on a patch set to get the struct and the parser back into sync. For whatever reason, looks like a few struct members didn't get added.
Comment 7 Nate Rini 2021-11-01 15:13:49 MDT
Created attachment 22064 [details]
patch for 21.08

Please try this patch. It should have all the missing fields in slurmdb_assoc_rec_t.
Comment 8 Matt Ezell 2021-11-01 15:43:55 MDT
(In reply to Nate Rini from comment #7)
> Created attachment 22064 [details]
> patch for 21.08
> 
> Please try this patch. It should have all the missing fields in
> slurmdb_assoc_rec_t.

Thanks.

I will say that having
GrpJobsAccrue = max/jobs/per/accruing
MaxJobsAccrue = max/accrue/per/total
is not straightforward at all.

I will also note that qos has the same issue (paths with max that bind to the grp variable, no access to the max variable).
Comment 9 Nate Rini 2021-11-01 15:47:54 MDT
Do you have a better naming scheme in mind? I'm open to suggestions.
Comment 10 Matt Ezell 2021-11-01 15:57:18 MDT
(In reply to Nate Rini from comment #9)
> Do you have a better naming scheme in mind? I'm open to suggestions.

One option is to avoid coming up with a scheme at all and just bind the name used in the Cli to the association instead of trying to split it into subpaths

GrpJobsAccrue = GrpJobsAccrue
MaxJobsAccrue = MaxJobsAccrue

If you prefer to split it out into subpaths, I would just recommend something semi-consistent between the two similar limits.
Comment 11 Nate Rini 2021-11-05 13:48:55 MDT
Created attachment 22147 [details]
patch for 21.08 (v2)

(In reply to Matt Ezell from comment #10)
> One option is to avoid coming up with a scheme at all and just bind the name
> used in the Cli to the association instead of trying to split it into
> subpaths
> If you prefer to split it out into subpaths, I would just recommend
> something semi-consistent between the two similar limits.

It's too late for 21.08 to change the naming schema, so I moved around the new fields something that I hope makes it easier to understand. While several fields got listed as moved in the rest release notes, a new field took the place of the original which I assume will work as originally intended by users given the field name.

If the naming schema is troublesome or confusing, please open an RFE ticket and we can look at moving fields around for 22.05.

Please give this patch set a try. It should sync both QOS and association to the internal structs to avoid missing fields. The NEWS entries in the patchset will likely need to be ignored for testing on <=21.08.3.
Comment 28 Nate Rini 2022-01-25 10:10:00 MST
Matt,

The patches for this issue are upstream for the upcoming 21.08.6 release:
> 2f29b40754 openapi/dbv0.0.37 - add missing fields to "dbv0.0.37_qos"
> f1b629d1f4 openapi/dbv0.0.37 - Rename incorrectly named fields in dbv0.0.37_qos
> 08a4f49ac9 openapi/dbv0.0.37 - add missing field to "dbv0.0.37_association"

Please feel free to test it.

Thanks,
--Nate
Comment 29 Nate Rini 2022-01-31 08:04:28 MST
Closing this ticket. Please report if there are any more issues after 21.08.6 and we can continue debugging.