This is just an informational ticket. Is there a field one can use in sacct to see whether or not an exclusive job was run on a host? I know one can use sacct to examine jobs run against a host (or hosts) with the -N flag. Thanks, Shawn
Hi Shawn,

I've looked into what's recorded when you use the --exclusive flag, and there isn't a field that shows this particular flag being specified for a job. The way to tell whether it was requested is by looking at the number of requested/allocated CPUs. It sounds like this is what you said you are doing currently, or at least know about. For reference, you can look at the requested and allocated CPUs by specifying the format of the sacct output, like this:

sacct -j 48845 -o jobid,jobname,partition,account,reqcpus,alloccpus,state,exitcode

You would have to know the number of CPUs configured for the node to verify that all of them were requested and it wasn't just a partial-node request.

There are similar requests open to have more information about jobs saved in the database; see bug 7609 and bug 3010. As you can see in those bugs, a possible way to work around this in the meantime is to create a submit filter that looks for the --exclusive flag and adds an admin_comment to the job, which is recorded in the database and can be referenced later. This isn't a very nice workaround and can get messy fast if you add comments for multiple things.

This is probably not what you were hoping for, but I hope it helps. Feel free to let me know if you have additional questions about this.

Thanks,
Ben
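The reqcpus/alloccpus check above can be sketched as a small script. This is a minimal illustration, not a Slurm tool: the per-node CPU counts are hypothetical placeholders (on a real cluster you would populate them from scontrol show node or slurm.conf), and it assumes a simple comma-separated nodelist rather than the bracketed ranges (node[01-03]) sacct can emit.

```python
# Sketch: infer whether a job likely ran node-exclusive by comparing its
# allocated CPUs against the CPUs configured on its allocated node(s).
# NODE_CPUS is an illustrative placeholder, not real cluster data.

NODE_CPUS = {"node01": 36, "node02": 36, "node03": 40}  # hypothetical counts

def parse_sacct_line(line):
    """Parse one line of `sacct -n -P -o jobid,reqcpus,alloccpus,nodelist`."""
    jobid, reqcpus, alloccpus, nodelist = line.strip().split("|")
    return jobid, int(reqcpus), int(alloccpus), nodelist.split(",")

def looks_exclusive(alloccpus, nodes, node_cpus=NODE_CPUS):
    """True if the job's allocated CPUs equal the total CPUs of its nodes."""
    total = sum(node_cpus[n] for n in nodes)
    return alloccpus == total

# Example with a fabricated sacct output line for a job on a 36-CPU node:
jobid, req, alloc, nodes = parse_sacct_line("48845|1|36|node01")
print(jobid, looks_exclusive(alloc, nodes))  # → 48845 True
```

Note this is only a heuristic, for the reason given above: a job that happens to request every CPU on the node is indistinguishable from one submitted with --exclusive.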
Hi Ben,

Thanks for the reply. Yes, we were aware of the sacct approach with reqcpus and alloccpus, and we also have the submit filter logging the use of --exclusive. Agreed, we can see how the latter gets messy, and for the former we'll just need to write a case statement to handle the five CPU host-type variants at ESRL.

For this case, is there a way to monitor when (or whether) a host leaves the "mixed" state? Our use case for this request is to health-check hosts that stay in "mixed" as a shared resource for a prolonged period of time. We purposefully avoid running NHC via cron on hosts.

Thanks,
Shawn
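One way to watch for hosts lingering in "mixed", sketched below under stated assumptions: each poll would in practice read node states from something like sinfo -h -N -o "%N %T", but here the poll results are passed in directly so the tracking logic stands on its own. The function names and the one-hour threshold are illustrative, not a Slurm API.

```python
# Sketch: flag nodes that stay in the "mixed" state beyond a threshold.
# The caller supplies {node: state} snapshots (e.g. parsed from sinfo)
# plus a timestamp for each poll; this only tracks durations.

def update_mixed_tracker(tracker, node_states, now):
    """Record when each node entered 'mixed'; drop nodes that left it."""
    for node, state in node_states.items():
        if state == "mixed":
            tracker.setdefault(node, now)   # remember first time seen mixed
        else:
            tracker.pop(node, None)         # node left mixed; reset
    return tracker

def long_mixed(tracker, now, threshold):
    """Nodes that have been in 'mixed' for at least `threshold` seconds."""
    return sorted(n for n, t0 in tracker.items() if now - t0 >= threshold)

# Example: node02 stays mixed for an hour; node01 leaves and re-enters.
tracker = {}
update_mixed_tracker(tracker, {"node01": "mixed", "node02": "mixed"}, now=0)
update_mixed_tracker(tracker, {"node01": "allocated", "node02": "mixed"}, now=1800)
update_mixed_tracker(tracker, {"node01": "mixed", "node02": "mixed"}, now=3600)
print(long_mixed(tracker, now=3600, threshold=3600))  # → ['node02']
```

Running this from a central monitoring host keeps with the constraint above of not placing NHC cron jobs on the compute hosts themselves.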
Hi Shawn,

I think I understand what you're trying to get from this. Unfortunately there isn't a node state that distinguishes a node running a single job from one running multiple jobs. I tried to come up with some way of doing this, but currently a node with multiple jobs that is fully allocated looks the same as a node with a single job that occupies the whole node. To be able to tell whether there's an exclusive job, you would have to piece together information from the jobs on the system. Having this data tracked would be an enhancement over the current functionality.

There is a request open in bug 4161 that aligns with what you are asking. If a change like this is made, we would want to do it in a way that makes the information available both to slurmctld, through a command like scontrol, and to slurmdbd, with a command like sacct.

I'm going to mark this bug as a duplicate of bug 4161, but if you have any additional questions about this feel free to let me know.

Thanks,
Ben

*** This ticket has been marked as a duplicate of ticket 4161 ***
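The "piecing together information from the jobs on the system" step above could look roughly like the following sketch. It assumes job-to-node data of the kind squeue -h -o "%i %N" would provide, with nodelists already expanded into simple lists (real squeue output may use bracketed ranges); the helper names are illustrative, not a Slurm API.

```python
from collections import defaultdict

# Sketch: distinguish a node occupied by a single job from one shared by
# several, given (jobid, [nodes]) pairs gathered from the scheduler.

def jobs_per_node(job_nodes):
    """Map each node to the set of job IDs currently running on it."""
    per_node = defaultdict(set)
    for jobid, nodes in job_nodes:
        for node in nodes:
            per_node[node].add(jobid)
    return per_node

def single_job_nodes(per_node):
    """Nodes occupied by exactly one job (candidates for exclusive use)."""
    return sorted(n for n, jobs in per_node.items() if len(jobs) == 1)

per_node = jobs_per_node([("101", ["node01"]),
                          ("102", ["node01"]),
                          ("103", ["node02", "node03"])])
print(single_job_nodes(per_node))  # → ['node02', 'node03']
```

As the comment above explains, a single job on a node still isn't proof of --exclusive; you would additionally have to check that the job's allocation covers the whole node.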