Ticket 6982 - Simpler core-based accounting methods
Summary: Simpler core-based accounting methods
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.7
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-05-08 16:17 MDT by Anthony DelSorbo
Modified: 2019-11-15 09:39 MST
CC List: 4 users

See Also:
Site: NOAA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: NESCC
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Anthony DelSorbo 2019-05-08 16:17:36 MDT
I am writing a reporting tool that collects sacct -X data from the database and need to know how to get true core-hours from a job's data.  My first inclination was to use ElapsedRaw to get time and NCPUS to get the CPU count, multiply them to get cpu-seconds, and go from there.  But in ticket 5946 I learned that because the nodes have hyperthreading turned on, this wound up doubling the number of cores (in NCPUS), hence the values were twice as large as they should have been, since we use core-hours, not thread-hours.  So, with ticket 5946 we wound up hacking our configuration to fool the system into believing we only had 1/2 the number of cores so that it would do the right thing.

While this was OK, it didn't address all sites, where they might be using threading and scheduling to the thread while we still had to account by the core.  So I thought that billing under AllocTRES would do the trick.  This looked promising - after all, it says billing, so this must be what the developers intended.  Duh - why didn't I think of that before?  So I changed my reporting tool to use the billing value instead.  I ran the report I developed and all looked promising.  But when running the report on another system, I came to find out they had set their partition TRESBillingWeights to values in the hundreds rather than leaving them at 1.  So, as you can imagine, the charging on that system is now bloated beyond what anyone can reasonably believe.

In going back to the slurm.conf man page, I reread the meaning of TRESBillingWeights and only now caught the part that states: "...TRESBillingWeights -- which is used for fairshare calculations...."  This is horribly confusing.  Why would you label something BILLING and have it affect fairshare?  To me, billing means "you used 100 core-hours, times $10.00 per core-hour - here's your bill."

So now I'm faced with trying to find a consistent means of determining what the true core-hour usage is today on any system in our domain that is using Slurm.  Just as important, that value must be consistent whether hyperthreading is turned on or not, and the end result must not change for a given point in time in history, regardless of any settings that may have changed in the configuration files since.  That is, for example, if hyperthreading was off a month ago and I ran X jobs in Y time, but I turn on hyperthreading today and run the exact same tests in the exact same amount of time, the numbers must be the same.  If I run a report today on today's data as well as a report today on last month's data, the values must be the same.

So, how do I get consistent and reliable core-hour usage?

Tony.
Comment 1 Albert Gil 2019-05-09 02:32:00 MDT
Hi Tony,

> In going back to the slurm.conf man page, I reread the meaning of
> TRESBillingWeights and (now) catch the part that states:
> "...TRESBillingWeights -- which is used for fairshare calculations...." 
> This is horribly confusing.  Why would you label something BILLING and have
> it affect fairshare? To me, billing means "you used 100 core hours, times
> $10.00 per core hour - here's your bill"

You can see the "share" of a user or account as the amount (of money) that they "paid" for and "own" of the cluster resources (TRES).
That amount gives them the right to use a % of the overall resources/TRES.
Their usage can be seen as the "cost" of each job, so the "bill" for that cost/job.
If you think of it that way, the TRESBillingWeights can be seen as the specific "price" of each resource/TRES (after all, each TRES has a different price and demand, right?).

As the fairshare algorithm tries to ensure that every user or account is always able to access their % of the resources in the medium term (while maximizing the overall usage of the resources/TRES at any point), it uses the "bills" of the user to track what % the user has been using/costing, and increases or decreases the priority of their jobs to achieve the right %.
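
As an illustration only (the partition name, node list, and weights here are hypothetical, not your configuration), a slurm.conf partition line like the following makes each allocated CPU count as 1 billing unit and each GB of memory as 0.25 units; that billing value feeds the fairshare usage tracking, it is not an invoice:

PartitionName=batch Nodes=c[1-4] TRESBillingWeights="CPU=1.0,Mem=0.25G"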

Does that make sense to you?

> So now I'm faced with trying to find a consistent means of determining what
> the true core-hour usage is today on any system in our domain that is using
> Slurm.  Just as important, that value must be consistent whether
> hyperthreading is turned on or not, and the end result must not change for a
> given point of time in history regardless of the settings that may have
> changed in the configuration files.  That is, for example if hyperthreading
> was off a month ago and I run X jobs in Y time but I turn on hyperthreading
> today and run the exact same tests in the exact amount of time, the numbers
> must be the same.  If I run a report today on today's data as well as a
> report today on last month's data, the values must be the same.
> 
> So, how do I get consistent and reliable core-hour usage?

I guess that CPUTime from sacct, with the right setup of the nodes, is not working for you, right?
If it's not, then I need to go deeper into bug 5946 to understand the problem.


Albert
Comment 2 Anthony DelSorbo 2019-05-09 11:55:17 MDT
(In reply to Albert Gil from comment #1)

> > 
> > So, how do I get consistent and reliable core-hour usage?
> 
> I guess that CPUTime from sacct with the right setup of the nodes is not
> working for you, right?
> If it's not, then I need to go deeper on bug 5946 to understand the problem.
> 
Yes, please review 5946 to see what we had to do to fool Slurm.  But I don't like the hack we had to implement.  What I would prefer is a column that identifies the number of cores allocated to that job - not hyperthreads (or ncpus); something that is consistent across systems and the various sites we manage, where hyperthreading may or may not be turned on.  It should not be affected by any other factor or weight - it is a raw value.  And, because it's not affected by anything else, I can confidently re-query historical data and get the same answer I got the first time I queried it.  Consider that - at that time - I may have had hyperthreading turned on, or my TRESBillingWeights may have gone through a multitude of changes in the interim.

I hope that makes sense.

Thanks,

Tony.
Comment 4 Albert Gil 2019-05-10 13:09:35 MDT
Hi Tony,

I think that I understand your problem: you just want a value in the DB that, no matter whether HT is enabled or not on the servers, always represents the core-hours allocated to a job.

If I'm right, then I think that CPUTime is your friend, and you just need to be sure that a "Slurm CPU" always represents a Core.
Please note that a "Slurm CPU" can represent Cores or Threads based on your setup / choice (https://slurm.schedmd.com/mc_support.html).

To show you how it works, let's follow an example of a node where HT is first enabled and then disabled, running jobs on it in both states and seeing how they are accounted.

Let's have a single-node cluster with only 1 socket and 2 cores on it (huge, right? ;-).

First, with HT enabled, "slurmd -C" detects it (ThreadsPerCore=2) and by default recommends CPUs to be threads:

# slurmd -C
NodeName=c1 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7693

But, as we want "CPU accounting" (and scheduling) by Cores and not by Threads, we follow the advice in bug 5946 and change CPUs in slurm.conf to match the number of Cores:

NodeName=c1 CPUs=2          SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7693

Now we run a couple of jobs requesting 1 and 2 tasks (we only have two cores):

$ sbatch -n 1 --wrap "sleep 10"                                                            
Submitted batch job 10
$ sbatch -n 2 --wrap "sleep 10"                     
Submitted batch job 11

If we now query the CPUTime used, it is actually reporting "CoreTime" (not "ThreadTime"):
 
$ sacct -X -j 10,11 --format JobID,AllocCPUS,Elapsed,CPUTime              
                 
       JobID  AllocCPUS    Elapsed    CPUTime 
------------ ---------- ---------- ---------- 
10                    1   00:00:10   00:00:10 
11                    2   00:00:10   00:00:20 


So far so good, but now we reboot our node and disable the HT.
What will happen?

First, slurmd -C notices it (ThreadsPerCore=1 -> CPUs=2):

$ slurmd -C
NodeName=c1 CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7693

We need to keep slurm.conf in sync with the hardware setup, so this time we also change it to match what slurmd -C says:

NodeName=c1 CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7693


Actually, if you set this in your slurm.conf it will work the same whether HT is enabled or disabled:

NodeName=c1 CPUs=2 SocketsPerBoard=1 CoresPerSocket=2 RealMemory=7693

As a sanity check, we repeat our historical query and can see that the CPUTime values have not changed at all:

$ sacct -X -j 10,11 --format JobID,AllocCPUS,Elapsed,CPUTime

       JobID  AllocCPUS    Elapsed    CPUTime 
------------ ---------- ---------- ---------- 
10                    1   00:00:10   00:00:10 
11                    2   00:00:10   00:00:20 

I understand that this was concerning you, right?
So, FIRST KEY POINT: no hardware or setup change will ever change the value of CPUTime in the DB.
Not even if we mess up the CPUs config in the future.

But let's move on and run new jobs with the same requirements as before:

$ sbatch -n 1 --wrap "sleep 10"
Submitted batch job 12
$ sbatch -n 2 --wrap "sleep 10"
Submitted batch job 13


Now, if we query their CPUTime, we can confirm that it is the same as we got with HT enabled:

$ sacct -X -j 10,11,12,13 --format JobID,AllocCPUS,Elapsed,CPUTime

       JobID  AllocCPUS    Elapsed    CPUTime 
------------ ---------- ---------- ---------- 
10                    1   00:00:10   00:00:10 
11                    2   00:00:10   00:00:20 
12                    1   00:00:10   00:00:10 
13                    2   00:00:10   00:00:20 

If I understood correctly, this was also a concern for you, right?
So, SECOND KEY POINT: CPUTime means the time that a "Slurm CPU" was allocated, and a "Slurm CPU" is by default the logical CPU (a thread if HT is enabled, as in htop), but it can be forced to mean a Core with the advice from bug 5946 (setting its value to the number of Cores in slurm.conf).


In summary,
if you follow bug 5946 and set the CPUs of your nodes to the number of Cores, regardless of whether the nodes have HT enabled or not, the CPUTime value from sacct is historically reliable and represents the Core-Hours of the jobs.
The key part is that Slurm works in SlurmCPU-Hours, and a SlurmCPU can be configured by the cluster admins to represent Cores or Threads (or even one thing on one node and the other on another node).  If you ensure that the CPUs of each node properly represent Cores, no matter whether HT is enabled or not, I think that CPUTime is what you are looking for.
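
For example, a minimal sketch of a monthly core-hours report under that assumption (CPUs configured as Cores on every node; the time window here is only illustrative) could just sum CPUTimeRAW, which sacct reports in cpu-seconds:

$ sacct -X -a -n -P -S 2019-04-01 -E 2019-05-01 --format=CPUTimeRAW | \
    awk '{sum += $1} END {printf "%.2f core-hours\n", sum/3600}'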

Does it make sense to you?
Albert
Comment 7 Albert Gil 2019-05-29 07:57:29 MDT
Hi Tony,

I hope that comment 4 solved your problem... did it work for you?


Albert
Comment 8 Anthony DelSorbo 2019-05-29 13:54:59 MDT
(In reply to Albert Gil from comment #7)
> Hi Tony,
> 
> I hope that comment 4 solved your problem... did it work for you?
> 
> 
> Albert

Albert, actually, no.  I still fail to see how this would work using _the exact same reporting tool_ across two sites where one schedules by the core and the other schedules by the thread.  At both sites hyperthreading is on.  But one schedules by the core because Slurm is tricked by implementing the directions given in bug 5946, while the other schedules by the thread (5946 not implemented).  Both report by the thread, and since the first has only one thread per core it reports correctly, but the second reports twice as much.

So I need something in the database that tells me how many actual cores were utilized by the job.
Comment 12 Albert Gil 2019-05-30 09:50:44 MDT
(In reply to Anthony DelSorbo from comment #8)
> Albert, actually, no.  I still fail to see how this would work using _the
> exact same reporting tool_ across two sites where one would schedule by the
> core and the other schedule by the thread. At both sites hyperthreading is on.


Ok, I misunderstood; I thought you always wanted to schedule (and account/report) by Cores too, and that your problem was only that some nodes have HT on and some off.

Now I see that you have different sites / clusters, some scheduling by thread and some by core, and that you want accounting by Core on all of them.

I think that what you want is not possible right now, but let me discuss it internally and come back to you with a proper answer.


Albert
Comment 19 Albert Gil 2019-06-03 12:04:42 MDT
Hi Tony,

If I now understand it better, your scenario is:
- You have different Sites, some where Slurm schedules by Thread and some by Core.
- On all your Sites you may have nodes with HT enabled (especially in Sites scheduling by Threads) and some with HT disabled (especially in Sites scheduling by Cores).
- The Sites and Nodes configuration may change over time.
- On all your Sites you want accounting by Cores ("CoreTime").

We think that the right solution for your scenario is to enhance Slurm by adding Core as a new TRES.
Then you would be able to query Elapsed and AllocTRES to obtain core=N, in a similar way to what you can already do for Nodes.
Not exactly like node, though: if you run a job with a single Thread on a Node with N Threads, in TRES we account node=1, not node=1/N, whereas it seems that for Cores you want a division by the ThreadsPerCore of that Node at that moment in time, right?
It may make sense, but it's a big enhancement in terms of code.

Currently, to work around your scenario:
- Ensure that the CPUs of *all* the nodes of a site always represent a Core for sites scheduling by core, and always a Thread for sites scheduling by thread.
  - Sites scheduling by Core: CPUs = # of Cores
  - Sites scheduling by Threads: not specifying CPUs will set them to Threads by default.
- If you enable or disable HT on any node, it won't change the meaning of CPU for that node.
- Accounting for sites scheduling by Core:
  - If you follow the above recommendation, CPUTime will always be what you want.
- Accounting for sites scheduling by Thread:
  - There is no simple and correct solution until we have Core as a TRES, but we can provide some hints to implement a workaround to estimate it:
  - With "sacct" you can obtain the Threads used with AllocCPUs (or by parsing "cpu=" in AllocTRES)
  - If all the nodes of the site have the same ThreadsPerCore and it is historically constant:
    - Just divide by ThreadsPerCore and multiply by Elapsed to get the "CoreTime" that you want.
  - If nodes of the site have different ThreadsPerCore or they change over time:
    - With "sacct" you can obtain the Nodes where the job was run with NodeList
      - With "scontrol show hostnames $nodelist" you can convert the above nodelist (in compressed format) to an actual list of nodes
    - For each node you can obtain current ThreadsPerCore with "scontrol show node $node" and parsing "ThreadsPerCore="
    - For each node you can check if CPUs was changed over time with "sacctmgr show event where node=$node format=timestart,TRES"
    - With all that info you should be able to estimate the CoreTime quite accurately; see the sketch below.
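
As a rough sketch of that estimate for a single job (untested, and assuming the job's nodes share the same ThreadsPerCore and that the current value still matches the one at the time the job ran):

#!/bin/bash
# Estimate core-seconds for one job on a site that schedules by thread.
jobid=$1
# On such a site AllocCPUS means threads; ElapsedRaw is in seconds.
read -r alloc_cpus elapsed_raw nodelist < <(
    sacct -X -n -P -j "$jobid" --format=AllocCPUS,ElapsedRaw,NodeList | tr '|' ' ')
# Take ThreadsPerCore from the first node of the allocation.
first_node=$(scontrol show hostnames "$nodelist" | head -n 1)
tpc=$(scontrol show node "$first_node" | grep -o 'ThreadsPerCore=[0-9]*' | cut -d= -f2)
echo "Job $jobid: ~$(( alloc_cpus * elapsed_raw / tpc )) core-seconds"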


Hope that helps,
Albert
Comment 20 Jason Booth 2019-06-03 14:20:35 MDT
Hi Tony,

I do need to add to Albert's comment that you should be able to get what you are after by following his suggestions below.  In addition, adding a TRES is not trivial.

> We think that the right solution for your scenario is to enhance Slurm adding Core as a new TRES.

Although this may be a possible solution, it is important to point out that we have not scoped it out, so we do not have a complete understanding of how involved it would be, or whether we have interest in doing this type of feature.

I would highly suggest you try the suggestions that Albert proposed below and see if this will work for you.

-Jason
Comment 21 Albert Gil 2019-06-21 05:32:03 MDT
Hi Tony,

Is it OK with you if we close this ticket as infogiven, based on comments 19 and 20?

Regards,
Albert
Comment 22 Anthony DelSorbo 2019-06-25 07:41:08 MDT
(In reply to Albert Gil from comment #21)
> Hi Tony,
> 
> Is it OK with you if we close this ticket as infogiven, based on comments 19
> and 20?
> 
> Regards,
> Albert

Most everything you mention would rely on scontrol show job, which is not feasible for historical accounting - especially if things change over time.  Without knowing whether threads or cores were being used over time, the data loses an element of integrity.  I would not be able to compare today's jobs run on cores to those of a year ago, since I wouldn't know whether the system was set up for cores or hyperthreading at that time.  Consider that there may not be historical continuity among support engineers as they retire or move on.  The purpose of the database is data integrity.  So it sounds like there is a need for additional columns of information to be stored.  What is the level of effort needed for this enhancement?

Thanks,

Tony.
Comment 23 Albert Gil 2019-06-25 12:33:23 MDT
Hi Tony,
 
> Most everything you mention would rely on scontrol show job which is not
> feasible for historical accounting - especially if things change over time.

Actually, I would say that most of the info is obtained from sacct, not scontrol (especially not scontrol show job).
And the rest of the info comes from sacctmgr (the events of configuration changes).

scontrol is only used for this convenient helper command:

$ scontrol show hostnames $nodelist

Note that this command doesn't depend on any Slurm config or change; it just converts node name strings between "compressed format" (e.g. "node[1-5]") and list format (e.g. "node1 node2 node3 node4 node5").
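
For example (the node names here are illustrative):

$ scontrol show hostnames "node[1-3,7]"
node1
node2
node3
node7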
 
> Without knowing whether threads or cores were being used over time, the data
> loses an element of integrity.  I would not be able to compare today's jobs
> run on cores to those of a year ago since I wouldn't know if the system was
> setup for cores or hyperthreading at that time.  Consider that there may not
> be historical continuity among support engineers as either they retired or
> moved on.  The purpose of the database is data integrity.  So, it sounds
> like there is a need for additional columns of information to be stored.

To keep track of historical changes in the HT / CPUs meaning, please see the sacctmgr command that I mentioned in comment 19, in this paragraph:

>> - If nodes of the site have different ThreadsPerCore or they change over time:
>>     - With "sacct" you can obtain the Nodes where the job was run with NodeList
>>       - With "scontrol show hostnames $nodelist" you can convert the above nodelist (in compressed format) to an actual list of nodes
>>     - For each node you can obtain current ThreadsPerCore with "scontrol show node $node" and parsing "ThreadsPerCore="
>>     - For each node you can check if CPUs was changed over time with "sacctmgr show event where node=$node format=timestart,TRES"


Let me expand my explanation with an example of how to use "sacctmgr show event" to register and query historical changes in the HT setup.
Let's have a node with HT disabled and thus accounting by Core (CPU means Core):

NodeName=DEFAULT Sockets=1 CPUs=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=1024

To register that in the event table, just set the node down and restart its slurmd (or "scontrol reboot" the node):

$ scontrol update nodename=c1 state=down reason="Register No-HT"
(restart the slurmd of the node and/or resume it)

In the event table we can see how it is registered, with the TRES of that node at that time:

$ sacctmgr show event node=c1 format=nodename,start,reason,tres%30
       NodeName           TimeStart                          Reason                           TRES
--------------- -------------------  ------------------------------ ------------------------------
c1              2019-06-25T19:38:21                  Register No-HT         billing=2,cpu=2,mem=1G


Now, let's enable HT but keep the accounting in Cores (CPUs=Cores) in slurm.conf:

NodeName=DEFAULT Sockets=1 CPUs=2 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=1024

Stop the slurmd of the node and restart the slurmctld.
Register the node as down with the right reason:

$ scontrol update nodename=c1 state=down reason="Register HT CPUs as Cores"

Start the slurmd and see how it is registered:

$ sacctmgr show event All_time node=c1 format=nodename,start,reason,tres%30
       NodeName           TimeStart                          Reason                           TRES
--------------- -------------------  ------------------------------ ------------------------------
c1              2019-06-25T19:38:21                  Register No-HT         billing=2,cpu=2,mem=1G
c1              2019-06-25T19:40:44       Register HT CPUs as Cores         billing=2,cpu=2,mem=1G

Note that the TRES/cpu is still 2, so no change needs to be made in your accounting tool, as CPUs (and CPUTime) still mean "Core".

Now let's change the setup again: not only enable HT, but also make the accounting in Threads instead of Cores (CPUs=Threads) in slurm.conf:

NodeName=DEFAULT Sockets=1 CPUs=4 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=1024

Again, stop the slurmd, restart the slurmctld and register this change:

$ scontrol update nodename=c1 state=down reason="Register HT CPUs as Threads"

Start the slurmd and see how it is registered:

$ sacctmgr show event All_time node=c1 format=nodename,start,reason,tres%30
       NodeName           TimeStart                          Reason                           TRES
--------------- -------------------  ------------------------------ ------------------------------
c1              2019-06-25T19:38:21                  Register No-HT         billing=2,cpu=2,mem=1G
c1              2019-06-25T19:40:44       Register HT CPUs as Cores         billing=2,cpu=2,mem=1G
c1              2019-06-25T19:41:39     Register HT CPUs as Threads         billing=4,cpu=4,mem=1G

As you can see, CPUs has now changed, so for jobs after that change CPU and CPUTime mean Thread and "ThreadTime".
So your tool has to take this into account.


If you change the whole cluster from Cores to Threads, then you won't need to register every single node; you can use the automatic cluster registrations:

$ sacctmgr show event cluster format=clusternodes,start,end,reason,tres%30
       Cluster Nodes           TimeStart             TimeEnd                         Reason                           TRES 
-------------------- ------------------- ------------------- ------------------------------ ------------------------------ 
              c[1-4] 2018-12-21T17:35:39 2019-06-25T19:41:21        Cluster Registered TRES billing=8,cpu=8,energy=0,fs/d+ 
              c[1-4] 2019-06-25T19:41:21             Unknown        Cluster Registered TRES billing=16,cpu=16,energy=0,fs+ 


As you can see I have just 4 nodes, all of them with the DEFAULT config used above.
Note that at the same time that we registered c1 above, specifically when the controller was restarted, Slurm automatically registered a change in the TRES, and the number of CPUs is now 2 times the previous value.
Maybe checking this is enough for you?
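
As a rough sketch of how a reporting tool could use this (untested, and built only from the commands shown above), you could look up which cluster TRES registration was in effect when a given job started, to know whether "cpu" meant Core or Thread at that point in history:

$ job_start=$(sacct -X -n -P -j 11 --format=Start)
$ sacctmgr -n -P show event All_time cluster format=start,end,tres | \
    awk -F'|' -v s="$job_start" '$1 <= s && (s < $2 || $2 == "Unknown")'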


Anyway, we know that this is not a perfect solution for your use case; it needs effort in the accounting tool and probably some manual registration of configuration changes (in the events table of the database).
But I still think that it should allow you to obtain something really close to the accounting data that you are looking for.

 
> What is the level of effort needed for this enhancement?

This needs some internal discussion.
I leave this to Tim and Brian (in CC).

I know that you already explained clearly that you want accounting always as "CoreTime", but as you have such heterogeneous setups, I would also recommend discussing with them whether you really need all of this, or the enhancement to obtain that CoreTime, or whether Slurm's flexible CPUTime alone would let you account properly, even if it doesn't always mean "CoreTime", because it does mean "user reserved/consumed time" in each setup.  But probably I'm missing key points of your setup and I'm wrong.


Hope that helps,
Albert
Comment 29 Tim Wickberg 2019-07-03 15:30:12 MDT
Tony -

I'm working through what a tenable solution would be here, but we still do not have agreement on the best path forward, and the back-and-forth on this issue I think has distracted from the underlying issue and the specific use case.

From my understanding, the central issue is that you'd like easy access to core-based job accounting. As the NOAA systems vary in whether HT is enabled or disabled (and the settings will occasionally change within the lifespan of any given cluster) this is difficult at present since our cpu (which in Slurm represents the threads) values will thus vary by a factor of two.

Is that an accurate summary?

And are there any concerns around establishing and updating limits on the system in terms of core-based usage? One issue with introducing any additional TRES values is that they're used as both a way to represent usage, and as a way to build resource limits, and thus introducing a Core TRES is not quite as simple as it may seem.

- Tim

(I'm changing this issue title as well; the previous discussion around how TRESBillingWeight works is not germane to the real issue at hand.)
Comment 30 Anthony DelSorbo 2019-07-04 07:02:10 MDT
(In reply to Tim Wickberg from comment #29)
> Tony -
> 
> I'm working through what a tenable solution would be here, but we still do
> not have agreement on the best path forward, and the back-and-forth on this
> issue I think has distracted from the underlying issue and the specific use
> case.
> 
> From my understanding, the central issue is that you'd like easy access to
> core-based job accounting. As the NOAA systems vary in whether HT is enabled
> or disabled (and the settings will occasionally change within the lifespan
> of any given cluster) this is difficult at present since our cpu (which in
> Slurm represents the threads) values will thus vary by a factor of two.
> 
> Is that an accurate summary?
> 
> And are there any concerns around establishing and updating limits on the
> system in terms of core-based usage? One issue with introducing any
> additional TRES values is that they're used as both a way to represent
> usage, and as a way to build resource limits, and thus introducing a Core
> TRES is not quite as simple as it may seem.
> 
> - Tim
> 
> (I'm changing this issue title as well; the previous discussion around how
> TRESBillingWeight works is not germane to the real issue at hand.)


Tim,

Thanks for the reply.  No issues on changing the title.  If you haven't already, you should also take a look at 5946 for additional info.

Your summary is appropriate, but I wouldn't classify it as "easy access".  The main issue is that there is no method of definitively knowing whether a job was submitted/run using threads or cores.  There's not enough information in the job's database records to make that determination.

Consider that today I have two environments, one at Boulder where they have multiple classes of nodes in different partitions and no HT.  At the NESCC (Fairmon), I have one class of systems and HT on all nodes.  I run the same accounting script at both sites.  So, in order to keep accounting correct at the NESCC, I have to implement the hack in 5946.

Now suppose that tomorrow we get a new class of nodes to add to the cluster at Boulder and we decide that HT should be on.  Do I implement the hack in 5946?  I can't be spending time re-developing my accounting scripts every time there's a new system.  Nor can I count on a configuration file for my accounting scripts, as that doesn't keep track of historical changes.  Further, suppose I submit a 4-cpu job to a 16-core node (32 threads).  Did the job use 4 cores or 2 cores?  How would I know?

But if the data were there in the database with the job, I could simply use that info - whether the job was run yesterday or a year ago, I would have sufficient details for consistent reporting.  Consider too that the information might be used by users to compare their job runs today with those of a year ago.  Having the details available from that time is necessary for a valid comparison.  Is a database column that identifies the job layout per node and core the right approach?

I hope that clears things up.
Comment 32 Tim Wickberg 2019-09-26 10:35:42 MDT
Apologies for the spam in the preceding comment; that account has been banned.