Ticket 7531

Summary: Raw usage for fairshare not accumulating
Product: Slurm
Reporter: Robert Olson <olson>
Component: Accounting
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 19.05.0   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: slurm.conf

Description Robert Olson 2019-08-06 17:25:52 MDT
Created attachment 11131: slurm.conf

I'm sure this is a configuration problem somewhere but I'm not seeing it. 

I have a small cluster set up on CentOS 7 with fair-share accounting configured. However, after a large number of jobs have run, sshare still shows no usage accruing:

bash-4.2$ sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          1.000000           0      1.000000   0.500000 
 webservice                           1024    0.500000           0      0.000000   1.000000 
  rast                                 512    0.250000           0      0.000000   1.000000 
  seed                                 512    0.250000           0      0.000000   1.000000 

All jobs are submitted with an account specified via -A. My configuration includes:

JobAcctGatherType=jobacct_gather/linux

(slurm.conf is also attached.)

I'm not sure what would be useful to add to help track this down; let me know. Thank you!
Bob
Comment 1 Robert Olson 2019-08-07 09:41:47 MDT
Digging further. We have lots of data in the accounting database:

mysql> select * from maas_assoc_usage_day_table limit 10;
+---------------+------------+---------+----+---------+------------+------------+
| creation_time | mod_time   | deleted | id | id_tres | time_start | alloc_secs |
+---------------+------------+---------+----+---------+------------+------------+
|    1561093200 | 1561093200 |       0 |  0 |       1 | 1561006800 |       1515 |
|    1561179600 | 1561194000 |       0 |  0 |       1 | 1561093200 |    3249006 |
|    1561266000 | 1561276800 |       0 |  0 |       1 | 1561179600 |    6220434 |
|    1561352400 | 1561352400 |       0 |  0 |       1 | 1561266000 |    5295102 |
|    1561438800 | 1561446000 |       0 |  0 |       1 | 1561352400 |    6078030 |
|    1561525200 | 1561532400 |       0 |  0 |       1 | 1561438800 |    6220494 |
|    1561611600 | 1561626000 |       0 |  0 |       1 | 1561525200 |    6220620 |
|    1561698000 | 1561701600 |       0 |  0 |       1 | 1561611600 |    8992801 |
|    1561784400 | 1561784400 |       0 |  0 |       1 | 1561698000 |   14694290 |
|    1561870800 | 1561870800 |       0 |  0 |       1 | 1561784400 |    6543524 |
+---------------+------------+---------+----+---------+------------+------------+
10 rows in set (0.00 sec)

mysql> select count(*) from maas_assoc_usage_day_table ;
+----------+
| count(*) |
+----------+
|      192 |
+----------+
1 row in set (0.04 sec)

However, every entry has id=0. From reading the source, that id should correspond to an id_assoc in the cluster's assoc_table, and that table has no row with id 0:


mysql> select id_assoc, user, acct, parent_acct from maas_assoc_table;
+----------+------+------------+-------------+
| id_assoc | user | acct       | parent_acct |
+----------+------+------------+-------------+
|        1 |      | root       |             |
|        2 | root | root       |             |
|        3 |      | webservice | root        |
|        4 |      | seed       | webservice  |
|        5 |      | rast       | webservice  |
+----------+------+------------+-------------+
5 rows in set (0.00 sec)
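
The mismatch can also be checked mechanically. The following is a minimal sketch in plain shell: it takes the ids seen in the usage table and the id_assoc values from the assoc table (both copied by hand from the query output above) and prints every usage id that has no matching association. The function name orphan_usage_ids is made up for illustration and is not part of Slurm.

```shell
# Ids that appear in maas_assoc_usage_day_table (every row shown has id=0).
usage_ids="0"
# id_assoc values present in maas_assoc_table.
assoc_ids="1 2 3 4 5"

# Hypothetical helper: print each id from list $1 that is missing from list $2.
orphan_usage_ids() {
    for u in $1; do
        match=no
        for a in $2; do
            if [ "$u" = "$a" ]; then
                match=yes
            fi
        done
        if [ "$match" = no ]; then
            echo "$u"
        fi
    done
}

# With the values from this ticket, the only orphaned usage id is 0.
orphan_usage_ids "$usage_ids" "$assoc_ids"
```

Any id printed here is usage that cannot be attributed to an association, so it will not show up under any account in sshare.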

So this does look like a configuration problem somewhere in the accounting setup.

Usage is associated with the account properly:

bash-4.2$ sacct -A seed | head
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
512641_3           wrap        all       seed          4  COMPLETED      0:0 
512641_3.ba+      batch                  seed          4  COMPLETED      0:0 
512641_13          wrap        all       seed          4  COMPLETED      0:0 
512641_13.b+      batch                  seed          4  COMPLETED      0:0 
512641_14          wrap        all       seed          4  COMPLETED      0:0 
512641_14.b+      batch                  seed          4  COMPLETED      0:0 
512641_23          wrap        all       seed          4  COMPLETED      0:0 
512641_23.b+      batch                  seed          4  COMPLETED      0:0 

But not with the association:

bash-4.2$ sacct -x 4
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
bash-4.2$ sacct -x 0 | head
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
512641_3           wrap        all       seed          4  COMPLETED      0:0 
512641_3.ba+      batch                  seed          4  COMPLETED      0:0 
512641_13          wrap        all       seed          4  COMPLETED      0:0 
512641_13.b+      batch                  seed          4  COMPLETED      0:0 
512641_14          wrap        all       seed          4  COMPLETED      0:0 
512641_14.b+      batch                  seed          4  COMPLETED      0:0 
512641_23          wrap        all       seed          4  COMPLETED      0:0 
512641_23.b+      batch                  seed          4  COMPLETED      0:0
Comment 2 Robert Olson 2019-08-07 10:26:47 MDT
One final note for anyone who wanders in here: apparently the reason I was able to submit all these jobs is that AccountingStorageEnforce is not set by default.
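
For reference, a slurm.conf fragment that turns this check on might look like the sketch below. With the documented value associations, Slurm rejects a job at submission unless the submitting user has a valid association for the requested account. Other documented values (for example limits and qos) add further enforcement. Whether this setting suits a given site is an assumption to verify against local policy.

```
# slurm.conf (fragment): refuse jobs that have no valid user/account association
AccountingStorageEnforce=associations
```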
Comment 3 Robert Olson 2019-08-07 10:28:14 MDT
I think the comment with my solution was lost. The problem was that I hadn't associated any users with the accounts. Even though the jobs were submitted with an account specified, no matching association was found, so the usage was not logged properly (it ended up under id 0, as shown in comment 1).
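
A minimal sketch of that fix, assuming the submitting user is named olson (a placeholder; substitute the real login) and should be able to charge jobs to the seed and rast accounts shown earlier. The helper add_user_to_account and the DRY_RUN switch are illustrative only. By default the script just prints the sacctmgr commands so they can be reviewed first.

```shell
# DRY_RUN=1 (the default) prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: create a user association under an existing account.
add_user_to_account() {
    user="$1"
    account="$2"
    if [ "$DRY_RUN" = 1 ]; then
        echo "sacctmgr -i add user name=$user account=$account"
    else
        # -i (immediate) commits the change without the confirmation prompt.
        sacctmgr -i add user name="$user" account="$account"
    fi
}

# Associate the placeholder user with both leaf accounts from the ticket.
for acct in seed rast; do
    add_user_to_account olson "$acct"
done
```

Afterwards, sshare -a should list the user under each account, and usage from newly run jobs should be charged to those associations.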