| Summary: | Raw usage for fairshare not accumulating | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Robert Olson <olson> |
| Component: | Accounting | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 19.05.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Attachments: | slurm.conf | ||
Digging further. We have lots of data in the accounting database:
mysql> select * from maas_assoc_usage_day_table limit 10;
+---------------+------------+---------+----+---------+------------+------------+
| creation_time | mod_time | deleted | id | id_tres | time_start | alloc_secs |
+---------------+------------+---------+----+---------+------------+------------+
| 1561093200 | 1561093200 | 0 | 0 | 1 | 1561006800 | 1515 |
| 1561179600 | 1561194000 | 0 | 0 | 1 | 1561093200 | 3249006 |
| 1561266000 | 1561276800 | 0 | 0 | 1 | 1561179600 | 6220434 |
| 1561352400 | 1561352400 | 0 | 0 | 1 | 1561266000 | 5295102 |
| 1561438800 | 1561446000 | 0 | 0 | 1 | 1561352400 | 6078030 |
| 1561525200 | 1561532400 | 0 | 0 | 1 | 1561438800 | 6220494 |
| 1561611600 | 1561626000 | 0 | 0 | 1 | 1561525200 | 6220620 |
| 1561698000 | 1561701600 | 0 | 0 | 1 | 1561611600 | 8992801 |
| 1561784400 | 1561784400 | 0 | 0 | 1 | 1561698000 | 14694290 |
| 1561870800 | 1561870800 | 0 | 0 | 1 | 1561784400 | 6543524 |
+---------------+------------+---------+----+---------+------------+------------+
10 rows in set (0.00 sec)
mysql> select count(*) from maas_assoc_usage_day_table ;
+----------+
| count(*) |
+----------+
| 192 |
+----------+
1 row in set (0.04 sec)
However, all entries are id=0. From a read of the source, that id should correspond to ids in the cluster assoc_table. There we don't have an id=0:
mysql> select id_assoc, user, acct, parent_acct from maas_assoc_table;
+----------+------+------------+-------------+
| id_assoc | user | acct | parent_acct |
+----------+------+------------+-------------+
| 1 | | root | |
| 2 | root | root | |
| 3 | | webservice | root |
| 4 | | seed | webservice |
| 5 | | rast | webservice |
+----------+------+------------+-------------+
5 rows in set (0.00 sec)
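The mismatch can be made explicit with a small cross-check. This is only a sketch: the association ids and the first three usage rows are copied by hand from the query output above, and the helper name `orphaned_usage` is made up for illustration. It is not part of Slurm.

```python
# Association ids copied from the maas_assoc_table output above.
assoc_ids = {1, 2, 3, 4, 5}

# (id, time_start, alloc_secs) for the first three usage rows shown above.
usage_rows = [
    (0, 1561006800, 1515),
    (0, 1561093200, 3249006),
    (0, 1561179600, 6220434),
]


def orphaned_usage(rows, known_ids):
    """Sum alloc_secs per usage id that has no matching association id."""
    totals = {}
    for assoc_id, _time_start, alloc_secs in rows:
        if assoc_id not in known_ids:
            totals[assoc_id] = totals.get(assoc_id, 0) + alloc_secs
    return totals


orphans = orphaned_usage(usage_rows, assoc_ids)
print(orphans)
```

Every sampled row lands in the orphan bucket under id 0, which is consistent with sshare having nothing to attribute to the real associations.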
So this does look like a configuration problem somewhere in the accounting setup.
Usage is associated with the account properly:
bash-4.2$ sacct -A seed | head
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
512641_3 wrap all seed 4 COMPLETED 0:0
512641_3.ba+ batch seed 4 COMPLETED 0:0
512641_13 wrap all seed 4 COMPLETED 0:0
512641_13.b+ batch seed 4 COMPLETED 0:0
512641_14 wrap all seed 4 COMPLETED 0:0
512641_14.b+ batch seed 4 COMPLETED 0:0
512641_23 wrap all seed 4 COMPLETED 0:0
512641_23.b+ batch seed 4 COMPLETED 0:0
But not with the association:
bash-4.2$ sacct -x 4
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
bash-4.2$ sacct -x 0 | head
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
512641_3 wrap all seed 4 COMPLETED 0:0
512641_3.ba+ batch seed 4 COMPLETED 0:0
512641_13 wrap all seed 4 COMPLETED 0:0
512641_13.b+ batch seed 4 COMPLETED 0:0
512641_14 wrap all seed 4 COMPLETED 0:0
512641_14.b+ batch seed 4 COMPLETED 0:0
512641_23 wrap all seed 4 COMPLETED 0:0
512641_23.b+ batch seed 4 COMPLETED 0:0
One final comment for anyone who might wander here: apparently the reason I was able to submit all these jobs is that AccountingStorageEnforce is not set by default. I think the comment with my solution was lost. The problem was that I hadn't associated users with the accounts, so even though the jobs were submitted with an account specified, no matching association was found and the usage was not logged properly.
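For anyone wanting to apply the same fix, here is a rough sketch of generating the `sacctmgr add user` commands that create the missing user associations. The user names `alice` and `bob` are hypothetical, the accounts are the ones from this ticket, and the helper `add_user_commands` is made up for illustration. The script only prints the commands for review and does not run them.

```python
import shlex


def add_user_commands(memberships):
    """Build one `sacctmgr add user` argv list per (account, user) pair."""
    commands = []
    for account, users in sorted(memberships.items()):
        for user in sorted(users):
            commands.append(
                ["sacctmgr", "-i", "add", "user", f"name={user}", f"account={account}"]
            )
    return commands


# Hypothetical users; replace with the real account membership.
memberships = {"seed": ["alice"], "rast": ["alice", "bob"]}

for argv in add_user_commands(memberships):
    # Print a shell-quoted command line instead of executing it.
    print(shlex.join(argv))
```

Once the associations exist, setting `AccountingStorageEnforce=associations` in slurm.conf should make Slurm reject jobs that have no valid association, so the same mistake fails at submit time instead of silently logging usage against id 0.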
Created attachment 11131: slurm.conf
I'm sure this is a configuration problem somewhere, but I'm not seeing it. I have a small cluster set up on CentOS 7 with fair-share accounting configured. However, after a large number of jobs have run, I'm not seeing any usage accruing in sshare:
bash-4.2$ sshare
Account              User       RawShares  NormShares  RawUsage    EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                       1.000000    0           1.000000      0.500000
webservice                      1024       0.500000    0           0.000000      1.000000
rast                            512        0.250000    0           0.000000      1.000000
seed                            512        0.250000    0           0.000000      1.000000
The jobs are all run with an account specified with -A. I have JobAcctGatherType=jobacct_gather/linux (slurm.conf also attached).
I'm not sure what would be useful to add to help track this down; let me know.
Thank you!
Bob