Ticket 5924

Summary: runaway jobs in slurmdb
Product: Slurm    Reporter: Ryan Day <day36>
Component: Database    Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 4 - Minor Issue
Priority: ---    CC: Steven.Silk, sts
Version: 17.11.9
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5969
Site: LLNL

Description Ryan Day 2018-10-25 11:25:54 MDT
We apparently had some connectivity issues or something that led to a bunch of runaway jobs in the slurm_acct_db. It was around a big update of the cluster, and everything was on 17.02 at the time, so I'm not too interested in trying to figure out exactly what happened, but I do want to get it cleaned up so that sreport will give reasonable numbers. I ran 'sacctmgr show runawayjobs', and that appears to have taken care of the runaway jobs in the output of sacct, but the sreport numbers are still weird. I noticed that there are a lot of jobs with time_eligible and time_start='0' and state='3' (Completed, I think) in the _job_table from around the same dates as the runaway jobs. Looking at other clusters, the only jobs with time_eligible='0' appear to be in state='4' or state='0'. Could those be incorrect and messing up the usage table rollups?
Comment 1 Marshall Garey 2018-10-25 15:16:49 MDT
First let's make sure that all the runaway jobs are actually cleared up; then we can look at other things. 'sacctmgr show runaway', as currently implemented, only looks at jobs in the pending, running, and suspended states. However, we have found instances of jobs in the completed or cancelled states that are runaway. We're working on fixing that internally (for 19.05 and maybe 18.08).

Can you run this mysql query and upload the results?

select id_job,time_start,time_end,state from <your_cluster_name>_job_table where time_end=0 and state > 2;
Comment 2 Ryan Day 2018-10-25 16:21:13 MDT
That looks clean:

MariaDB [slurm_acct_db]> select id_job,time_start,time_end,state from surface_job_table where time_end=0 and state > 2;
Empty set (0.00 sec)
Comment 5 Marshall Garey 2018-10-26 13:32:49 MDT
(In reply to Ryan Day from comment #0)
> I noticed
> that there are a lot of jobs with time_eligible and time_start='0' and
> state='3' (Completed, I think) in the _job_table from around the same dates
> as the runaway jobs. Looking at other clusters, the only jobs with
> time_eligible='0' appear to be in state='4' or state='0'. Could those be
> incorrect and messing up the usage table rollups?


time_eligible=0 and state=0 (pending) is fine and normal - it means that a job is pending and not eligible to run (for example, it has a job dependency).

time_eligible=0 and state=4 (cancelled) is also normal - it means that a job of the previous type was cancelled before it was ever eligible to run.

time_start=0 and state=3 (completed) seems bad to me. I don't have any jobs like that in any of my databases. Can you show the output of this query?

select id_job,time_eligible,time_start,time_end,state from surface_job_table where time_start=0 and state=3;


Can you also show an example of an sreport command that has bad numbers? If possible, can you get the results of an sacct command for that same data so we can see if sacct is reporting differently from sreport?
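The state combinations above can be sketched as a small triage helper (an illustration only; the numeric codes 0=pending, 3=completed, 4=cancelled follow the comment above, and the function name is made up):

```python
# Assumed subset of Slurm job state codes, per the explanation above:
# 0 = PENDING, 3 = COMPLETED, 4 = CANCELLED
def classify(time_eligible, time_start, state):
    """Triage one row from <cluster>_job_table per the rules above."""
    if time_eligible == 0 and state == 0:
        return "ok: pending, not yet eligible (e.g. a job dependency)"
    if time_eligible == 0 and state == 4:
        return "ok: cancelled before it was ever eligible to run"
    if time_start == 0 and state == 3:
        return "bad: completed job with no start time"
    return "ok"

print(classify(0, 0, 0))  # normal pending job
print(classify(0, 0, 3))  # the suspicious records in this ticket
```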
Comment 6 Ryan Day 2018-10-26 15:19:42 MDT
We had one user piling in a *lot* of jobs, so the results of that query are pretty long:

MariaDB [slurm_acct_db]> select id_job,time_eligible,time_start,time_end,state from surface_job_table where time_start=0 and state=3;
+--------+---------------+------------+------------+-------+
| id_job | time_eligible | time_start | time_end   | state |
+--------+---------------+------------+------------+-------+
|    307 |             0 |          0 | 1532710012 |     3 |
|    309 |             0 |          0 | 1532710012 |     3 |
|   2432 |             0 |          0 | 1532729834 |     3 |
|   4686 |             0 |          0 | 1532733088 |     3 |
|   7956 |             0 |          0 | 1532737695 |     3 |
|  11332 |             0 |          0 | 1532743030 |     3 |
|  12588 |             0 |          0 | 1532833916 |     3 |
|  13760 |             0 |          0 | 1533232922 |     3 |
|  13761 |             0 |          0 | 1533232923 |     3 |
|  13763 |             0 |          0 | 1533232923 |     3 |
|  13762 |             0 |          0 | 1533232923 |     3 |
<< I cut a bunch out here >>
|  45198 |             0 |          0 | 1533318743 |     3 |
|  45197 |             0 |          0 | 1533318705 |     3 |
|  45199 |             0 |          0 | 1533318742 |     3 |
|  45202 |             0 |          0 | 1533318709 |     3 |
|  45204 |             0 |          0 | 1533318709 |     3 |
|  45203 |             0 |          0 | 1533318714 |     3 |
+--------+---------------+------------+------------+-------+
18003 rows in set (0.05 sec)

MariaDB [slurm_acct_db]>

Most of those jobs belong to one user, isaac4, with the utah account. The sreport that turned this up still has him at ~1000% (before running the 'sacctmgr show runawayjobs', isaac4 was at ~1600%):

[day36@surface86:~]$ sreport -t percent user topusage start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-25T23:59:59 (2160000 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account        Used   Energy 
--------- --------- --------------- --------------- ----------- -------- 
  surface    isaac4 Benjamin John +            utah     970.45%    0.00% 
  surface   silva50  James M. Silva              lc      74.54%    0.00% 
  surface      root            root            root      48.78%    0.00% 
  surface   silva50  James M. Silva           views      24.85%    0.00% 
  surface    wang83 Jonathan M. Wa+            uiuc      23.67%    0.00% 
  surface    tomaso Tomas Oppelstr+             cms       9.93%    0.00% 
  surface   kaplan7 Alan David Kap+            msdf       8.08%    0.00% 
  surface   lee1004     Donghwa Lee         micphys       6.98%    0.00% 
  surface   jortega Jason M. Ortega              me       5.47%    0.00% 
  surface  mundhetn Terrell Nathan+           hpcdl       1.63%    0.00% 
[day36@surface86:~]$
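For intuition on how a user can show >100% here, a back-of-envelope model (not sreport's actual code; treating the percentage as allocated seconds over cluster capacity for the period is an assumption, and all the numbers below are made up):

```python
# Hypothetical cluster: 100 CPUs over a 25-day reporting window.
capacity_secs = 100 * 25 * 86400

# Legitimate usage, plus stale "runaway" rows that kept accruing in the
# assoc usage tables long after the jobs were gone (made-up magnitude).
real_alloc_secs = 50 * 25 * 86400
runaway_alloc_secs = 9 * capacity_secs

pct = 100 * (real_alloc_secs + runaway_alloc_secs) / capacity_secs
print(f"{pct:.2f}%")  # far over 100%, like the 970.45% above
```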

isaac4 has too many jobs in the last month to allow easy pasting of sacct data, but I can put them in as an attachment if you like (23697 jobs). Here's the sacct output for silva50:

[day36@surface86:~]$ sacct -u silva50 -S 10/1/18 -E 10/26/18 -o 'jobid,eligible,start,end,state'
       JobID            Eligible               Start                 End      State 
------------ ------------------- ------------------- ------------------- ---------- 
[day36@surface86:~]$

One hole in my hypothesis that the weird numbers in sreport are due to the 'time_eligible=0,state=3' jobs is that there are no jobs that fit that description for silva50 in the surface_job_table, and there are no jobs at all for silva50 / views:

MariaDB [slurm_acct_db]> select user,acct,id_assoc from surface_assoc_table where user='silva50';
+---------+-------+----------+
| user    | acct  | id_assoc |
+---------+-------+----------+
| silva50 | lc    |     1718 |
| silva50 | views |     1798 |
+---------+-------+----------+
2 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> select id_job,time_eligible,time_start,time_end,state from surface_job_table where id_assoc='1718';
+--------+---------------+------------+------------+-------+
| id_job | time_eligible | time_start | time_end   | state |
+--------+---------------+------------+------------+-------+
|     21 |    1532624402 | 1532624402 | 1532624803 |     3 |
|     29 |    1532624898 | 1532624898 | 1532625075 |     3 |
|     31 |    1532625425 |          0 | 1532625432 |     4 |
|     32 |    1532625442 |          0 | 1532625745 |     4 |
|     33 |    1532625777 |          0 | 1532626173 |     4 |
|     37 |    1532627653 | 1532627653 | 1532627653 |     3 |
+--------+---------------+------------+------------+-------+
6 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> select id_job,time_eligible,time_start,time_end,state from surface_job_table where id_assoc='1798';
Empty set (0.02 sec)

MariaDB [slurm_acct_db]>
Comment 7 Ryan Day 2018-10-29 09:51:14 MDT
Where is sreport actually getting its data for the 'user topusage' report? I had assumed it was the <cluster>_assoc_usage_<hour|day|month>_table, but silva50/views doesn't have anything in those:

MariaDB [slurm_acct_db]> select user,acct,id_assoc from surface_assoc_table where user='silva50';
+---------+-------+----------+
| user    | acct  | id_assoc |
+---------+-------+----------+
| silva50 | lc    |     1718 |
| silva50 | views |     1798 |
+---------+-------+----------+
2 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> select * from surface_assoc_usage_hour_table where id='1798' and id_tres='4';
Empty set (0.01 sec)

MariaDB [slurm_acct_db]> select * from surface_assoc_usage_day_table where id='1798' and id_tres='4';
Empty set (0.01 sec)

MariaDB [slurm_acct_db]> select * from surface_assoc_usage_month_table where id='1798' and id_tres='4';
Empty set (0.03 sec)

MariaDB [slurm_acct_db]>

but he still shows up with >20% in the topusage report (previous comment).
Comment 8 Ryan Day 2018-10-29 09:55:09 MDT
Never mind. Silva50/views doesn't have anything for id_tres=4, but does have stuff for id_tres=1:

MariaDB [slurm_acct_db]> select * from surface_assoc_usage_day_table where id='1798' and id_tres='1';
+---------------+------------+---------+------+---------+------------+------------+
| creation_time | mod_time   | deleted | id   | id_tres | time_start | alloc_secs |
+---------------+------------+---------+------+---------+------------+------------+
|    1540314376 | 1540486862 |       0 | 1798 |       1 | 1532502000 |   18204760 |
|    1540314376 | 1540486862 |       0 | 1798 |       1 | 1532588400 |   51563988 |
|    1540314376 | 1540486862 |       0 | 1798 |       1 | 1532674800 |   54604800 |
|    1540314376 | 1540486862 |       0 | 1798 |       1 | 1532761200 |   54604800 |
...
|    1540278000 | 1540486862 |       0 | 1798 |       1 | 1540191600 |   54604800 |
|    1540461601 | 1540486862 |       0 | 1798 |       1 | 1540278000 |   54604800 |
|    1540461601 | 1540486862 |       0 | 1798 |       1 | 1540364400 |   54604800 |
|    1540537200 | 1540537200 |       0 | 1798 |       1 | 1540450800 |   20476800 |
+---------------+------------+---------+------+---------+------------+------------+
93 rows in set (0.00 sec)
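As a sanity check on these numbers (an illustrative calculation; taking id_tres=1 as the CPU TRES and reading alloc_secs as CPU-seconds charged to the association per day are assumptions):

```python
SECONDS_PER_DAY = 86400

# A fully charged day from the table above: the association is billed
# at a constant width for the entire day.
alloc_secs = 54604800

cpus = alloc_secs / SECONDS_PER_DAY
print(cpus)  # CPUs charged around the clock for that day
```

That the value divides evenly into a round, constant CPU count day after day is consistent with usage still accruing from runaway records rather than from real, varying job load.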

The sreport for -T node is even weirder though:

[day36@surface86:~]$ sreport -t percent user topusage start=10/1 cl=surface
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-28T23:59:59 (2419200 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account        Used   Energy 
--------- --------- --------------- --------------- ----------- -------- 
  surface    isaac4 Benjamin John +            utah     866.70%    0.00% 
  surface   silva50  James M. Silva              lc      66.55%    0.00% 
  surface      root            root            root      43.56%    0.00% 
  surface   silva50  James M. Silva           views      22.18%    0.00% 
  surface    wang83 Jonathan M. Wa+            uiuc      21.14%    0.00% 
  surface    tomaso Tomas Oppelstr+             cms       8.98%    0.00% 
  surface   kaplan7 Alan David Kap+            msdf       7.21%    0.00% 
  surface   jortega Jason M. Ortega              me       6.45%    0.00% 
  surface   lee1004     Donghwa Lee         micphys       6.24%    0.00% 
  surface    gromit William D. Kra+              lc       1.60%    0.00% 
[day36@surface86:~]$ sreport -t percent -T node user topusage start=10/1 cl=surface
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-28T23:59:59 (2419200 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account      TRES Name       Used 
--------- --------- --------------- --------------- -------------- ---------- 
  surface    isaac4 Benjamin John +            utah           node 3249899206 
  surface   silva50  James M. Silva              lc           node      0.00% 
  surface      root            root            root           node 36147400.0 
  surface   silva50  James M. Silva           views           node      0.00% 
  surface    wang83 Jonathan M. Wa+            uiuc           node 7926475900 
  surface    tomaso Tomas Oppelstr+             cms           node 3366052000 
  surface   kaplan7 Alan David Kap+            msdf           node 2705004700 
  surface   jortega Jason M. Ortega              me           node 2417730600 
  surface   lee1004     Donghwa Lee         micphys           node 2338426400 
  surface    gromit
Comment 9 Marshall Garey 2018-10-29 11:51:34 MDT
Let's get those job records fixed first. I'm not certain that it will change the sreport output, but they need fixing regardless.

* Backup your database before making any changes by hand
* Do a select, but order by time_end so that you can get the earliest time_end. I used desc so that the last record would have the earliest timestamp, for my convenience, but do it however you want.

select id_job,time_eligible,time_start,time_end,state from surface_job_table where time_start=0 and state=3 order by time_end desc;


* Run this query to fix the jobs with time_start=0 and state=3 (completed):

update surface_job_table set time_start=time_end where time_start=0 and time_end!=0 and state=3;

The time_end!=0 condition is probably not necessary, but I included it anyway.

* Reset the last rollup time (in <cluster>_last_ran_table) to before the earliest time_end of all those jobs.

update surface_last_ran_table set hourly_rollup=<timestamp of earliest time_end>;
update surface_last_ran_table set daily_rollup=<timestamp of earliest time_end>;
update surface_last_ran_table set monthly_rollup=<timestamp of earliest time_end>;

* Wait for the rollup to happen (at the next hour).

* Re-run sreport. If there are still problems, I'd like to further investigate the possibility of bad data in the association usage tables.
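The repair steps above can be sketched end to end against a throwaway SQLite copy of the schema (an illustration only; the real tables are MariaDB, the columns are trimmed to the ones this ticket touches, and as noted above you should only ever run the real thing against a backed-up database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Trimmed stand-ins for <cluster>_job_table and <cluster>_last_ran_table.
cur.execute("CREATE TABLE job_table "
            "(id_job INTEGER, time_start INTEGER, time_end INTEGER, state INTEGER)")
cur.execute("CREATE TABLE last_ran_table "
            "(hourly_rollup INTEGER, daily_rollup INTEGER, monthly_rollup INTEGER)")
cur.executemany("INSERT INTO job_table VALUES (?,?,?,?)", [
    (307, 0, 1532710012, 3),          # bad: completed with time_start=0
    (21, 1532624402, 1532624803, 3),  # healthy completed job
])
cur.execute("INSERT INTO last_ran_table VALUES (1540000000, 1540000000, 1540000000)")

# Step 1: find the earliest time_end among the broken rows.
earliest = cur.execute(
    "SELECT MIN(time_end) FROM job_table WHERE time_start=0 AND state=3"
).fetchone()[0]

# Step 2: give the broken rows a plausible start time.
cur.execute("UPDATE job_table SET time_start=time_end "
            "WHERE time_start=0 AND time_end!=0 AND state=3")

# Step 3: rewind the rollup pointers so usage is re-aggregated from there.
cur.execute("UPDATE last_ran_table SET hourly_rollup=?, daily_rollup=?, "
            "monthly_rollup=?", (earliest, earliest, earliest))

remaining = cur.execute(
    "SELECT COUNT(*) FROM job_table WHERE time_start=0 AND state=3"
).fetchone()[0]
print(earliest, remaining)
```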
Comment 10 Ryan Day 2018-10-29 15:15:49 MDT
Okay. I've cleaned up those time_start='0', state='3' entries in the surface_job_table, but it didn't fix the sreport output, and the surface_assoc_usage tables still show time allocated for, e.g. silva50/views/tres=1. So, it looks like the table rollups aren't getting rid of those entries.
Comment 11 Marshall Garey 2018-10-30 11:05:35 MDT
Yep, sreport topusage of nodes is definitely weird:

$ sreport -t percent -T node user topusage start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-29T23:59:59 (2505600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account      TRES Name     Used 
--------- --------- --------------- --------------- -------------- -------- 
      byu  marshall        Marshall            test           node 48712300


Can you file that bug in a new ticket? I think it is independent of the rest of this bug.


===================================================================

This sreport command does query the assoc usage tables, and since those never get reset, I think those tables may have bad data from the runaway jobs.

* Once again, backup your database before making any changes.
* Run these queries to reset the association tables to a time that is before all those runaway jobs and the other jobs with time_start=0 and time_end!=0 and state=3 (I used Oct 1 as an example):

update surface_assoc_usage_hour_table set alloc_secs=0 where time_start>=UNIX_TIMESTAMP('2018-10-01');
update surface_assoc_usage_day_table set alloc_secs=0 where time_start>=UNIX_TIMESTAMP('2018-10-01');
update surface_assoc_usage_month_table set alloc_secs=0 where time_start>=UNIX_TIMESTAMP('2018-10-01');

* Reset the last ran tables to the same time stamp

update surface_last_ran_table set hourly_rollup=UNIX_TIMESTAMP('2018-10-01');
update surface_last_ran_table set daily_rollup=UNIX_TIMESTAMP('2018-10-01');
update surface_last_ran_table set monthly_rollup=UNIX_TIMESTAMP('2018-10-01');

* The association usage tables will get re-calculated when rollup happens at the beginning of the next hour. Then, re-run your sreport query.

* Assuming this does fix it, we'll need to do something to automate this fix. I think modifying the 'sacctmgr show runaway' command might be a good way to do it.
Comment 12 Ryan Day 2018-10-31 09:40:08 MDT
Thank you Marshall! Setting alloc_secs to '0' in the assoc_usage tables and completely re-rolling everything fixed things up so that my sreport numbers are much happier now.

I also filed bug 5954 on the 'sreport -T nodes -t percent user topusage' output.
Comment 13 Marshall Garey 2018-10-31 16:21:54 MDT
Great, I'm glad that worked. Thanks for reporting that other issue.

I've created an internal bug (bug 5959) to make sure we reset the assoc usage tables as needed.

Closing as resolved/infogiven.

- Marshall