| Summary: | runaway jobs in slurmdb | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Day <day36> |
| Component: | Database | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | Steven.Silk, sts |
| Version: | 17.11.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5969 | ||
| Site: | LLNL | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Description
Ryan Day
2018-10-25 11:25:54 MDT
First let's make sure that all the runaway jobs are actually cleared up, then we can look at other things. sacctmgr show runaway as it is currently implemented only looks at jobs in the pending, running, and suspended states. However, we have found instances of jobs in the completed or cancelled states that are runaway. We're working on fixing that internally (for 19.05 and maybe 18.08). Can you run this mysql query and upload the results?
select id_job,time_start,time_end,state from <your_cluster_name>_job_table where time_end=0 and state > 2;
===================================================================
That looks clean:
MariaDB [slurm_acct_db]> select id_job,time_start,time_end,state from surface_job_table where time_end=0 and state > 2;
Empty set (0.00 sec)
===================================================================
(In reply to Ryan Day from comment #0)
> I noticed that there are a lot of jobs with time_eligible and time_start='0'
> and state='3' (Completed, I think) in the _job_table from around the same
> dates as the runaway jobs. Looking at other clusters, the only jobs with
> time_eligible='0' appear to be in state='4' or state='0'. Could those be
> incorrect and messing up the usage table rollups?
time_eligible=0 and state=0 (pending) is fine and normal - it means that a job is pending and not eligible to run (for example, it has a job dependency). time_eligible=0 and state=4 (cancelled) is also normal - it means that a job of the previous type was cancelled before it was ever eligible to run.
time_start=0 and state=3 (completed) seems bad to me. I don't have any jobs like that in any of my databases. Can you show the output of this query?
select id_job,time_eligible,time_start,time_end,state from surface_job_table where time_start=0 and state=3;
Can you also show an example of an sreport command that has bad numbers? If possible, can you get the results of an sacct command for that same data so we can see if sacct is reporting differently from sreport?
===================================================================
We had one user piling a *lot* of jobs in, so the results of that query are pretty long:
MariaDB [slurm_acct_db]> select id_job,time_eligible,time_start,time_end,state from surface_job_table where time_start=0 and state=3;
+--------+---------------+------------+------------+-------+
| id_job | time_eligible | time_start | time_end | state |
+--------+---------------+------------+------------+-------+
| 307 | 0 | 0 | 1532710012 | 3 |
| 309 | 0 | 0 | 1532710012 | 3 |
| 2432 | 0 | 0 | 1532729834 | 3 |
| 4686 | 0 | 0 | 1532733088 | 3 |
| 7956 | 0 | 0 | 1532737695 | 3 |
| 11332 | 0 | 0 | 1532743030 | 3 |
| 12588 | 0 | 0 | 1532833916 | 3 |
| 13760 | 0 | 0 | 1533232922 | 3 |
| 13761 | 0 | 0 | 1533232923 | 3 |
| 13763 | 0 | 0 | 1533232923 | 3 |
| 13762 | 0 | 0 | 1533232923 | 3 |
<< I cut a bunch out here >>
| 45198 | 0 | 0 | 1533318743 | 3 |
| 45197 | 0 | 0 | 1533318705 | 3 |
| 45199 | 0 | 0 | 1533318742 | 3 |
| 45202 | 0 | 0 | 1533318709 | 3 |
| 45204 | 0 | 0 | 1533318709 | 3 |
| 45203 | 0 | 0 | 1533318714 | 3 |
+--------+---------------+------------+------------+-------+
18003 rows in set (0.05 sec)
MariaDB [slurm_acct_db]>
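The two detection queries above (the classic runaway check, time_end=0 with state > 2, and the extra check from this ticket, time_start=0 with state=3) can be sketched against a toy job table. This is a minimal simulation using Python's built-in sqlite3 as a stand-in for MariaDB; the column names match the real <cluster>_job_table, but the job rows are made up for illustration.

```python
# Hypothetical stand-in for the MariaDB <cluster>_job_table, using sqlite3
# so the two detection queries from this ticket can be run without a live
# slurmdbd. Only the columns the queries touch are modeled.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE surface_job_table (
    id_job INTEGER, time_eligible INTEGER,
    time_start INTEGER, time_end INTEGER, state INTEGER)""")

# State codes seen in the ticket: 0=pending, 3=completed, 4=cancelled.
conn.executemany(
    "INSERT INTO surface_job_table VALUES (?,?,?,?,?)",
    [
        (21, 1532624402, 1532624402, 1532624803, 3),  # normal completed job
        (31, 1532625425, 0, 1532625432, 4),           # cancelled before start (normal)
        (307, 0, 0, 1532710012, 3),                   # suspicious: completed, never started
    ],
)

# Classic runaway check: job past pending/running/suspended but no end time.
runaways = conn.execute(
    "SELECT id_job FROM surface_job_table WHERE time_end=0 AND state > 2"
).fetchall()
print(runaways)  # [] -- matches the "Empty set" result above

# The extra check from this ticket: completed jobs with no start time.
bad = conn.execute(
    "SELECT id_job FROM surface_job_table WHERE time_start=0 AND state=3"
).fetchall()
print(bad)  # [(307,)]
```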
Most of those jobs belong to one user, isaac4, with the utah account. The sreport that turned this up still has him at ~1000% (before running the 'sacctmgr show runawayjobs', isaac4 was at ~1600%):
[day36@surface86:~]$ sreport -t percent user topusage start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-25T23:59:59 (2160000 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
Cluster Login Proper Name Account Used Energy
--------- --------- --------------- --------------- ----------- --------
surface isaac4 Benjamin John + utah 970.45% 0.00%
surface silva50 James M. Silva lc 74.54% 0.00%
surface root root root 48.78% 0.00%
surface silva50 James M. Silva views 24.85% 0.00%
surface wang83 Jonathan M. Wa+ uiuc 23.67% 0.00%
surface tomaso Tomas Oppelstr+ cms 9.93% 0.00%
surface kaplan7 Alan David Kap+ msdf 8.08% 0.00%
surface lee1004 Donghwa Lee micphys 6.98% 0.00%
surface jortega Jason M. Ortega me 5.47% 0.00%
surface mundhetn Terrell Nathan+ hpcdl 1.63% 0.00%
[day36@surface86:~]$
isaac4 has too many jobs in the last month to allow easy pasting of sacct data, but I can put them in as an attachment if you like (23697 jobs). Here's the sacct output for silva50:
[day36@surface86:~]$ sacct -u silva50 -S 10/1/18 -E 10/26/18 -o 'jobid,eligible,start,end,state'
JobID Eligible Start End State
------------ ------------------- ------------------- ------------------- ----------
[day36@surface86:~]$
One hole in my hypothesis that the weird numbers in sreport are due to the 'time_eligible=0,state=3' jobs is that there are no jobs fitting that description for silva50 in the surface_job_table, and there are no jobs at all for silva50 / views:
MariaDB [slurm_acct_db]> select user,acct,id_assoc from surface_assoc_table where user='silva50';
+---------+-------+----------+
| user | acct | id_assoc |
+---------+-------+----------+
| silva50 | lc | 1718 |
| silva50 | views | 1798 |
+---------+-------+----------+
2 rows in set (0.00 sec)
MariaDB [slurm_acct_db]> select id_job,time_eligible,time_start,time_end,state from surface_job_table where id_assoc='1718';
+--------+---------------+------------+------------+-------+
| id_job | time_eligible | time_start | time_end | state |
+--------+---------------+------------+------------+-------+
| 21 | 1532624402 | 1532624402 | 1532624803 | 3 |
| 29 | 1532624898 | 1532624898 | 1532625075 | 3 |
| 31 | 1532625425 | 0 | 1532625432 | 4 |
| 32 | 1532625442 | 0 | 1532625745 | 4 |
| 33 | 1532625777 | 0 | 1532626173 | 4 |
| 37 | 1532627653 | 1532627653 | 1532627653 | 3 |
+--------+---------------+------------+------------+-------+
6 rows in set (0.00 sec)
MariaDB [slurm_acct_db]> select id_job,time_eligible,time_start,time_end,state from surface_job_table where id_assoc='1798';
Empty set (0.02 sec)
MariaDB [slurm_acct_db]>
Where is sreport actually getting its data for the 'user topusage' report? I had assumed it was the <cluster>_assoc_usage_<hour|day|month>_table, but silva50/views doesn't have anything in those:
MariaDB [slurm_acct_db]> select user,acct,id_assoc from surface_assoc_table where user='silva50';
+---------+-------+----------+
| user    | acct  | id_assoc |
+---------+-------+----------+
| silva50 | lc    |     1718 |
| silva50 | views |     1798 |
+---------+-------+----------+
2 rows in set (0.00 sec)
MariaDB [slurm_acct_db]> select * from surface_assoc_usage_hour_table where id='1798' and id_tres='4';
Empty set (0.01 sec)
MariaDB [slurm_acct_db]> select * from surface_assoc_usage_day_table where id='1798' and id_tres='4';
Empty set (0.01 sec)
MariaDB [slurm_acct_db]> select * from surface_assoc_usage_month_table where id='1798' and id_tres='4';
Empty set (0.03 sec)
but he still shows up with >20% in the topusage report (previous comment).
===================================================================
Nevermind. silva50/views doesn't have anything for id_tres=4, but it does have entries for id_tres=1:
MariaDB [slurm_acct_db]> select * from surface_assoc_usage_day_table where id='1798' and id_tres='1';
+---------------+------------+---------+------+---------+------------+------------+
| creation_time | mod_time   | deleted | id   | id_tres | time_start | alloc_secs |
+---------------+------------+---------+------+---------+------------+------------+
| 1540314376    | 1540486862 |       0 | 1798 |       1 | 1532502000 |   18204760 |
| 1540314376    | 1540486862 |       0 | 1798 |       1 | 1532588400 |   51563988 |
| 1540314376    | 1540486862 |       0 | 1798 |       1 | 1532674800 |   54604800 |
| 1540314376    | 1540486862 |       0 | 1798 |       1 | 1532761200 |   54604800 |
...
| 1540278000    | 1540486862 |       0 | 1798 |       1 | 1540191600 |   54604800 |
| 1540461601    | 1540486862 |       0 | 1798 |       1 | 1540278000 |   54604800 |
| 1540461601    | 1540486862 |       0 | 1798 |       1 | 1540364400 |   54604800 |
| 1540537200    | 1540537200 |       0 | 1798 |       1 | 1540450800 |   20476800 |
+---------------+------------+---------+------+---------+------------+------------+
93 rows in set (0.00 sec)
The sreport for -T node is even weirder though:
[day36@surface86:~]$ sreport -t percent user topusage start=10/1 cl=surface
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-28T23:59:59 (2419200 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
Cluster Login Proper Name Account Used Energy
--------- --------- --------------- --------------- ----------- --------
surface isaac4 Benjamin John + utah 866.70% 0.00%
surface silva50 James M. Silva lc 66.55% 0.00%
surface root root root 43.56% 0.00%
surface silva50 James M. Silva views 22.18% 0.00%
surface wang83 Jonathan M. Wa+ uiuc 21.14% 0.00%
surface tomaso Tomas Oppelstr+ cms 8.98% 0.00%
surface kaplan7 Alan David Kap+ msdf 7.21% 0.00%
surface jortega Jason M. Ortega me 6.45% 0.00%
surface lee1004 Donghwa Lee micphys 6.24% 0.00%
surface gromit William D. Kra+ lc 1.60% 0.00%
[day36@surface86:~]$ sreport -t percent -T node user topusage start=10/1 cl=surface
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-28T23:59:59 (2419200 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
Cluster Login Proper Name Account TRES Name Used
--------- --------- --------------- --------------- -------------- ----------
surface isaac4 Benjamin John + utah node 3249899206
surface silva50 James M. Silva lc node 0.00%
surface root root root node 36147400.0
surface silva50 James M. Silva views node 0.00%
surface wang83 Jonathan M. Wa+ uiuc node 7926475900
surface tomaso Tomas Oppelstr+ cms node 3366052000
surface kaplan7 Alan David Kap+ msdf node 2705004700
surface jortega Jason M. Ortega me node 2417730600
surface lee1004 Donghwa Lee micphys node 2338426400
surface gromit
===================================================================
Let's get those job records fixed first. I'm not certain that it will change the sreport output, but they need fixing regardless.
* Backup your database before making any changes by hand.
* Do a select, but order by time_end so that you can get the earliest time_end. I said desc so that the last record would have the earliest timestamp for my convenience, but do it however you want.
select id_job,time_eligible,time_start,time_end,state from surface_job_table where time_start=0 and state=3 order by time_end desc;
* Run this query to fix the jobs with time_start=0 and state=3 (completed):
update surface_job_table set time_start=time_end where time_start=0 and time_end!=0 and state=3;
The time_end!=0 condition is probably not necessary, but I included it anyway.
* Reset the last rollup time (in <cluster>_last_ran_table) to before the earliest time_end of all those jobs.
update surface_last_ran_table set hourly_rollup=<timestamp of earliest time_end>;
update surface_last_ran_table set daily_rollup=<timestamp of earliest time_end>;
update surface_last_ran_table set monthly_rollup=<timestamp of earliest time_end>;
* Wait for the rollup to happen (at the next hour).
* Re-run sreport. If there are still problems, I'd like to further investigate the possibility of bad data in the association usage tables.
===================================================================
Okay. I've cleaned up those time_start='0', state='3' entries in the surface_job_table, but it didn't fix the sreport output, and the surface_assoc_usage tables still show time allocated for, e.g., silva50/views/tres=1. So it looks like the table rollups aren't getting rid of those entries.
===================================================================
Yep, sreport topusage of nodes is definitely weird:
$ sreport -t percent -T node user topusage start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-29T23:59:59 (2505600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
Cluster Login Proper Name Account TRES Name Used
--------- --------- --------------- --------------- -------------- --------
byu marshall Marshall test node 48712300
Can you file that bug in a new ticket? I think it is independent of the rest of this bug.
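The job-record repair described above (backfill time_start from time_end for broken completed jobs, then rewind the last-ran rollup markers to before the earliest affected time_end) can be simulated end to end. This is a sketch only, using Python's sqlite3 as a stand-in for the MariaDB database; the table layouts are trimmed to the columns the queries touch and the rows are invented.

```python
# Sketch (not the real slurmdbd schema) of the two-step repair:
#  1. set time_start=time_end for completed jobs that never got a start time
#  2. rewind hourly/daily/monthly rollup markers so usage is recomputed
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE surface_job_table (
    id_job INTEGER, time_start INTEGER, time_end INTEGER, state INTEGER);
CREATE TABLE surface_last_ran_table (
    hourly_rollup INTEGER, daily_rollup INTEGER, monthly_rollup INTEGER);
INSERT INTO surface_job_table VALUES
    (307, 0, 1532710012, 3),          -- broken: completed but never started
    (21, 1532624402, 1532624803, 3);  -- fine: left untouched
INSERT INTO surface_last_ran_table VALUES (1540537200, 1540537200, 1540537200);
""")

# Find the earliest time_end among the broken records BEFORE fixing them,
# mirroring the "order by time_end" select in the ticket.
(earliest,) = conn.execute(
    "SELECT MIN(time_end) FROM surface_job_table "
    "WHERE time_start=0 AND state=3").fetchone()

# The fix query from the ticket, including the belt-and-suspenders time_end!=0.
conn.execute(
    "UPDATE surface_job_table SET time_start=time_end "
    "WHERE time_start=0 AND time_end!=0 AND state=3")

# Rewind all three rollup markers so the next rollup recalculates from there.
conn.execute(
    "UPDATE surface_last_ran_table SET hourly_rollup=?, daily_rollup=?, "
    "monthly_rollup=?", (earliest, earliest, earliest))

fixed = conn.execute(
    "SELECT time_start FROM surface_job_table WHERE id_job=307").fetchone()
print(fixed)  # (1532710012,)
```

In the real database the rollup itself then happens at the top of the next hour; nothing further needs to be run by hand.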
===================================================================
This sreport command does query the assoc usage tables; and since those never get reset, I think those tables may have bad data from the runaway jobs.
* Once again, backup your database before making any changes.
* Run these queries to reset the association tables to a time that is before all those runaway jobs and the other jobs with time_start=0 and time_end!=0 and state=3 (I used Oct 1 as an example):
update surface_assoc_usage_hour_table set alloc_secs=0 where time_start>=UNIX_TIMESTAMP('2018-10-01');
update surface_assoc_usage_day_table set alloc_secs=0 where time_start>=UNIX_TIMESTAMP('2018-10-01');
update surface_assoc_usage_month_table set alloc_secs=0 where time_start>=UNIX_TIMESTAMP('2018-10-01');
* Reset the last ran tables to the same time stamp
update surface_last_ran_table set hourly_rollup=UNIX_TIMESTAMP('2018-10-01');
update surface_last_ran_table set daily_rollup=UNIX_TIMESTAMP('2018-10-01');
update surface_last_ran_table set monthly_rollup=UNIX_TIMESTAMP('2018-10-01');
* The association usage tables will get re-calculated when rollup happens at the beginning of the next hour, so just wait for that. Then, re-run your sreport query.
* Assuming this does fix it, we'll need to do something to automate this fix. I think modifying sacctmgr show runaway command might be a good way to do it.
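The reset procedure above can also be simulated. Again this is only a sketch with sqlite3 standing in for MariaDB: only the hour table is modeled (the day and month tables work identically), the rows are invented, and the cutoff epoch is a hypothetical value for 2018-10-01 (the real value of UNIX_TIMESTAMP('2018-10-01') depends on the server's time zone).

```python
# Sketch of the assoc-usage reset: zero alloc_secs for rows at or after a
# cutoff, and move the last-ran markers back to that cutoff so the next
# rollup rebuilds the zeroed rows from the (now-corrected) job records.
import sqlite3

CUTOFF = 1538352000  # hypothetical epoch for 2018-10-01 00:00 (TZ-dependent)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE surface_assoc_usage_hour_table (
    id INTEGER, id_tres INTEGER, time_start INTEGER, alloc_secs INTEGER);
CREATE TABLE surface_last_ran_table (
    hourly_rollup INTEGER, daily_rollup INTEGER, monthly_rollup INTEGER);
INSERT INTO surface_assoc_usage_hour_table VALUES
    (1798, 1, 1532502000, 18204760),  -- before cutoff: left alone
    (1798, 1, 1540191600, 54604800);  -- after cutoff: zeroed, then re-rolled
INSERT INTO surface_last_ran_table VALUES (1540537200, 1540537200, 1540537200);
""")

# Zero out usage accumulated since the cutoff (hour table shown; the day and
# month tables get the same UPDATE in the real procedure).
conn.execute("UPDATE surface_assoc_usage_hour_table SET alloc_secs=0 "
             "WHERE time_start>=?", (CUTOFF,))

# Point all three rollup markers at the cutoff.
conn.execute("UPDATE surface_last_ran_table SET hourly_rollup=?, "
             "daily_rollup=?, monthly_rollup=?", (CUTOFF, CUTOFF, CUTOFF))

rows = conn.execute("SELECT time_start, alloc_secs "
                    "FROM surface_assoc_usage_hour_table "
                    "ORDER BY time_start").fetchall()
print(rows)  # [(1532502000, 18204760), (1540191600, 0)]
```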
Thank you Marshall! Setting alloc_secs to '0' in the assoc_usage tables and completely re-rolling everything fixed things up, so my sreport numbers are much happier now. I also filed bug 5954 on the 'sreport -T nodes -t percent user topusage' output.
===================================================================
Great, I'm glad that worked. Thanks for reporting that other issue. I've created an internal bug (bug 5959) to make sure we reset the assoc usage tables as needed. Closing as resolved/infogiven.

- Marshall