Ticket 4253

Summary: sacctmgr show runawayjobs not finding runawayjobs
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmdbdAssignee: Marshall Garey <marshall>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: jfbotts
Version: 17.02.7   
Hardware: Cray XC   
OS: Linux   
Site: NERSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Doug Jacobsen 2017-10-11 10:41:49 MDT
Hello,

In the continuing misery of bug 3365, bug 4202, bug 4249, we are finding that some of the jobs that were running at the time of the slurmctld crash (bug 3365 recent comment), do not have completion records, however have state COMPLETED.

For some reason sacctmgr show runawayjobs doesn't see them:

edison01:~ # sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found
edison01:~ #


edison01:~ # sacct -j 7269673,7269678,7269679,7269680,7269682,7269683,7269684,7269685 --format=job,start,end,state,nnodes,cputimeraw
       JobID               Start                 End      State   NNodes CPUTimeRAW
------------ ------------------- ------------------- ---------- -------- ----------
7269673      2017-10-09T21:59:37             Unknown  COMPLETED     5000 30794640000
7269678      2017-10-09T22:00:50             Unknown  COMPLETED        4   24621696
7269679      2017-10-09T22:00:50             Unknown  COMPLETED        4   24621696
7269680      2017-10-09T22:00:50             Unknown  COMPLETED        4   24621696
7269682      2017-10-09T22:06:22             Unknown  COMPLETED      200 1227897600
7269683      2017-10-09T22:06:22             Unknown  COMPLETED      200 1227897600
7269684      2017-10-09T22:06:22             Unknown  COMPLETED      200 1227897600
7269685      2017-10-09T22:06:22             Unknown  COMPLETED      200 1227897600
edison01:~ #


edison01:~ # squeue -j 7269673,7269678,7269679,7269680,7269682,7269683,7269684,7269685
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
edison01:~ #


(scontrol job also doesn't find them)

Thankfully one is a 5000 node job and so was obviously discoverable when trying to understand how edison delivered 2x the max possible hours yesterday.

I'm planning on manually setting end times for these jobs (also notice there are no step records).

I'm comfortable assigning the lack of records is due to the slurmctld crash, but would like guidance as to the best way to fix the problem.

My plan is to send an end time, and then ensure the last ran table re-rolls usage from the 9th.

however, the bigger concern is why sacctmgr show runawayjobs (which I rely upon to find these cases) isn't detecting these.

Thanks,
Doug
Comment 1 Doug Jacobsen 2017-10-11 11:13:35 MDT
this is what i did:

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269673;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430280 | 7269673 | 1507611577 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611966 where job_db_inx=23430280;
Query OK, 1 row affected (0.09 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269673;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430280 | 7269673 | 1507611577 | 1507611966 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269678;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430281 | 7269678 | 1507611650 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611856 where job_db_inx=23430281;
Query OK, 1 row affected (0.10 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269678;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430281 | 7269678 | 1507611650 | 1507611856 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269679;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430282 | 7269679 | 1507611650 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611889 where job_db_inx=23430282;
Query OK, 1 row affected (0.08 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269679;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430282 | 7269679 | 1507611650 | 1507611889 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269680;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430283 | 7269680 | 1507611650 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612489 where job_db_inx=23430283;
Query OK, 1 row affected (0.04 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269680;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430283 | 7269680 | 1507611650 | 1507612489 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269682;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430285 | 7269682 | 1507611982 |        0 |
+------------+---------+------------+----------+
1 row in set (0.01 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612014 where job_db_inx=23430285;
Query OK, 1 row affected (0.29 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269682;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430285 | 7269682 | 1507611982 | 1507612014 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269683;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430286 | 7269683 | 1507611982 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612014 where job_db_inx=23430286;
Query OK, 1 row affected (0.04 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269683;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430286 | 7269683 | 1507611982 | 1507612014 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269684;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430287 | 7269684 | 1507611982 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612052 where job_db_inx=23430287;
Query OK, 1 row affected (0.07 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269684;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430287 | 7269684 | 1507611982 | 1507612052 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269685;
+------------+---------+------------+----------+
| job_db_inx | id_job  | time_start | time_end |
+------------+---------+------------+----------+
|   23430288 | 7269685 | 1507611982 |        0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)

MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612012 where job_db_inx=23430288;
Query OK, 1 row affected (0.06 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269685
    -> ;
+------------+---------+------------+------------+
| job_db_inx | id_job  | time_start | time_end   |
+------------+---------+------------+------------+
|   23430288 | 7269685 | 1507611982 | 1507612012 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)



A long running roll is still running and looking at the queries it is performing its not even close to this time range yet.  So I'm not updating the edison_last_ran_table.
Comment 4 Marshall Garey 2017-10-12 17:23:02 MDT
Doug,

> For some reason sacctmgr show runawayjobs doesn't see them

The reason those jobs aren't getting found by sacctmgr show runawayjobs is because they're in the completed state. sacctmgr show runawayjobs looks for jobs in the running or pending states. I'm not exactly sure why they got stored in the database without end times. I noticed that all those jobs are in the slurmdbd log from bug 4249, and the slurmdbd being down is likely related. Can you upload a more recent slurmdbd log?

> I'm planning on manually setting end times for these jobs (also notice there
> are no step records).
> 
> I'm comfortable assigning the lack of records is due to the slurmctld crash,
> but would like guidance as to the best way to fix the problem.
> 
> My plan is to send an end time, and then ensure the last ran table re-rolls
> usage from the 9th.

That's what I would do as well.
Comment 12 Marshall Garey 2017-10-13 13:58:17 MDT
Doug,

I was able to reproduce the situation where jobs got to a completed state without an end time. It likely happened because

- the job was short and finished before the job record was sent from the agent queue to the database,
- the end of the job was put in the agent queue after the reservation that failed in bug 4249,
- because the reservation was failing, the agent queue was holding onto the end time of the job, hence the lack of an end time.

If this theory is correct, since the issue in bug 4249 was resolved, the end times for those jobs should be in the database, and you shouldn't need to manually update the database nor re-roll.

Since you already updated the end times in the database, you still shouldn't need to change last_ran_table, since rolling up happens every hour anyway.

So basically, sacctmgr showrunawayjobs is working as expected, and those records probably would have been updated correctly with end times on their own after resolving bug 4249.

Is this still an issue, or has it been resolved?
Comment 13 Marshall Garey 2017-10-19 10:59:57 MDT
Hi Doug, I'm wondering if this has been resolved and if you have any further questions. If it is resolved and you don't have any more questions, I'll close the bug.
Comment 14 Marshall Garey 2017-10-26 14:11:57 MDT
I'm marking this as resolved/timed out. Please reopen if you have further questions.
Comment 15 Doug Jacobsen 2017-10-26 17:11:36 MDT
OK, sounds good.  The only thing I would say is that I think runawayjobs
should show any job that doesn't have an end time but isn't in slurmctld's
active memory, regardless of whether or not the COMPLETED state is set.

Thanks,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Thu, Oct 26, 2017 at 1:11 PM, <bugs@schedmd.com> wrote:

> Marshall Garey <marshall@schedmd.com> changed bug 4253
> <https://bugs.schedmd.com/show_bug.cgi?id=4253>
> What Removed Added
> Status UNCONFIRMED RESOLVED
> Resolution --- TIMEDOUT
>
> *Comment # 14 <https://bugs.schedmd.com/show_bug.cgi?id=4253#c14> on bug
> 4253 <https://bugs.schedmd.com/show_bug.cgi?id=4253> from Marshall Garey
> <marshall@schedmd.com> *
>
> I'm marking this as resolved/timed out. Please reopen if you have further
> questions.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>
Comment 16 Marshall Garey 2017-11-02 11:24:57 MDT
(In reply to Doug Jacobsen from comment #15)
> OK, sounds good.  The only thing I would say is that I think runawayjobs
> should show any job that doesn't have an end time but isn't in slurmctld's
> active memory, regardless of whether or not the COMPLETED state is set.

No problem, I'll keep this ticket on hand and look into it as soon as I have time. I'm not sure how easy this will be, though.