| Summary: | sacctmgr show runawayjobs not finding runawayjobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmdbd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | jfbotts |
| Version: | 17.02.7 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Doug Jacobsen
2017-10-11 10:41:49 MDT
this is what i did:
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269673;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430280 | 7269673 | 1507611577 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611966 where job_db_inx=23430280;
Query OK, 1 row affected (0.09 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269673;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430280 | 7269673 | 1507611577 | 1507611966 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269678;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430281 | 7269678 | 1507611650 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611856 where job_db_inx=23430281;
Query OK, 1 row affected (0.10 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269678;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430281 | 7269678 | 1507611650 | 1507611856 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269679;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430282 | 7269679 | 1507611650 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611889 where job_db_inx=23430282;
Query OK, 1 row affected (0.08 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269679;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430282 | 7269679 | 1507611650 | 1507611889 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269680;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430283 | 7269680 | 1507611650 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612489 where job_db_inx=23430283;
Query OK, 1 row affected (0.04 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269680;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430283 | 7269680 | 1507611650 | 1507612489 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269682;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430285 | 7269682 | 1507611982 | 0 |
+------------+---------+------------+----------+
1 row in set (0.01 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612014 where job_db_inx=23430285;
Query OK, 1 row affected (0.29 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269682;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430285 | 7269682 | 1507611982 | 1507612014 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269683;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430286 | 7269683 | 1507611982 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612014 where job_db_inx=23430286;
Query OK, 1 row affected (0.04 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269683;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430286 | 7269683 | 1507611982 | 1507612014 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269684;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430287 | 7269684 | 1507611982 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612052 where job_db_inx=23430287;
Query OK, 1 row affected (0.07 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269684;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430287 | 7269684 | 1507611982 | 1507612052 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269685;
+------------+---------+------------+----------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+----------+
| 23430288 | 7269685 | 1507611982 | 0 |
+------------+---------+------------+----------+
1 row in set (0.00 sec)
MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612012 where job_db_inx=23430288;
Query OK, 1 row affected (0.06 sec)
Rows matched: 1 Changed: 1 Warnings: 0
MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269685
-> ;
+------------+---------+------------+------------+
| job_db_inx | id_job | time_start | time_end |
+------------+---------+------------+------------+
| 23430288 | 7269685 | 1507611982 | 1507612012 |
+------------+---------+------------+------------+
1 row in set (0.00 sec)
A long running roll is still running and looking at the queries it is performing its not even close to this time range yet. So I'm not updating the edison_last_ran_table.
Doug, > For some reason sacctmgr show runawayjobs doesn't see them The reason those jobs aren't getting found by sacctmgr show runawayjobs is because they're in the completed state. sacctmgr show runawayjobs looks for jobs in the running or pending states. I'm not exactly sure why they got stored in the database without end times. I noticed that all those jobs are in the slurmdbd log from bug 4249, and the slurmdbd being down is likely related. Can you upload a more recent slurmdbd log? > I'm planning on manually setting end times for these jobs (also notice there > are no step records). > > I'm comfortable assigning the lack of records is due to the slurmctld crash, > but would like guidance as to the best way to fix the problem. > > My plan is to send an end time, and then ensure the last ran table re-rolls > usage from the 9th. That's what I would do as well. Doug, I was able to reproduce the situation where jobs got to a completed state without an end time. It likely happened because - the job was short and finished before the job record was sent from the agent queue to the database, - the end of the job was put in the agent queue after the reservation that failed in bug 4249, - because the reservation was failing, the agent queue was holding onto the end time of the job, hence the lack of an end time. If this theory is correct, since the issue in bug 4249 was resolved, the end times for those jobs should be in the database, and you shouldn't need to manually update the database nor re-roll. Since you already updated the end times in the database, you still shouldn't need to change last_ran_table, since rolling up happens every hour anyway. So basically, sacctmgr showrunawayjobs is working as expected, and those records probably would have been updated correctly with end times on their own after resolving bug 4249. Is this still an issue, or has it been resolved? Hi Doug, I'm wondering if this has been resolved and if you have any further questions. If it is resolved and you don't have any more questions, I'll close the bug. I'm marking this as resolved/timed out. Please reopen if you have further questions. OK, sounds good. The only thing I would say is that I think runawayjobs should show any job that doesn't have an end time but isn't in slurmctld's active memory, regardless of whether or not the COMPLETED state is set. Thanks, Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Thu, Oct 26, 2017 at 1:11 PM, <bugs@schedmd.com> wrote: > Marshall Garey <marshall@schedmd.com> changed bug 4253 > <https://bugs.schedmd.com/show_bug.cgi?id=4253> > What Removed Added > Status UNCONFIRMED RESOLVED > Resolution --- TIMEDOUT > > *Comment # 14 <https://bugs.schedmd.com/show_bug.cgi?id=4253#c14> on bug > 4253 <https://bugs.schedmd.com/show_bug.cgi?id=4253> from Marshall Garey > <marshall@schedmd.com> * > > I'm marking this as resolved/timed out. Please reopen if you have further > questions. > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > > (In reply to Doug Jacobsen from comment #15) > OK, sounds good. The only thing I would say is that I think runawayjobs > should show any job that doesn't have an end time but isn't in slurmctld's > active memory, regardless of whether or not the COMPLETED state is set. No problem, I'll keep this ticket on hand and look into it as soon as I have time. I'm not sure how easy this will be, though. |