Hello, In the continuing misery of bug 3365, bug 4202, bug 4249, we are finding that some of the jobs that were running at the time of the slurmctld crash (bug 3365 recent comment), do not have completion records, however have state COMPLETED. For some reason sacctmgr show runawayjobs doesn't see them: edison01:~ # sacctmgr show runawayjobs Runaway Jobs: No runaway jobs found edison01:~ # edison01:~ # sacct -j 7269673,7269678,7269679,7269680,7269682,7269683,7269684,7269685 --format=job,start,end,state,nnodes,cputimeraw JobID Start End State NNodes CPUTimeRAW ------------ ------------------- ------------------- ---------- -------- ---------- 7269673 2017-10-09T21:59:37 Unknown COMPLETED 5000 30794640000 7269678 2017-10-09T22:00:50 Unknown COMPLETED 4 24621696 7269679 2017-10-09T22:00:50 Unknown COMPLETED 4 24621696 7269680 2017-10-09T22:00:50 Unknown COMPLETED 4 24621696 7269682 2017-10-09T22:06:22 Unknown COMPLETED 200 1227897600 7269683 2017-10-09T22:06:22 Unknown COMPLETED 200 1227897600 7269684 2017-10-09T22:06:22 Unknown COMPLETED 200 1227897600 7269685 2017-10-09T22:06:22 Unknown COMPLETED 200 1227897600 edison01:~ # edison01:~ # squeue -j 7269673,7269678,7269679,7269680,7269682,7269683,7269684,7269685 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) edison01:~ # (scontrol job also doesn't find them) Thankfully one is a 5000 node job and so was obviously discoverable when trying to understand how edison delivered 2x the max possible hours yesterday. I'm planning on manually setting end times for these jobs (also notice there are no step records). I'm comfortable assigning the lack of records is due to the slurmctld crash, but would like guidance as to the best way to fix the problem. My plan is to send an end time, and then ensure the last ran table re-rolls usage from the 9th. however, the bigger concern is why sacctmgr show runawayjobs (which I rely upon to find these cases) isn't detecting these. Thanks, Doug
this is what i did: MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269673; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430280 | 7269673 | 1507611577 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611966 where job_db_inx=23430280; Query OK, 1 row affected (0.09 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269673; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430280 | 7269673 | 1507611577 | 1507611966 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269678; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430281 | 7269678 | 1507611650 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611856 where job_db_inx=23430281; Query OK, 1 row affected (0.10 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269678; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430281 | 7269678 | 1507611650 | 1507611856 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269679; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430282 | 7269679 | 1507611650 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507611889 where job_db_inx=23430282; Query OK, 1 row affected (0.08 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269679; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430282 | 7269679 | 1507611650 | 1507611889 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269680; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430283 | 7269680 | 1507611650 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612489 where job_db_inx=23430283; Query OK, 1 row affected (0.04 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269680; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430283 | 7269680 | 1507611650 | 1507612489 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269682; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430285 | 7269682 | 1507611982 | 0 | +------------+---------+------------+----------+ 1 row in set (0.01 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612014 where job_db_inx=23430285; Query OK, 1 row affected (0.29 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269682; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430285 | 7269682 | 1507611982 | 1507612014 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269683; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430286 | 7269683 | 1507611982 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612014 where job_db_inx=23430286; Query OK, 1 row affected (0.04 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269683; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430286 | 7269683 | 1507611982 | 1507612014 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269684; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430287 | 7269684 | 1507611982 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612052 where job_db_inx=23430287; Query OK, 1 row affected (0.07 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269684; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430287 | 7269684 | 1507611982 | 1507612052 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269685; +------------+---------+------------+----------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+----------+ | 23430288 | 7269685 | 1507611982 | 0 | +------------+---------+------------+----------+ 1 row in set (0.00 sec) MariaDB [edison_slurm_acct_db]> update edison_job_table set time_end=1507612012 where job_db_inx=23430288; Query OK, 1 row affected (0.06 sec) Rows matched: 1 Changed: 1 Warnings: 0 MariaDB [edison_slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from edison_job_table where id_job=7269685 -> ; +------------+---------+------------+------------+ | job_db_inx | id_job | time_start | time_end | +------------+---------+------------+------------+ | 23430288 | 7269685 | 1507611982 | 1507612012 | +------------+---------+------------+------------+ 1 row in set (0.00 sec) A long running roll is still running and looking at the queries it is performing its not even close to this time range yet. So I'm not updating the edison_last_ran_table.
Doug, > For some reason sacctmgr show runawayjobs doesn't see them The reason those jobs aren't getting found by sacctmgr show runawayjobs is because they're in the completed state. sacctmgr show runawayjobs looks for jobs in the running or pending states. I'm not exactly sure why they got stored in the database without end times. I noticed that all those jobs are in the slurmdbd log from bug 4249, and the slurmdbd being down is likely related. Can you upload a more recent slurmdbd log? > I'm planning on manually setting end times for these jobs (also notice there > are no step records). > > I'm comfortable assigning the lack of records is due to the slurmctld crash, > but would like guidance as to the best way to fix the problem. > > My plan is to send an end time, and then ensure the last ran table re-rolls > usage from the 9th. That's what I would do as well.
Doug, I was able to reproduce the situation where jobs got to a completed state without an end time. It likely happened because - the job was short and finished before the job record was sent from the agent queue to the database, - the end of the job was put in the agent queue after the reservation that failed in bug 4249, - because the reservation was failing, the agent queue was holding onto the end time of the job, hence the lack of an end time. If this theory is correct, since the issue in bug 4249 was resolved, the end times for those jobs should be in the database, and you shouldn't need to manually update the database nor re-roll. Since you already updated the end times in the database, you still shouldn't need to change last_ran_table, since rolling up happens every hour anyway. So basically, sacctmgr showrunawayjobs is working as expected, and those records probably would have been updated correctly with end times on their own after resolving bug 4249. Is this still an issue, or has it been resolved?
Hi Doug, I'm wondering if this has been resolved and if you have any further questions. If it is resolved and you don't have any more questions, I'll close the bug.
I'm marking this as resolved/timed out. Please reopen if you have further questions.
OK, sounds good. The only thing I would say is that I think runawayjobs should show any job that doesn't have an end time but isn't in slurmctld's active memory, regardless of whether or not the COMPLETED state is set. Thanks, Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Thu, Oct 26, 2017 at 1:11 PM, <bugs@schedmd.com> wrote: > Marshall Garey <marshall@schedmd.com> changed bug 4253 > <https://bugs.schedmd.com/show_bug.cgi?id=4253> > What Removed Added > Status UNCONFIRMED RESOLVED > Resolution --- TIMEDOUT > > *Comment # 14 <https://bugs.schedmd.com/show_bug.cgi?id=4253#c14> on bug > 4253 <https://bugs.schedmd.com/show_bug.cgi?id=4253> from Marshall Garey > <marshall@schedmd.com> * > > I'm marking this as resolved/timed out. Please reopen if you have further > questions. > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > >
(In reply to Doug Jacobsen from comment #15) > OK, sounds good. The only thing I would say is that I think runawayjobs > should show any job that doesn't have an end time but isn't in slurmctld's > active memory, regardless of whether or not the COMPLETED state is set. No problem, I'll keep this ticket on hand and look into it as soon as I have time. I'm not sure how easy this will be, though.