Ticket 5195

Summary: 300 Jobs showing as RUNNING in sacct, yet jobs are no longer in the queue
Product: Slurm Reporter: UAB Research Computing <RC_LICENSES>
Component: Accounting    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart
Version: 17.02.2   
Hardware: Linux   
OS: Linux   
Site: UAB Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description UAB Research Computing 2018-05-21 09:23:14 MDT
Howdy, we have identified at least 300 jobs showing as RUNNING in sacct yet they have long since left the job queue.

$ sacctmgr show runaway
Runaway Jobs: No runaway jobs found

Yet I see 300 of these for a particular user (user name redacted):
$ squeue --user *****
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

$ sacct -u ***** --format=jobid,start,end,elapsed,state |grep RUN  | head
1001131_587+ 2018-02-07T16:05:18             Unknown 102-17:12:23    RUNNING
1001131_592+ 2018-02-07T16:05:18             Unknown 102-17:12:23    RUNNING
1001131_595+ 2018-02-07T16:05:18             Unknown 102-17:12:23    RUNNING
1001131_599+ 2018-02-07T16:05:18             Unknown 102-17:12:23    RUNNING
1001131_600+ 2018-02-07T16:05:18             Unknown 102-17:12:23    RUNNING
1001131_602+ 2018-02-07T16:05:18             Unknown 102-17:12:23    RUNNING
...


Any suggestion on how to clear the RUNNING state on these jobs? 

Also, is there a convenient way to set the completed time to something sane, like the maximum runtime allowed for the job, so that our accounting isn't skewed by these results?
Comment 1 Dominik Bartkiewicz 2018-05-21 09:57:26 MDT
Hi

Try using 'sacctmgr show runawayjobs'; it should clean up these jobs.

From sacctmgr man:
...
RunawayJobs
              Used only with the list or show command to report current jobs
              that have been orphaned on the local cluster and are now
              runaway. If there are jobs in this state it will also give you
              an option to "fix" them.


Dominik
Comment 2 UAB Research Computing 2018-05-21 10:11:32 MDT
Thanks, it reports that there are no runaway jobs:

$ sudo sacctmgr show RunAwayJobs
Runaway Jobs: No runaway jobs found
Comment 3 Dominik Bartkiewicz 2018-05-22 07:07:06 MDT
Hi

Could you send me the MySQL record of one of the affected jobs?

e.g.:
select * from test_job_table where id_job=<job_id>

Good news: the current version fixes a couple of bugs that could cause runaway jobs.

P.S. Sorry about the first comment; I missed the first part of your report and wrongly assumed this was the standard case.


Dominik
Comment 4 UAB Research Computing 2018-05-22 08:29:27 MDT
No worries.

Here's the output:
MariaDB [slurm_acct_db]> select * from slurm_cluster_job_table where id_job="1001131";
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| job_db_inx | mod_time   | deleted | account  | admin_comment | array_task_str | array_max_tasks | array_task_pending | cpus_req | derived_ec | derived_es | exit_code | job_name | id_assoc | id_array_job | id_array_task | id_block | id_job  | id_qos | id_resv | id_wckey | id_user | id_group | kill_requid | mem_req             | nodelist | nodes_alloc | node_inx | partition | priority | state | timelimit | time_submit | time_eligible | time_start | time_end   | time_suspended | gres_req | gres_alloc | gres_used | wckey | track_steps | tres_alloc      | tres_req        |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
|    2479906 | 1518041640 |       0 | user1 | NULL          | NULL           |               0 |                  0 |        1 |        256 | NULL       |       256 | prisk16  |      197 |      1001131 |          5000 | NULL     | 1001131 |      1 |       0 |        0 |   10623 |    10623 |          -1 | 9223372036854807808 | c0043    |           1 | 42       | medium    |     6320 |     5 |      1440 |  1518041085 |    1518041086 | 1518041283 | 1518041289 |              0 |          |            |           |       |           0 | 1=1,2=32000,4=1 | 1=1,2=32000,4=1 |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
1 row in set (0.00 sec)
Comment 5 UAB Research Computing 2018-05-22 08:39:18 MDT
I noticed that all of those stuck in a RUNNING state are .0 job steps:

$ sacct -P -u user1 --format=jobidraw,jobid,start,end,elapsed,state |grep -b1 RUN  | head
39-1001718|1001131_587|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
107:1001718.0|1001131_587.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
180-1001723|1001131_592|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
248:1001723.0|1001131_592.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
321-1001726|1001131_595|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
389:1001726.0|1001131_595.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
462-1001730|1001131_599|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
530:1001730.0|1001131_599.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
603-1001731|1001131_600|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
671:1001731.0|1001131_600.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
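The pattern above (parent job FAILED, .0 step still RUNNING) can be pulled out of `sacct -P` output with a small awk filter. This is a sketch over sample input in the same field order as the command above; in practice you would pipe real `sacct -P -u <user> --format=jobidraw,jobid,start,end,elapsed,state` output into the function instead of the here-doc:

```shell
# Filter sacct -P rows for step records (jobid contains a dot) whose
# state column is still RUNNING. Field order matches the sacct command
# shown above: jobidraw|jobid|start|end|elapsed|state.
list_stuck_steps() {
  awk -F'|' '$6 == "RUNNING" && $2 ~ /\./ { print $2 }'
}

# Sample input mimicking the output above:
list_stuck_steps <<'EOF'
1001718|1001131_587|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
1001718.0|1001131_587.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
EOF
```

This prints only `1001131_587.0`, the stuck step, which makes it easy to count the affected records or feed them into a cleanup script.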
Comment 6 UAB Research Computing 2018-06-04 15:46:44 MDT
Any thoughts on how I can clean up these runaway job steps?
Comment 7 Dominik Bartkiewicz 2018-06-05 08:05:25 MDT
Hi

Sorry for the delay.

Could you send me the output of two extra MySQL queries?

select * from slurm_cluster_job_table where id_array_job=1001131 and id_array_task=107;
select * from slurm_cluster_step_table where job_db_inx=<job_db_inx from upper query> and  id_step=0 limit 1;

Dominik
Comment 8 UAB Research Computing 2018-06-05 08:29:52 MDT
Here you go:

MariaDB [slurm_acct_db]> select * from slurm_cluster_job_table where id_array_job=1001131 and
    -> id_array_task=107;
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| job_db_inx | mod_time   | deleted | account  | admin_comment | array_task_str | array_max_tasks | array_task_pending | cpus_req | derived_ec | derived_es | exit_code | job_name | id_assoc | id_array_job | id_array_task | id_block | id_job  | id_qos | id_resv | id_wckey | id_user | id_group | kill_requid | mem_req             | nodelist | nodes_alloc | node_inx | partition | priority | state | timelimit | time_submit | time_eligible | time_start | time_end   | time_suspended | gres_req | gres_alloc | gres_used | wckey | track_steps | tres_alloc      | tres_req        |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
|    2479912 | 1518041251 |       0 | jrichman | NULL          | NULL           |               0 |                  0 |        1 |        256 | NULL       |       256 | prisk16  |      197 |      1001131 |           107 | NULL     | 1001238 |      1 |       0 |        0 |   10623 |    10623 |          -1 | 9223372036854807808 | c0071    |           1 | 70       | medium    |     6320 |     5 |      1440 |  1518041085 |    1518041086 | 1518041088 | 1518041113 |              0 |          |            |           |       |           0 | 1=1,2=32000,4=1 | 1=1,2=32000,4=1 |


MariaDB [slurm_acct_db]> select * from slurm_cluster_step_table where job_db_inx=2479912 and  id_step=0 limit 1;
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+------------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
| job_db_inx | deleted | exit_code | id_step | kill_requid | nodelist | nodes_alloc | node_inx | state | step_name | task_cnt | task_dist | time_start | time_end   | time_suspended | user_sec | user_usec | sys_sec | sys_usec | max_pages | max_pages_task | max_pages_node | ave_pages | max_rss | max_rss_task | max_rss_node | ave_rss | max_vsize | max_vsize_task | max_vsize_node | ave_vsize | min_cpu | min_cpu_task | min_cpu_node | ave_cpu | act_cpufreq | consumed_energy | req_cpufreq_min | req_cpufreq | req_cpufreq_gov | max_disk_read | max_disk_read_task | max_disk_read_node | ave_disk_read | max_disk_write | max_disk_write_task | max_disk_write_node | ave_disk_write | tres_alloc      |
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+------------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
|    2479912 |       0 |       256 |       0 |          -1 | c0071    |           1 | 70       |     5 | R         |        1 |         1 | 1518041103 | 1518041112 |              0 |        0 |    266058 |       0 |   100301 |         0 |              0 |              0 |         0 |    1216 |            0 |            0 |    1216 |    230916 |              0 |              0 |    230916 |       0 |            0 |            0 |       0 |     1294628 |               0 |      4294967294 |  4294967294 |      4294967294 |             0 |              65534 |         4294967294 |             0 |              0 |               65534 |          4294967294 |              0 | 1=1,2=32000,4=1 |
Comment 9 Dominik Bartkiewicz 2018-06-05 08:49:15 MDT
Hi

While reading this output, I realized I sent you the wrong id_array_task; it should be:

select * from slurm_cluster_job_table where id_array_job=1001131 and id_array_task=600;
select * from slurm_cluster_step_table where job_db_inx=<job_db_inx from upper query> and  id_step=0 limit 1;
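The two lookups can also be combined into a single join, which avoids copying job_db_inx by hand (a sketch only; table names follow the dumps in this ticket, and the cluster-name prefix varies per site):

```sql
-- Hypothetical combined form of the two queries above.
SELECT s.*
  FROM slurm_cluster_job_table  j
  JOIN slurm_cluster_step_table s ON s.job_db_inx = j.job_db_inx
 WHERE j.id_array_job  = 1001131
   AND j.id_array_task = 600
   AND s.id_step       = 0
 LIMIT 1;
```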


Dominik
Comment 10 UAB Research Computing 2018-06-05 09:02:08 MDT
Oops, here's the updated query results:

MariaDB [slurm_acct_db]> select * from slurm_cluster_job_table where id_array_job=1001131 and
    -> id_array_task=600;
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+----------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| job_db_inx | mod_time   | deleted | account  | admin_comment | array_task_str | array_max_tasks | array_task_pending | cpus_req | derived_ec | derived_es | exit_code | job_name | id_assoc | id_array_job | id_array_task | id_block | id_job  | id_qos | id_resv | id_wckey | id_user | id_group | kill_requid | mem_req             | nodelist | nodes_alloc | node_inx | partition | priority | state | timelimit | time_submit | time_eligible | time_start | time_end | time_suspended | gres_req | gres_alloc | gres_used | wckey | track_steps | tres_alloc      | tres_req        |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+----------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
|    2480409 | 1518041190 |       0 | jrichman | NULL          | NULL           |               0 |                  0 |        1 |          0 | NULL       |         0 | prisk16  |      197 |      1001131 |           600 | NULL     | 1001731 |      1 |       0 |        0 |   10623 |    10623 |          -1 | 9223372036854807808 | c0083    |           1 | 82       | medium    |     6320 |     5 |      1440 |  1518041085 |    1518041086 | 1518041112 |        0 |              0 |          |            |           |       |           0 | 1=1,2=32000,4=1 | 1=1,2=32000,4=1 |


MariaDB [slurm_acct_db]> select * from slurm_cluster_step_table where job_db_inx=2480409 and  id_step=0 limit 1;
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+----------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+------------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
| job_db_inx | deleted | exit_code | id_step | kill_requid | nodelist | nodes_alloc | node_inx | state | step_name | task_cnt | task_dist | time_start | time_end | time_suspended | user_sec | user_usec | sys_sec | sys_usec | max_pages | max_pages_task | max_pages_node | ave_pages | max_rss | max_rss_task | max_rss_node | ave_rss | max_vsize | max_vsize_task | max_vsize_node | ave_vsize | min_cpu    | min_cpu_task | min_cpu_node | ave_cpu | act_cpufreq | consumed_energy | req_cpufreq_min | req_cpufreq | req_cpufreq_gov | max_disk_read | max_disk_read_task | max_disk_read_node | ave_disk_read | max_disk_write | max_disk_write_task | max_disk_write_node | ave_disk_write | tres_alloc      |
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+----------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+------------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
|    2480409 |       0 |         0 |       0 |          -1 | c0083    |           1 | 82       |     1 | R         |        1 |         1 | 1518041118 |        0 |              0 |        0 |         0 |       0 |        0 |         0 |              0 |              0 |         0 |       0 |            0 |            0 |       0 |         0 |              0 |              0 |         0 | 4294967294 |            0 |            0 |       0 |           0 |               0 |      4294967294 |  4294967294 |      4294967294 |             0 |                  0 |                  0 |             0 |              0 |                   0 |                   0 |              0 | 1=1,2=32000,4=1 |
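For reference, the step row above is stuck with state = 1 (RUNNING) and time_end = 0, and its job row also has time_end = 0. A site could in principle close such records out with direct SQL along the following lines. This is a hypothetical, unsupported sketch, not a SchedMD-recommended procedure: the state code 3 (COMPLETED) and the end time of time_start plus the 1440-minute limit are assumptions, and slurmdbd should be stopped and the database backed up before touching these tables:

```sql
-- HYPOTHETICAL, UNSUPPORTED cleanup for the record shown above.
-- 1518127512 = time_start (1518041112) + the 1440-minute time limit,
-- i.e. the "maximum runtime" end time suggested in the description.
UPDATE slurm_cluster_step_table
   SET state = 3,                 -- assumed terminal state (COMPLETED)
       time_end = 1518127512
 WHERE job_db_inx = 2480409 AND id_step = 0;

UPDATE slurm_cluster_job_table
   SET time_end = 1518127512
 WHERE job_db_inx = 2480409;
```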
Comment 14 Dominik Bartkiewicz 2018-06-07 12:28:54 MDT
Hi

Could you send me the slurmdbd log from 2018-02-07?

If the log file is unavailable, perhaps you remember seeing a pattern like "Deadlock found when trying to get lock" in slurmdbd.log.

Dominik
Comment 15 UAB Research Computing 2018-06-11 15:33:03 MDT
$ sudo grep 2018-02-07 /var/log/slurmdbd-20180404
[2018-02-07T12:48:15.294] error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'master' (113)
[2018-02-07T12:48:15.326] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2018-02-07T12:48:23.332] error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'master' (113)
[2018-02-07T12:48:23.332] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2018-02-07T12:48:31.338] error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'master' (113)
[2018-02-07T12:48:34.768] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2018-02-07T12:49:04.434] error: chdir(/var/log): Permission denied
[2018-02-07T12:49:04.434] chdir to /var/tmp
[2018-02-07T12:49:05.153] slurmdbd version 17.02.2 started
[2018-02-07T12:49:06.325] Terminate signal (SIGINT or SIGTERM) received
[2018-02-07T14:46:06.674] error: chdir(/var/log): Permission denied
[2018-02-07T14:46:06.674] chdir to /var/tmp
[2018-02-07T14:46:06.872] slurmdbd version 17.02.2 started
[2018-02-07T17:00:00.627] error: We have more allocated time than is possible (150436790528 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T16:00:00 - 2018-02-07T17:00:00 tres 2
[2018-02-07T17:00:00.627] error: We have more time than is possible (91869138000+9077730000+0)(100946868000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T16:00:00 - 2018-02-07T17:00:00 tres 2
[2018-02-07T18:00:00.558] error: We have more allocated time than is possible (156030123488 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T17:00:00 - 2018-02-07T18:00:00 tres 2
[2018-02-07T18:00:00.558] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T17:00:00 - 2018-02-07T18:00:00 tres 2
[2018-02-07T19:00:00.811] error: We have more allocated time than is possible (153417836784 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T18:00:00 - 2018-02-07T19:00:00 tres 2
[2018-02-07T19:00:00.811] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T18:00:00 - 2018-02-07T19:00:00 tres 2
[2018-02-07T20:00:00.553] error: We have more allocated time than is possible (149169110272 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T19:00:00 - 2018-02-07T20:00:00 tres 2
[2018-02-07T20:00:00.553] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T19:00:00 - 2018-02-07T20:00:00 tres 2
[2018-02-07T21:00:00.769] error: We have more allocated time than is possible (146499260912 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T20:00:00 - 2018-02-07T21:00:00 tres 2
[2018-02-07T21:00:00.769] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T20:00:00 - 2018-02-07T21:00:00 tres 2
[2018-02-07T22:00:01.005] error: We have more allocated time than is possible (144580179584 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T21:00:00 - 2018-02-07T22:00:00 tres 2
[2018-02-07T22:00:01.005] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T21:00:00 - 2018-02-07T22:00:00 tres 2
[2018-02-07T23:00:00.521] error: We have more allocated time than is possible (146453365376 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T22:00:00 - 2018-02-07T23:00:00 tres 2
[2018-02-07T23:00:00.521] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T22:00:00 - 2018-02-07T23:00:00 tres 2
[2018-02-08T00:00:00.751] error: We have more allocated time than is possible (144993052800 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T23:00:00 - 2018-02-08T00:00:00 tres 2
[2018-02-08T00:00:00.751] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T23:00:00 - 2018-02-08T00:00:00 tres 2
Comment 16 UAB Research Computing 2018-06-11 16:54:54 MDT
This appears to be following a reboot:

reboot   system boot  3.10.0-693.17.1. Wed Feb  7 12:47 - 16:34 (124+02:47)

We found that Slurmdbd was failing to start after reboot due to the database not being ready.

We fixed that by creating a copy of the systemd service file at /etc/systemd/system/slurmdbd.service and adding:

[Service]
...
Restart=always
RestartSec=5
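On recent systemd, the same effect can be had with a drop-in (e.g. via `systemctl edit slurmdbd`) instead of copying the whole unit file; the path below is the conventional drop-in location, shown as a sketch:

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
# Restart slurmdbd if it exits (e.g. because the database is not yet up
# at boot), retrying every 5 seconds.
[Service]
Restart=always
RestartSec=5
```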

That said, this event appears to have been triggered by slurmdbd being down at the time.
Comment 18 Dominik Bartkiewicz 2018-06-13 03:59:58 MDT
Hi

I still can't reproduce this.
You are using quite an old Slurm version. The current 17.11 release contains several security patches and dozens of bug fixes. I can't point to the exact commit that would fix this, but it could be a symptom of an older bug that has already been fixed.

Dominik
Comment 19 UAB Research Computing 2018-06-13 12:58:43 MDT
Thanks for the feedback. We are planning to upgrade to 17.11 this summer during our maintenance window.
Comment 22 Dominik Bartkiewicz 2018-06-28 07:21:22 MDT
Hi

Can we close this ticket?
If this occurs in the current Slurm version, we will try to find and fix the source of the problem.

Dominik
Comment 23 UAB Research Computing 2018-06-28 10:16:40 MDT
Yes, please close it