| Summary: | 300 Jobs showing as RUNNING in sacct, yet jobs are no longer in the queue | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | UAB Research Computing <RC_LICENSES> |
| Component: | Accounting | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart |
| Version: | 17.02.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | UAB | | |
Hi

Try running 'sacctmgr show runawayjobs'; it should clean up these jobs.
From sacctmgr man:
...
RunawayJobs
Used only with the list or show command to report current jobs
that have been orphaned on the local cluster and are now
runaway. If there are jobs in this state it will also give you an
option to "fix" them.
Dominik
Thanks, it reports that there are no runaway jobs:

$ sudo sacctmgr show RunAwayJobs
Runaway Jobs:
No runaway jobs found

Hi

Could you send me the MySQL record of one of the affected jobs? E.g.:

select * from test_job_table where id_job=<job_id>

Good news: the current version has fixed a couple of bugs that could cause a job to become a runaway.

P.S. Sorry for the first comment; I missed the first part of your comment and wrongly assumed this was the standard case.

Dominik

No worries. Here's the output:

MariaDB [slurm_acct_db]> select * from slurm_cluster_job_table where id_job="1001131";
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| job_db_inx | mod_time | deleted | account | admin_comment | array_task_str | array_max_tasks | array_task_pending | cpus_req | derived_ec | derived_es | exit_code | job_name | id_assoc | id_array_job | id_array_task | id_block | id_job | id_qos | id_resv | id_wckey | id_user | id_group | kill_requid | mem_req | nodelist | nodes_alloc | node_inx | partition | priority | state | timelimit | time_submit | time_eligible | time_start | time_end | time_suspended | gres_req | gres_alloc | gres_used | wckey | track_steps | tres_alloc | tres_req |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| 2479906 | 1518041640 | 0 | user1 | NULL | NULL | 0 | 0 | 1 | 256 | NULL | 256 | prisk16 | 197 | 1001131 | 5000 | NULL | 1001131 | 1 | 0 | 0 | 10623 | 10623 | -1 | 9223372036854807808 | c0043 | 1 | 42 | medium | 6320 | 5 | 1440 | 1518041085 | 1518041086 | 1518041283 | 1518041289 | 0 | | | | | 0 | 1=1,2=32000,4=1 | 1=1,2=32000,4=1 |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
1 row in set (0.00 sec)

I noticed that all of those stuck in a RUNNING state are the ".0" job steps:

$ sacct -P -u user1 --format=jobidraw,jobid,start,end,elapsed,state |grep -b1 RUN | head
39-1001718|1001131_587|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
107:1001718.0|1001131_587.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
180-1001723|1001131_592|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
248:1001723.0|1001131_592.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
321-1001726|1001131_595|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
389:1001726.0|1001131_595.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
462-1001730|1001131_599|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
530:1001730.0|1001131_599.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
603-1001731|1001131_600|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
671:1001731.0|1001131_600.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING

Any thoughts on how I can clean up these runaway job steps?

Hi

Sorry for the delay. Could you send me the output of two extra MySQL queries?

select * from slurm_cluster_job_table where id_array_job=1001131 and id_array_task=107;
select * from slurm_cluster_step_table where job_db_inx=<job_db_inx from upper query> and id_step=0 limit 1;

Dominik

Here you go:
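The pattern above (a FAILED array task whose ".0" step is still RUNNING) can be picked out mechanically. A minimal sketch, assuming the pipe-separated layout of `sacct -P --format=jobidraw,jobid,start,end,elapsed,state`; the embedded sample records are copied from the output above with the `grep -b` byte offsets stripped, and on a real cluster you would pipe sacct output in directly:

```shell
#!/bin/sh
# Sketch: flag job steps that sacct reports as RUNNING even though their
# parent job record has already FAILED.
sample='1001718|1001131_587|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
1001718.0|1001131_587.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING
1001723|1001131_592|2018-02-07T16:05:12|Unknown|103-16:32:33|FAILED
1001723.0|1001131_592.0|2018-02-07T16:05:18|Unknown|103-16:32:27|RUNNING'

orphans=$(printf '%s\n' "$sample" | awk -F'|' '
    $1 !~ /\./ { parent_state[$1] = $6; next }   # job record: remember its state
    {
        split($1, p, ".")                        # "1001718.0" -> parent "1001718"
        if ($6 == "RUNNING" && parent_state[p[1]] == "FAILED")
            print $2                             # orphaned step, e.g. 1001131_587.0
    }')
printf '%s\n' "$orphans"
```

On the sample data this prints the two orphaned step ids (1001131_587.0 and 1001131_592.0).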
MariaDB [slurm_acct_db]> select * from slurm_cluster_job_table where id_array_job=1001131 and
-> id_array_task=107;
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| job_db_inx | mod_time | deleted | account | admin_comment | array_task_str | array_max_tasks | array_task_pending | cpus_req | derived_ec | derived_es | exit_code | job_name | id_assoc | id_array_job | id_array_task | id_block | id_job | id_qos | id_resv | id_wckey | id_user | id_group | kill_requid | mem_req | nodelist | nodes_alloc | node_inx | partition | priority | state | timelimit | time_submit | time_eligible | time_start | time_end | time_suspended | gres_req | gres_alloc | gres_used | wckey | track_steps | tres_alloc | tres_req |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+------------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| 2479912 | 1518041251 | 0 | jrichman | NULL | NULL | 0 | 0 | 1 | 256 | NULL | 256 | prisk16 | 197 | 1001131 | 107 | NULL | 1001238 | 1 | 0 | 0 | 10623 | 10623 | -1 | 9223372036854807808 | c0071 | 1 | 70 | medium | 6320 | 5 | 1440 | 1518041085 | 1518041086 | 1518041088 | 1518041113 | 0 | | | | | 0 | 1=1,2=32000,4=1 | 1=1,2=32000,4=1 |
MariaDB [slurm_acct_db]> select * from slurm_cluster_step_table where job_db_inx=2479912 and id_step=0 limit 1;
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+------------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
| job_db_inx | deleted | exit_code | id_step | kill_requid | nodelist | nodes_alloc | node_inx | state | step_name | task_cnt | task_dist | time_start | time_end | time_suspended | user_sec | user_usec | sys_sec | sys_usec | max_pages | max_pages_task | max_pages_node | ave_pages | max_rss | max_rss_task | max_rss_node | ave_rss | max_vsize | max_vsize_task | max_vsize_node | ave_vsize | min_cpu | min_cpu_task | min_cpu_node | ave_cpu | act_cpufreq | consumed_energy | req_cpufreq_min | req_cpufreq | req_cpufreq_gov | max_disk_read | max_disk_read_task | max_disk_read_node | ave_disk_read | max_disk_write | max_disk_write_task | max_disk_write_node | ave_disk_write | tres_alloc |
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+------------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
| 2479912 | 0 | 256 | 0 | -1 | c0071 | 1 | 70 | 5 | R | 1 | 1 | 1518041103 | 1518041112 | 0 | 0 | 266058 | 0 | 100301 | 0 | 0 | 0 | 0 | 1216 | 0 | 0 | 1216 | 230916 | 0 | 0 | 230916 | 0 | 0 | 0 | 0 | 1294628 | 0 | 4294967294 | 4294967294 | 4294967294 | 0 | 65534 | 4294967294 | 0 | 0 | 65534 | 4294967294 | 0 | 1=1,2=32000,4=1 |
Hi

When I was reading this output I realised that I had sent you the wrong id_array_task. It should be:

select * from slurm_cluster_job_table where id_array_job=1001131 and id_array_task=600;
select * from slurm_cluster_step_table where job_db_inx=<job_db_inx from upper query> and id_step=0 limit 1;

Dominik

Oops, here are the updated query results:
MariaDB [slurm_acct_db]> select * from slurm_cluster_job_table where id_array_job=1001131 and
-> id_array_task=600;
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+----------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| job_db_inx | mod_time | deleted | account | admin_comment | array_task_str | array_max_tasks | array_task_pending | cpus_req | derived_ec | derived_es | exit_code | job_name | id_assoc | id_array_job | id_array_task | id_block | id_job | id_qos | id_resv | id_wckey | id_user | id_group | kill_requid | mem_req | nodelist | nodes_alloc | node_inx | partition | priority | state | timelimit | time_submit | time_eligible | time_start | time_end | time_suspended | gres_req | gres_alloc | gres_used | wckey | track_steps | tres_alloc | tres_req |
+------------+------------+---------+----------+---------------+----------------+-----------------+--------------------+----------+------------+------------+-----------+----------+----------+--------------+---------------+----------+---------+--------+---------+----------+---------+----------+-------------+---------------------+----------+-------------+----------+-----------+----------+-------+-----------+-------------+---------------+------------+----------+----------------+----------+------------+-----------+-------+-------------+-----------------+-----------------+
| 2480409 | 1518041190 | 0 | jrichman | NULL | NULL | 0 | 0 | 1 | 0 | NULL | 0 | prisk16 | 197 | 1001131 | 600 | NULL | 1001731 | 1 | 0 | 0 | 10623 | 10623 | -1 | 9223372036854807808 | c0083 | 1 | 82 | medium | 6320 | 5 | 1440 | 1518041085 | 1518041086 | 1518041112 | 0 | 0 | | | | | 0 | 1=1,2=32000,4=1 | 1=1,2=32000,4=1 |
MariaDB [slurm_acct_db]> select * from slurm_cluster_step_table where job_db_inx=2480409 and id_step=0 limit 1;
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+----------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+------------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
| job_db_inx | deleted | exit_code | id_step | kill_requid | nodelist | nodes_alloc | node_inx | state | step_name | task_cnt | task_dist | time_start | time_end | time_suspended | user_sec | user_usec | sys_sec | sys_usec | max_pages | max_pages_task | max_pages_node | ave_pages | max_rss | max_rss_task | max_rss_node | ave_rss | max_vsize | max_vsize_task | max_vsize_node | ave_vsize | min_cpu | min_cpu_task | min_cpu_node | ave_cpu | act_cpufreq | consumed_energy | req_cpufreq_min | req_cpufreq | req_cpufreq_gov | max_disk_read | max_disk_read_task | max_disk_read_node | ave_disk_read | max_disk_write | max_disk_write_task | max_disk_write_node | ave_disk_write | tres_alloc |
+------------+---------+-----------+---------+-------------+----------+-------------+----------+-------+-----------+----------+-----------+------------+----------+----------------+----------+-----------+---------+----------+-----------+----------------+----------------+-----------+---------+--------------+--------------+---------+-----------+----------------+----------------+-----------+------------+--------------+--------------+---------+-------------+-----------------+-----------------+-------------+-----------------+---------------+--------------------+--------------------+---------------+----------------+---------------------+---------------------+----------------+-----------------+
| 2480409 | 0 | 0 | 0 | -1 | c0083 | 1 | 82 | 1 | R | 1 | 1 | 1518041118 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4294967294 | 0 | 0 | 0 | 0 | 0 | 4294967294 | 4294967294 | 4294967294 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1=1,2=32000,4=1 |
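For reference when reading these rows: state holds the numeric job state code. The job record above shows state=5 (FAILED) with time_end=0, while its .0 step record shows state=1 (RUNNING), also with time_end=0, which matches what sacct reports. A small decoding helper; the mapping follows the base job_states enum in slurm.h, and the higher flag bits are deliberately ignored in this sketch:

```shell
#!/bin/sh
# Decode the base Slurm job/step state codes stored in the accounting tables.
# Mapping follows the job_states enum in slurm.h (base states only).
slurm_state_name() {
    case "$1" in
        0) echo PENDING ;;
        1) echo RUNNING ;;
        2) echo SUSPENDED ;;
        3) echo COMPLETED ;;
        4) echo CANCELLED ;;
        5) echo FAILED ;;
        6) echo TIMEOUT ;;
        7) echo NODE_FAIL ;;
        *) echo "UNKNOWN($1)" ;;
    esac
}

slurm_state_name 5   # the job row above: FAILED
slurm_state_name 1   # the step row above: RUNNING
```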
Hi

Could you send me the slurmdbd log from 2018-02-07? If the log file is unavailable, maybe you remember seeing a pattern like "Deadlock found when trying to get lock" in slurmdbd.log.

Dominik

$ sudo grep 2018-02-07 /var/log/slurmdbd-20180404
[2018-02-07T12:48:15.294] error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'master' (113)
[2018-02-07T12:48:15.326] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
[2018-02-07T12:48:23.332] error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'master' (113)
[2018-02-07T12:48:23.332] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
[2018-02-07T12:48:31.338] error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'master' (113)
[2018-02-07T12:48:34.768] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
[2018-02-07T12:49:04.434] error: chdir(/var/log): Permission denied
[2018-02-07T12:49:04.434] chdir to /var/tmp
[2018-02-07T12:49:05.153] slurmdbd version 17.02.2 started
[2018-02-07T12:49:06.325] Terminate signal (SIGINT or SIGTERM) received
[2018-02-07T14:46:06.674] error: chdir(/var/log): Permission denied
[2018-02-07T14:46:06.674] chdir to /var/tmp
[2018-02-07T14:46:06.872] slurmdbd version 17.02.2 started
[2018-02-07T17:00:00.627] error: We have more allocated time than is possible (150436790528 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T16:00:00 - 2018-02-07T17:00:00 tres 2
[2018-02-07T17:00:00.627] error: We have more time than is possible (91869138000+9077730000+0)(100946868000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T16:00:00 - 2018-02-07T17:00:00 tres 2
[2018-02-07T18:00:00.558] error: We have more allocated time than is possible (156030123488 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T17:00:00 - 2018-02-07T18:00:00 tres 2
[2018-02-07T18:00:00.558] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T17:00:00 - 2018-02-07T18:00:00 tres 2
[2018-02-07T19:00:00.811] error: We have more allocated time than is possible (153417836784 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T18:00:00 - 2018-02-07T19:00:00 tres 2
[2018-02-07T19:00:00.811] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T18:00:00 - 2018-02-07T19:00:00 tres 2
[2018-02-07T20:00:00.553] error: We have more allocated time than is possible (149169110272 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T19:00:00 - 2018-02-07T20:00:00 tres 2
[2018-02-07T20:00:00.553] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T19:00:00 - 2018-02-07T20:00:00 tres 2
[2018-02-07T21:00:00.769] error: We have more allocated time than is possible (146499260912 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T20:00:00 - 2018-02-07T21:00:00 tres 2
[2018-02-07T21:00:00.769] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T20:00:00 - 2018-02-07T21:00:00 tres 2
[2018-02-07T22:00:01.005] error: We have more allocated time than is possible (144580179584 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T21:00:00 - 2018-02-07T22:00:00 tres 2
[2018-02-07T22:00:01.005] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T21:00:00 - 2018-02-07T22:00:00 tres 2
[2018-02-07T23:00:00.521] error: We have more allocated time than is possible (146453365376 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T22:00:00 - 2018-02-07T23:00:00 tres 2
[2018-02-07T23:00:00.521] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T22:00:00 - 2018-02-07T23:00:00 tres 2
[2018-02-08T00:00:00.751] error: We have more allocated time than is possible (144993052800 > 91869138000) for cluster slurm_cluster(25519205) from 2018-02-07T23:00:00 - 2018-02-08T00:00:00 tres 2
[2018-02-08T00:00:00.751] error: We have more time than is possible (91869138000+9280800000+0)(101149938000) > 91869138000 for cluster slurm_cluster(25519205) from 2018-02-07T23:00:00 - 2018-02-08T00:00:00 tres 2

This appears to follow a reboot:

reboot   system boot  3.10.0-693.17.1. Wed Feb 7 12:47 - 16:34 (124+02:47)

We found that slurmdbd was failing to start after the reboot because the database was not yet ready. We fixed that by creating a copy of the systemd service file in /etc/systemd/system/slurmdbd.service and adding, under [Service]:

Restart=always
RestartSec=5

That said, this event appears to have been triggered by slurmdbd being down at the time.

Hi

I still can't reproduce this. You are using quite an old Slurm version. The current 17.11 release contains several security patches and dozens of bug fixes. I can't point to the exact commit that would fix this, but it could be a symptom of one of the older bugs that have already been fixed.

Dominik

Thanks for the feedback. We are planning to upgrade to 17.11 this summer during our maintenance window.

Hi

Can we close this ticket? If this occurs in the current Slurm version, we will try to find and fix the source of the problem.

Dominik

Yes, please close it.
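The restart workaround described above can also be achieved without copying the whole unit file, using a systemd drop-in. A sketch; the drop-in path follows the standard systemd `*.service.d` convention and is not something from this ticket:

```ini
# /etc/systemd/system/slurmdbd.service.d/restart.conf  (hypothetical drop-in)
# Keeps the vendor unit intact while adding the same restart policy the site
# applied by copying the full service file.
[Service]
Restart=always
RestartSec=5
```

After creating the file, run `systemctl daemon-reload` and restart slurmdbd. Another option worth considering, assuming a local MariaDB unit named `mariadb.service`, is ordering slurmdbd after the database with `After=mariadb.service` so the daemon is not started before MySQL is reachable.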
Howdy, we have identified at least 300 jobs showing as RUNNING in sacct even though they have long since left the job queue.

$ sacctmgr show runaway
Runaway Jobs:
No runaway jobs found

Yet I see 300 of these for a particular user (user name redacted):

$ squeue --user *****
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

$ sacct -u ***** --format=jobid,start,end,elapsed,state |grep RUN | head
1001131_587+ 2018-02-07T16:05:18 Unknown 102-17:12:23 RUNNING
1001131_592+ 2018-02-07T16:05:18 Unknown 102-17:12:23 RUNNING
1001131_595+ 2018-02-07T16:05:18 Unknown 102-17:12:23 RUNNING
1001131_599+ 2018-02-07T16:05:18 Unknown 102-17:12:23 RUNNING
1001131_600+ 2018-02-07T16:05:18 Unknown 102-17:12:23 RUNNING
1001131_602+ 2018-02-07T16:05:18 Unknown 102-17:12:23 RUNNING
...

Any suggestion on how to clear the RUNNING state on these jobs? Also, is there a convenient way to set the completed time to something sane, such as the maximum runtime allowed for the job, so that our accounting isn't skewed by these results?