Ticket 8131

Summary: A lot of runaway jobs
Product: Slurm Reporter: Hjalti Sveinsson <hjalti.sveinsson>
Component: slurmdbd Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Version: 18.08.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8241
Site: deCODE
Attachments: runawayjobs
slurmctld log
slurmdbd log file
sdiag

Description Hjalti Sveinsson 2019-11-21 05:58:22 MST
Created attachment 12366 [details]
runawayjobs

Hello,

Currently we are seeing a lot of jobs in the runaway state on our cluster. I have uploaded a file listing these jobs.

root@ru-lhpc-head:~# cat /root/deCODE-runawayjobs-11-21-2019.txt | grep ^3 | wc -l
4399

What could explain this?

regards,
Hjalti Sveinsson
Comment 1 Ben Roberts 2019-11-21 10:29:01 MST
Hi Hjalti,

Can I have you send me the slurmctld and slurmdbd log files that cover at least the beginning of the day today?  I would also like to see the output of 'sdiag'.

Thanks,
Ben
Comment 2 Hjalti Sveinsson 2019-11-22 08:59:57 MST
Created attachment 12383 [details]
slurmctld log
Comment 3 Ben Roberts 2019-11-22 09:57:36 MST
Thank you for sending the slurmctld logs.  I can see in these logs that there are errors related to connecting to slurmdbd.

[2019-11-21T13:03:28.809] error: slurmdbd: Failure sending message: 11: Connection refused
[2019-11-21T13:03:29.097] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-11-21T13:03:29.097] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused


I looked at several jobs in the list of runaway jobs and I can see that they did complete, but it looks like they were unable to be recorded in the database because of these connection errors.  I only saw the slurmctld.log file in the tarball you sent.  I would like to see if there's anything in the slurmdbd logs that might show what is happening.  Can you send those logs as well?  

I don't see any of these connection refused messages since yesterday at 16:02:25.  The 'sdiag' output has a field to show the "DBD Agent queue size."  I would like to make sure that number is not steadily growing or maxed out at a very large number currently.  Can you send the sdiag output, and if possible send the output from a few runs, a few minutes apart?  I want to make sure that you don't still have job data that is going to be lost.  
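If it's convenient, something like this untested sketch could capture those samples (it assumes 'sdiag' is on your PATH and that the field label matches your version's output):

```shell
# Untested sketch: print the DBD agent queue size every 2 minutes, 5 times.
# Assumes 'sdiag' is on PATH; the field label may vary between versions.
for i in 1 2 3 4 5; do
    date
    sdiag | grep 'DBD Agent queue size'
    sleep 120
done
```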

Thank you,
Ben
Comment 4 Hjalti Sveinsson 2019-11-25 05:14:22 MST
Created attachment 12389 [details]
slurmdbd log file
Comment 5 Hjalti Sveinsson 2019-11-25 07:25:15 MST
Created attachment 12391 [details]
sdiag
Comment 6 Ben Roberts 2019-11-25 11:22:12 MST
Hi Hjalti,

Thanks for responding with the other information I requested.  In going over the logs it looks like the issue is probably related to load on slurmdbd.  The logs are filled with messages that look like this:
[2019-11-21T00:00:02.414] debug:  REQUEST_PERSIST_INIT: CLUSTER:lhpc VERSION:8448 UID:1065 IP:172.17.148.155 CONN:11


These log entries look like they're coming from sacct requests.  Two users are generating most of these requests: UID 3342 and UID 1065.  The sdiag output confirms this in the section that shows the number of RPCs for each user.  The sdiag output you sent covers a 2-minute window and already shows thousands of requests for each of these two users:
Remote Procedure Call statistics by user
        bioprod         (    1065) count:3324   ave_time:69922  total_time:232423666
        helgain         (    3342) count:2478   ave_time:45172  total_time:111936428
        root            (       0) count:1752   ave_time:165721 total_time:290343705


The sdiag output tracks requests to slurmctld and the logs I mentioned above are showing requests to slurmdbd, but it indicates that they probably have some sort of script that is generating a lot of traffic.  The slurmdbd logs show that these requests are coming in from a lot of different IPs, so my guess is that it's something that runs inside their jobs.  
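As a rough way to confirm this, you could count the REQUEST_PERSIST_INIT entries per source IP (a sketch based on the log line format quoted above; 'slurmdbd.log' is a placeholder filename):

```shell
# Sketch: tally REQUEST_PERSIST_INIT log entries per source IP.
# The IP:x.x.x.x token format is taken from the log line quoted above.
grep 'REQUEST_PERSIST_INIT' slurmdbd.log \
  | sed -n 's/.*IP:\([0-9.]*\).*/\1/p' \
  | sort | uniq -c | sort -rn | head
```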

As another data point, I trimmed the logs to just have the entries from 00:00:00 on 11/21 through 15:51:26 on 11/22.  Then I looked at the counts and there are roughly 365,000 of these requests in that ~40 hour window.

$ wc -l slurmdbd-last-10000000-lines.txt 
365359 slurmdbd-last-10000000-lines.txt

$ grep -c REQUEST_PERSIST_INIT slurmdbd-last-10000000-lines.txt 
365083


This comes out to about 2.5 requests per second.  If these were all evenly distributed it would probably be fine, but at times a large number of requests comes in at once, causing slurmdbd to lose its connection to the database.  Here's an example of the logs when this happens.

[2019-11-21T14:13:39.301] error: mysql_query failed: 2013 Lost connection to MySQL server during query
delete from "lhpc_job_table" where time_submit <= 1512086400 && time_end != 0 LIMIT 50000
[2019-11-21T14:13:39.301] error: Couldn't remove old data from job_table table
[2019-11-21T14:13:39.303] Warning: Note very large processing time from daily_rollup for lhpc: usec=3668990577 began=13:12:30.312
[2019-11-21T14:13:39.303] error: Cluster lhpc rollup failed
[2019-11-21T14:13:39.303] error: mysql_commit failed: 2006 MySQL server has gone away
[2019-11-21T14:13:39.303] error: rollback failed
[2019-11-21T14:13:39.927] debug:  Problem sending response to connection 18(172.17.147.210) uid(598)
[2019-11-21T14:13:39.927] debug:  cluster lhpc has disconnected




Can you get in touch with these users to confirm whether they have something in their job script that is generating a lot of requests to slurmctld and slurmdbd?  If you can share what their job scripts are doing we can see if there is some way to get the information they're after in a more efficient way.  

I'm also curious to know if you use Bright Cluster Manager.  We have seen some cases where their internal accounting can generate a lot of traffic.  

Thank you,
Ben
Comment 7 Hjalti Sveinsson 2019-12-05 03:59:02 MST
Hi, 

Yes, we know about these queries to the database. We have an in-house wrapper for job submission to the cluster that uses squeue and sacct frequently to check the status of running jobs. That is the reason for all the queries.

It seems that this is happening while slurmdbd is running its archive step.

For example, we had a crash today at 01:26:

[2019-12-05T01:26:00.255] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
update cluster_table set mod_time=1575507359, control_host='', control_port=0 where name='lhpc' && control_host='172.17.147.210' && control_port=6817;
[2019-12-05T01:26:00.275] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
[2019-12-05T01:26:11.588] debug:  Log file re-opened
[2019-12-05T01:26:11.592] debug:  Munge authentication plugin loaded
[2019-12-05T01:26:12.409] Accounting storage MYSQL plugin loaded
[2019-12-05T01:26:12.540] debug:  post user: couldn't get a uid for user charlottea
[2019-12-05T01:26:12.587] debug:  post user: couldn't get a uid for user gudrunja
[2019-12-05T01:26:12.595] debug:  post user: couldn't get a uid for user hakong
[2019-12-05T01:26:12.603] debug:  post user: couldn't get a uid for user heidab
[2019-12-05T01:26:12.648] debug:  post user: couldn't get a uid for user lucasw
[2019-12-05T01:26:12.683] debug:  post user: couldn't get a uid for user pauli
[2019-12-05T01:26:12.684] debug:  post user: couldn't get a uid for user pauln
[2019-12-05T01:26:12.695] debug:  post user: couldn't get a uid for user sebastianr
[2019-12-05T01:26:12.765] debug:  post user: couldn't get a uid for user yingh
[2019-12-05T01:26:12.765] slurmdbd version 18.08.7 started

Before that, this was in the log:

[2019-12-05T00:55:59.002] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
update "lhpc_job_table" set nodelist='lhpc-0512', account='bioprod', `partition`='cpu_hog', node_inx='255', gres_req='', gres_alloc='', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=4096,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=4096,4=1,5=1', work_dir='/nfs/odinn/datavault/damp/tmp/vol16/ont_build38_v3/ONTLLgpc/vol01/191007_PAD83258', time_start=1575506252, job_name='191007_PAD83258_complete', state=greatest(state, 1), nodes_alloc=1, id_qos=1, id_assoc=5, id_resv=0, timelimit=1440, mem_req=4096, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, time_eligible=1575506252, mod_time=UNIX_TIMESTAMP() where job_db_inx=247777793
[2019-12-05T00:55:59.002] debug:  Problem sending response to connection 12(172.17.147.210) uid(598)
[2019-12-05T00:55:59.010] debug:  cluster lhpc has disconnected

[2019-12-05T00:37:16.642] debug:  Storing job_table archive for lhpc at /nfs/fs1/slurm_archive_data//lhpc_job_table_archive_2018-01-09T00:00:00_2018-01-09T23:59:59
[2019-12-05T00:37:16.843] 0(as_mysql_archive.c:3446) query
delete from "lhpc_job_table" where time_submit <= 1515542399 && time_end != 0 LIMIT 50000

[2019-12-05T00:27:28.851] debug:  Purging lhpc_job_table before 1515542399
[2019-12-05T00:27:28.851] 0(as_mysql_archive.c:3150) query
select account, array_max_tasks, nodes_alloc, id_assoc, id_array_job, id_array_task, id_block, derived_ec, derived_es, exit_code, timelimit, time_eligible, time_end, id_group, job_db_inx, id_job, kill_requid, mcs_label, job_name, nodelist, node_inx, pack_job_id, pack_job_offset, `partition`, priority, id_qos, cpus_req, mem_req, id_resv, time_start, state, time_submit, time_suspended, track_steps, id_user, wckey, id_wckey, work_dir, tres_alloc, tres_req from "lhpc_job_table" where time_submit < 1515542399 && time_end != 0 order by time_submit asc

[2019-12-05T00:11:09.086] 0(as_mysql_archive.c:3240) query
select time_submit from "lhpc_job_table" where time_submit <= 1515542399 && time_end != 0 order by time_submit asc LIMIT 1

[2019-12-05T00:00:00.465] 0(as_mysql_archive.c:3240) query
select time_start from "lhpc_step_table" where time_start <= 1523231999 && time_end != 0 order by time_start asc LIMIT 1

And now if I run "sacctmgr show runawayjobs" we have a few jobs there:

[root@ru-lhpc-head-db-01 slurm_archive_data]# sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database
          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
41570975          assoc    cpu_hog       lhpc    RUNNING 2019-12-04T17:31:08             Unknown
41573856     likelyhood    cpu_hog       lhpc    RUNNING 2019-12-04T17:59:20             Unknown
41599905      popsvar16    cpu_hog       lhpc    RUNNING 2019-12-04T23:43:16             Unknown

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)

 (You have 30 seconds to decide)
(N/y):

So it seems that the archiving mechanism in Slurm is causing these issues.
Comment 8 Hjalti Sveinsson 2019-12-05 04:01:19 MST
And no, we don't use Bright Cluster Manager.
Comment 9 Ben Roberts 2019-12-05 14:17:45 MST
Hi Hjalti,

I'm sorry to hear that this gave you problems again.  When that connection timed out, the pending updates to the database failed; when those updates were for jobs, the jobs then show up as runaway jobs.  There may be some tuning we can do to the database to keep it from timing out.  Can I have you log into the database and send the output of the following queries:
show variables like 'innodb_buffer_pool_size';
show variables like 'innodb_lock_wait_timeout';


How much memory do you have on the system running MySQL that you could allocate to the buffer pool?  I'm also curious if there is something we can do to optimize the job submission wrapper you're using.  Are you able to send that for me to look at?  

The archiving can add demands to an already busy database.  Looking at your previous logs, I do see that there was still a good amount of traffic around 01:26.  I assume this is from a script and wouldn't necessarily be reliant on users submitting jobs at that time.  Is that right?  

Thanks,
Ben
Comment 12 Ben Roberts 2019-12-17 15:53:15 MST
Hi Hjalti,

I wanted to follow up and see if you've had a chance to look at the recommendations I sent in my last update.  One of my colleagues also pointed out that there were some updates made related to archiving in 18.08.8 that might be helpful in your case.  Here are the relevant entries from the NEWS file:
 -- Upon archive file name collision, create new archive file instead of
    overwriting the old one to prevent lost records.
 -- Limit archive files to 50000 records per file so that archiving large
    databases will succeed.


I still think it would be a good idea to optimize the database settings and any scripts, but moving to 18.08.8 would also be worthwhile.  

Thanks,
Ben
Comment 13 Hjalti Sveinsson 2019-12-20 05:43:45 MST
(In reply to Ben Roberts from comment #9)

Hi, 

Here are the results from your queries; sorry for taking so long to answer.

MariaDB [(none)]> show variables like 'innodb_buffer_pool_size';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 10737418240 |
+-------------------------+-------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 1800  |
+--------------------------+-------+
1 row in set (0.00 sec)

Another thing I noticed: our Slurm database is getting quite large. Is there any way for us to truncate the database? I suspect the next Slurm upgrade we do will otherwise take a long time to finish.

MariaDB [(none)]> SELECT table_schema AS "Database",  ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS "Size (MB)"  FROM information_schema.TABLES  GROUP BY table_schema;
+--------------------+-----------+
| Database           | Size (MB) |
+--------------------+-----------+
| information_schema |      0.07 |
| mysql              |      0.63 |
| performance_schema |      0.00 |
| slurm_acc          | 198458.36 |
+--------------------+-----------+
4 rows in set (0.20 sec)

Strange that it says 198 GB, since the data directory for MariaDB says it's using 317 GB:

[root@ru-lhpc-head-db-01 mysql]# du -h /u0/mysql
1008K   /u0/mysql/mysql
212K    /u0/mysql/performance_schema
360K    /u0/mysql/slurm_acc
317G    /u0/mysql

We have 64GB of RAM on the Slurmdbd node.

I will have to get confirmation that it's okay to send the job submission code to you. If it is, I'll create a tarball and send you the code; it would be best if I could send it to you directly in a private message so it's not visible to everyone.

You say 18.08.8, but I was actually thinking of doing an upgrade to 19.x.x in early 2020. Would you not recommend that?
Comment 14 Ben Roberts 2019-12-20 09:06:10 MST
Thanks for the information, Hjalti.  It does look like your buffer pool size and timeout values have already been increased from the defaults and should be OK.

Slurm does have a mechanism in place to remove old data from the database.  In your slurmdbd.conf you can define Purge*After parameters, specifying that you want to purge events, jobs, reservations, steps, suspend times, transactions, and usage after a specified amount of time.  When you specify a time period, the purge happens at the beginning of each of those periods.  For example, if you specify that you only want to keep the last 12 months of data, anything older than that will be purged at the beginning of the next month.  If you want this to happen sooner than the beginning of the next month, you can specify 365 days instead, and the purge will happen every night.

If you haven't done a purge yet and there are years of accumulated data, I would advise purging it in smaller chunks.  If you have 5 years of data and only want to keep the last 2 years' worth, I would recommend purging anything older than 4 years the first time, anything older than 3 years the next, and then leaving it at 2 years.  If a very large number of jobs goes through your system, you might want to make those increments 6 or 3 months.  As I mentioned before, you can convert those time periods to days so the purges happen nightly until you reach the retention period you wish to stay at.

In addition to purging the data, you can configure slurmdbd to archive the data to a file before it is removed from the database with the Archive* parameters.  You can read more about archiving this data along with purging it in the slurmdbd.conf documentation page:
https://slurm.schedmd.com/slurmdbd.conf.html
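For reference, a slurmdbd.conf excerpt along these lines might look like the following (the parameter names come from that documentation; the values and archive path are examples only, not recommendations for your site):

```
# Example slurmdbd.conf excerpt -- values and path are illustrative only
ArchiveDir=/path/to/archive
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeStepAfter=12months
PurgeSuspendAfter=12months
```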

I'll send you an email you can respond to if you are able to send a copy of your submission script.  

You can certainly go to 19.05 instead of 18.08.8.  I just recommended 18.08.8 since I know it's usually easier to do a minor upgrade than a major one.  When you do go to 19.05 you would just want to make sure you get 19.05.4 or later as this is when the fixes were introduced to the 19.05 branch.  

Thanks,
Ben
Comment 15 Ben Roberts 2020-01-02 15:26:09 MST
Hi Hjalti,

I wanted to check back with you and see if you needed any further help with this ticket.  Please let me know if you have any questions about purging the old data or upgrading.  

Thanks,
Ben
Comment 16 Ben Roberts 2020-01-10 09:07:46 MST
Hi Hjalti,

I believe the information I sent should have helped, and I haven't heard any additional questions.  I'm going to close the ticket, but feel free to update it if something does come up.

Thanks,
Ben