| Summary: | "Invalid account or partition", but account and partition are present | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Ford <fordste5> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | nate |
| Version: | 18.08.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | MSU | | |
| Attachments: | Slurmctld log, slurmdbd log, slurm.conf, sdiag output, Slurmctld log, slurmdbd log, slurmdbd.log debug2, slurmctld log debug2, Slurmctld Core Dump, Logging patch, slurmdbd log 05/01, dmesg | | |
Description
Steve Ford
2019-02-27 13:00:16 MST
Hi Steve,

Would you attach your slurmctld.log and slurmdbd.log? I suspect that some changes made to the slurmdbd were not propagated to the slurmctld. I will look over the logs to see if I notice anything that may clue us in to the cause. Would you also answer these questions for me:

1. Does the slurmdbd sit on a different system than the slurmctld, and are there any firewalls between the two?
2. Is this a fairly new account credential that was added?
3. Do you use partition names in your associations? e.g. `sacctmgr show assoc`

---

Hello Jason,

Our slurmctld and slurmdbd are on the same server. This was not a new user or account. We do not use partitions in our associations. Unfortunately, the logs from that time are no longer available.

However, we are seeing another issue that looks like a slurmctld/slurmdbd disconnect. Our job submit plugin adds new users to our accounting database when they first submit a job. After adding the user, we verify they were added by checking that job_desc.default_account is set. Lately, this check is failing, even though we can see from sacctmgr that the user was added. At the same time, we deleted a user using sacctmgr. When jobs were submitted by that user, Slurm behaved as though they were still in the accounting database. I'll attach the logs from when we saw this other error.

Best,
Steve

---

Created attachment 9474 [details]
Slurmctld log
Created attachment 9475 [details]
slurmdbd log
The disconnect disappears for a while when slurmctld is restarted.

---

Hi Steve,

Can you attach your slurm.conf? I noticed the following in the logs, but the first two are not as concerning to me as the timeouts:

> slurmdbd.log:[2019-03-07T13:40:05.410] error: unable to re-connect to as_mysql database
> slurmdbd.log:[2019-03-07T13:40:05.410] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
> slurmdbd.log:[2019-03-07T13:40:05.410] error: unable to re-connect to as_mysql database

Then on the server I see DNS resolution issues for a few nodes:

> [2019-03-07T03:28:04.996] error: Unable to resolve "csm-006": Unknown host
> [2019-03-07T03:28:04.997] error: _set_slurmd_addr: failure on csm-006

I also see many timeouts for send/recv operations:

> [2019-03-07T11:56:03.406] error: slurm_receive_msgs: Socket timed out on send/recv operation
> [2019-03-07T11:58:05.370] error: slurm_receive_msgs: Socket timed out on send/recv operation

Would you also let me know the number of jobs your site pushes through in a given day and how long the jobs run? It would also help to have a general idea of how large these jobs are and what type of workload they run.

---

Hello Jason,

The mysql connection errors around 13:40 are from when we restarted the database server. I'm not worried about the DNS errors; those are nodes that are decommissioned and pending removal from our configuration. DNS lookup fails because they were already removed from our DNS server.

We see around 60,000 jobs completing every day. Most are single-node jobs. I'll get back to you about how long they run. Could our workload size be affecting the ctld's ability to stay up to date with the dbd?

Best,
Steve

---

Steve,

> We see around 60,000 jobs completing every day. Most are single node jobs. I'll get back to you about how long they run.

60,000 is not that large of a workload, so there may be some tuning that needs to be done on your cluster. Would you attach your slurm.conf so that I can review your settings?

> Could our workload size be affecting the ctld's ability to stay up to date with the dbd?

The database should be able to cope with that workload, assuming the dbd has enough RAM and fast storage to handle your 60k jobs a day. We can look at the output of sdiag to get a better understanding of where the time is being spent, so please attach that output along with the slurm.conf.

My overall impression is that your slurmctld should not be losing associations; it is still unclear to me how you get into this state. Let's start by looking over your configuration and sdiag output and take steps to tune the server while we investigate this issue further.

-Jason

---

Created attachment 9563 [details]
slurm.conf
Created attachment 9564 [details]
sdiag output
Hi Steve,

Thanks for uploading your sdiag and slurm.conf. I reviewed both, and I would like to make a few suggestions for your next maintenance cycle. If you would not mind, please look these over and make a few changes. If the 'disconnect' happens again, send in slurmctld.log and slurmdbd.log after the new timeout options are in place.

> MessageTimeout=120

Time permitted for a round-trip communication to complete, in seconds. The default value is 10 seconds. For systems with shared nodes, the slurmd daemon could be paged out and necessitate higher values.

> SlurmdTimeout=<set high enough that all of the nodes are able to check in after you restart>

This will help avoid canceling jobs due to longer slurmd restarts. Given this log entry, SlurmdTimeout should be increased:

> [2019-03-07T13:13:18.022] error: Node ping apparently hung, many nodes may be DOWN or configured

> bf_resolution=600

No greater than 600 is recommended.

> TCPTimeout=10

You may need to tweak this value to see what works best for your site, but 10 is a good starting point. Most of our sites have a value set between 10 and 60.

You may also want to tune your dbd. Here are a few options to consider:

- MaxQueryTimeRange (300, but can be tuned depending on whether queries are a concern or not)
- MessageTimeout (sites range from 20-300, with 300 being the most common)

We also recommend a number of innodb_* variables on our accounting page: https://slurm.schedmd.com/accounting.html

---

We've set all these parameters, including the innodb settings, but still see a disconnect after the ctld has been running for a while. We're now seeing cases where changing MaxJobs for an association does not take effect until slurmctld is restarted. I'll attach the latest logs.

---

Created attachment 9747 [details]
Slurmctld log
Created attachment 9748 [details]
slurmdbd log
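Taken together, the timeout suggestions above amount to a configuration sketch like the following. The values are the starting points discussed in this ticket, not tuned recommendations, and SlurmdTimeout=600 is an illustrative placeholder:

```
# slurm.conf -- timeout-related settings discussed above
MessageTimeout=120                     # round-trip RPC timeout; default is 10s
SlurmdTimeout=600                      # illustrative; high enough for all slurmds to check in after a restart
TCPTimeout=10                          # starting point; sites commonly use 10-60
SchedulerParameters=bf_resolution=600  # no greater than 600 recommended

# slurmdbd.conf
MaxQueryTimeRange=300
MessageTimeout=300
```

Both daemons must be restarted (or reconfigured) for the new timeouts to take effect.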
Hi Steve,

Thank you for sending in the logs. I looked over them but was not able to determine a root cause yet. Something I will note in this ticket is that the slurmctld logs will have an entry like the following when you are at debug2:

> [2019-03-29T13:57:26.371] debug2: Processing RPC: ACCOUNTING_UPDATE_MSG from uid=1000

I do see the slurmdbd sending updates, but it is not clear whether they are coming through to the scheduler. For example:

```
grep "sending updates to msuhpcc at" slurmdbd.log
[2019-03-29T05:41:29.959] debug: sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T05:41:30.540] debug: sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T08:22:03.333] debug: sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T08:22:03.792] debug: sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T08:22:13.153] debug: sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
```

If you can raise your log level to debug2 on the scheduler and then update a limit, that may help me understand where the failure is:

1. "scontrol setdebug debug2"
2. Update your limit in sacctmgr
3. "scontrol setdebug <original debug level>"

-Jason

---

Hello Jason,

I've not yet been able to gather the logs at the increased log level while the error state is present; however, I noticed the following messages in the dbd log that might be relevant:

```
[2019-04-02T05:36:33.914] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
insert into "msuhpcc_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (14144648, UNIX_TIMESTAMP(), 14144648, 4294967294, 0, 4294967294, 98, 1, 885315, 2255, 'None assigned', 0, 239, 1553870627, 1553870627, 0, 'validate', 0, 0, 243087, 1, 0, 98304, 'hsu', 'general-short,general-long,hsu-16', '0x3E', 0, 5, '1=1,2=98304,4=1,5=14942', '/mnt/home/lellolou/cosmeticTraits/balding') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=98, id_user=885315, id_group=2255, nodelist='None assigned', id_resv=0, timelimit=239, time_submit=1553870627, time_eligible=1553870627, time_start=0, mod_time=UNIX_TIMESTAMP(), job_name='validate', track_steps=0, id_qos=1, state=greatest(state, 0), priority=243087, cpus_req=1, nodes_alloc=0, mem_req=98304, id_array_job=14144648, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, account='hsu', `partition`='general-short,general-long,hsu-16', array_task_str='0x3E', array_max_tasks=0, array_task_pending=5, tres_req='1=1,2=98304,4=1,5=14942', work_dir='/mnt/home/lellolou/cosmeticTraits/balding'
[2019-04-02T05:36:33.914] error: It looks like the storage has gone away trying to reconnect
[2019-04-03T05:54:07.529] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction update cluster_table set control_host='192.168.0.102', control_port=6817 where name='msuhpcc';
[2019-04-03T05:54:07.583] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction update cluster_table set control_host='192.168.0.102', control_port=6817 where name='msuhpcc';
[2019-04-03T05:54:07.673] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction update cluster_table set control_host='192.168.0.102', control_port=6817 where name='msuhpcc';
```

Also, is there a way to increase the log level for the dbd live, similar to 'scontrol setdebug'?

---

Steve,

You can change the log level in the slurmdbd and then run "sacctmgr reconfigure".

Regarding the deadlocks, can you tell me what version of MySQL/MariaDB you are running, and on what Linux distribution, e.g. CentOS 6/7? In another comment, you mentioned that the dbd is on the same server. Does the database have dedicated storage, and does the server have enough memory to deal with both the controller and the dbd?

-Jason

---

We are running MariaDB 5.5.56 on CentOS 7.5. /var hosts the database and is on a separate partition. The server has 32GB of memory and usually has ~13GB free.

I am increasing this to sev 3 since we're seeing this issue a lot when setting and releasing user holds.

---

Created attachment 9802 [details]
slurmdbd.log debug2
Created attachment 9803 [details]
slurmctld log debug2
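The debug-capture procedure Jason outlines above can be sketched as a shell sequence. The user name and limit are placeholders, and `info` stands in for whatever your original debug level was:

```
# Raise slurmctld logging so ACCOUNTING_UPDATE_MSG RPCs become visible
scontrol setdebug debug2

# Trigger a dbd -> ctld update by changing a limit (placeholder user/limit)
sacctmgr -i modify user where name=testuser set maxjobs=10

# Restore the previous log level (substitute your original level)
scontrol setdebug info
```

Running the limit change while the daemons are in the error state, then grepping both logs for the resulting RPC entries, shows whether the update ever reached the scheduler.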
Steve,

I am having a difficult time duplicating this behavior internally, and since you are the only site reporting it, I think we may be dealing with a database issue. I looked through the logs, and it is not entirely clear to me why there is a deadlock in MySQL or why there is a disconnect between slurmctld and the dbd.

Would you reproduce the problem and then run a MySQL "SHOW PROCESSLIST". Example:

```
mysql> SHOW PROCESSLIST;
+-----+------+-----------+--------------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id  | User | Host      | db           | Command | Time | State                        | Info                                                                                                 |
+-----+------+-----------+--------------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 355 | root | localhost | slurm_master | Sleep   | 1177 |                              | NULL                                                                                                 |
| 358 | root | localhost | slurm_master | Query   |   26 | Waiting for global read lock | update "cluster_step_table" set time_end=1554503658, state=3, kill_requid=-1, exit_code=0, user_sec= |
| 359 | root | localhost | NULL         | Query   |    0 | starting                     | SHOW PROCESSLIST                                                                                     |
+-----+------+-----------+--------------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
```

I would also be interested to know whether the connection from the slurmctld to the slurmdbd is still active on the head node when the problem occurs. Would you please also run the following command. The example below shows my slurmctld connected up to the slurmdbd:

```
sudo netstat -nap | grep slurm | grep ESTABLISHED
tcp   0   0 127.0.0.1:7920    127.0.0.1:59732   ESTABLISHED 20962/slurmdbd
tcp   0   0 127.0.0.1:59732   127.0.0.1:7920    ESTABLISHED 21876/slurmctld
```

These two commands should give me an idea of whether there is a MySQL lock or hang, and whether the connection is still active between the controller and dbd.

-Jason

---

```
MariaDB [(none)]> show processlist;
+-------+-------+-----------+-------+---------+------+-------+------------------+----------+
| Id    | User  | Host      | db    | Command | Time | State | Info             | Progress |
+-------+-------+-----------+-------+---------+------+-------+------------------+----------+
| 66871 | slurm | localhost | slurm | Sleep   |  131 |       | NULL             |    0.000 |
| 77810 | slurm | localhost | slurm | Sleep   |    0 |       | NULL             |    0.000 |
| 77893 | root  | localhost | NULL  | Query   |    0 | NULL  | show processlist |    0.000 |
+-------+-------+-----------+-------+---------+------+-------+------------------+----------+
3 rows in set (0.00 sec)
```

```
netstat -nap | grep slurm | grep ESTAB
tcp   0   0 192.168.0.102:6817    192.168.7.164:36108   ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:6819    192.168.0.102:42942   ESTABLISHED 13074/slurmdbd
tcp   0   0 192.168.0.102:6817    192.168.7.156:35004   ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:6817    192.168.7.161:52796   ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:42942   192.168.0.102:6819    ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:6817    192.168.7.155:47650   ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:6817    192.168.7.163:54252   ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:6817    192.168.7.162:39706   ESTABLISHED 29223/slurmctld
tcp   0   0 192.168.0.102:6817    192.168.7.160:34202   ESTABLISHED 29223/slurmctld
```

Jason,

If it helps, this is our current my.cnf:

```ini
[client]
port = 3306
socket = /var/lib/mysql/mysql.sock

[isamchk]
key_buffer_size = 16M

[mysqld]
innodb_max_dirty_pages_pct = 0
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
innodb_buffer_pool_size = 12G
innodb_flush_method = O_DIRECT
basedir = /usr
bind-address = 127.0.0.1
datadir = /var/lib/mysql
expire_logs_days = 10
join_buffer_size = 1M
key_buffer_size = 1G
log-error = /var/log/mariadb/mariadb.log
max_allowed_packet = 16M
max_binlog_size = 100M
max_connections = 151
pid-file = /var/run/mariadb/mariadb.pid
port = 3306
query_cache_type = 1
query_cache_limit = 4G
query_cache_size = 8G
skip-external-locking
socket = /var/lib/mysql/mysql.sock
ssl = false
ssl-ca = /etc/mysql/cacert.pem
ssl-cert = /etc/mysql/server-cert.pem
ssl-key = /etc/mysql/server-key.pem
table_open_cache = 4000
tmp_table_size = 1G
thread_cache_size = 8
thread_stack = 256K
tmpdir = /tmp
user = mysql
general-log
general-log-file=/var/log/mariadb/queries.log

[mysqld-5.0]
myisam-recover = BACKUP

[mysqld-5.1]
myisam-recover = BACKUP

[mysqld-5.5]
myisam-recover = BACKUP

[mysqld-5.6]
myisam-recover-options = BACKUP

[mysqld-5.7]
myisam-recover-options = BACKUP

[mysqld_safe]
log-error = /var/log/mariadb/mariadb.log
nice = 0
socket = /var/lib/mysql/mysql.sock

[mysqldump]
max_allowed_packet = 16M
quick
quote-names
```

---

Hi Steve,
The output you sent is a bit perplexing.
I would expect to see only one slurmctld connection and not 8.
netstat -nap | grep slurm | grep ESTAB
tcp 0 0 192.168.0.102:6817 192.168.7.164:36108 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:6819 192.168.0.102:42942 ESTABLISHED 13074/slurmdbd
tcp 0 0 192.168.0.102:6817 192.168.7.156:35004 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:6817 192.168.7.161:52796 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:42942 192.168.0.102:6819 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:6817 192.168.7.155:47650 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:6817 192.168.7.163:54252 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:6817 192.168.7.162:39706 ESTABLISHED 29223/slurmctld
tcp 0 0 192.168.0.102:6817 192.168.7.160:34202 ESTABLISHED 29223/slurmctld
If you restart mysqld, does the netstat output then show one or two slurmctld reconnects? The processlist you attached definitely does not show these eight other persistent connections, so I suspect something broke down in the communication between the two services at some point.
Would you also check to see if you have multiple slurmctlds running?
ps aux | grep slurmctld
I would also be curious to see the output of the following:
ls -l /proc/`pgrep slurmctld`/fd
> If it helps, this is our current my.conf:
Thank you. I will run some tests with a modified version of this config and see if I can duplicate this behavior.
-Jason
Hi Steve,
I set up a system similar to the one you mentioned in this ticket, with CentOS 7.5 and MariaDB 5.5, but I did not see the same behavior. So, in addition to my last email, would you watch the system to see whether those ESTABLISHED slurmctld connections to the dbd eventually clean up? I would also be interested to know whether restarting the slurmctld leaves these stale connections around.
Here is an example of what I would expect to see on the system (trimmed output).
ss -pl -t -a state established
Recv-Q Send-Q Local Address:Port Peer Address:Port
0 0 127.0.0.1:7920 127.0.0.1:39324 users:(("slurmdbd",pid=3635,fd=8))
0 0 127.0.0.1:39324 127.0.0.1:7920 users:(("slurmctld",pid=3659,fd=5))
-Jason
Based on the IP addresses of these other connections, I suspect these are nodes connecting to slurmctld. I only see one connection between slurmdbd and slurmctld using port 6819:

```
tcp   0   0 192.168.0.102:6819    192.168.0.102:42942   ESTABLISHED 13074/slurmdbd
tcp   0   0 192.168.0.102:42942   192.168.0.102:6819    ESTABLISHED 29223/slurmctld
```

There is only one slurmctld process running on this machine:

```
# ps aux | grep slurmctld | grep -v grep
slurm 33542 75.1  6.0 2845048 1987700 ?  Sl  Apr11  872:36 /usr/sbin/slurmctld
```

Here are slurmctld's open file descriptors:

```
# ls -l /proc/$(pgrep slurmctld)/fd
total 0
lrwx------ 1 slurm slurm 64 Apr 12 09:46 0 -> /dev/null
lrwx------ 1 slurm slurm 64 Apr 12 09:46 1 -> /dev/null
lrwx------ 1 slurm slurm 64 Apr 12 09:46 10 -> socket:[1723279776]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 2 -> /dev/null
l-wx------ 1 slurm slurm 64 Apr 12 09:46 3 -> /run/slurmctld.pid
lrwx------ 1 slurm slurm 64 Apr 12 09:46 4 -> socket:[1707919994]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 5 -> socket:[1723273930]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 6 -> socket:[1723276055]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 7 -> socket:[1681556143]
l-wx------ 1 slurm slurm 64 Apr 12 09:46 8 -> /var/log/slurm/slurmctld.log
lrwx------ 1 slurm slurm 64 Apr 12 09:46 9 -> socket:[1723277284]
```

Jason,

For what it's worth, I see network traffic going back and forth between the slurmctld and slurmdbd (tcpdump -nn -i lo src host 192.168.0.102 and dst host 192.168.0.102 and port 6819) while this disconnect is present.

I forced a core dump of slurmctld during the disconnect. Maybe it can help?

Thanks,
Steve

---

Created attachment 9909 [details]
Slurmctld Core Dump
Hi Steve,

I am looking into adding some additional logging. In the meantime, would you send me a backtrace from the core you generated? Core files are rarely useful to us, since our system libraries differ from your site's:

```
$ gdb slurmctld core.53827
(gdb) thread apply all bt
```

-Jason

---

Jason,

Here is the backtrace:

```
Thread 2 (Thread 0x7fe8eb1f8700 (LWP 53841)):
#0  0x00007fe8f2740d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe8eb1fcfa8 in _my_sleep (usec=60000000) at backfill.c:597
#2  0x00007fe8eb203aeb in backfill_agent (args=<optimized out>) at backfill.c:956
#3  0x00007fe8f273ce25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe8f2466bad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe8f3131740 (LWP 53827)):
#0  0x00007fe8f239e277 in raise () from /lib64/libc.so.6
#1  0x00007fe8f239f968 in abort () from /lib64/libc.so.6
#2  0x000000000042d8e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:926
```

I don't think this backtrace itself is significant, since I forced the core dump with 'kill -6'. Let me know if you need any other information from the core dump.

Best,
Steve

---

Created attachment 10007 [details]
Logging patch
Hi Steve,
Thank you for your patience while I have been investigating this issue. After examining the source, I know the following:
1) There is a persistent connection between the ctld and your dbd which does not seem to get interrupted at all unless you restart the ctld or the dbd.
2) Although I cannot see the root cause, I think we will be able to pinpoint where the failure is happening by changing the log level in "_slurm_rpc_accounting_update_msg" to info when logging the dbd updates about newly created users.
From:
debug2("Processing RPC: ACCOUNTING_UPDATE_MSG from uid=%d", uid);
To:
info("Processing RPC: ACCOUNTING_UPDATE_MSG from uid=%d", uid);
If there were any errors, then we would see entries like the following but we never see this in your logging.
error("assoc_mgr_update gave error: %s",
slurm_strerror(rc));
I have also added debug logging that prints the type of RPC update for that user.
For example:
1 == create
2 == delete
[2019-04-23T16:25:33.139] slurmdbd: update type: 1 user object type: ghost
[2019-04-23T16:25:44.398] slurmdbd: update type: 6 user object type: ghost
This patch will appear shortly on the ticket.
This patch should hopefully tell me whether these updates are coming through and whether the ctld is processing them. Once the problem happens again, try to create and delete a test user, then gather the logs and send them in. You are also welcome to look for the response messages as well:
You should see:
"Processing RPC: ACCOUNTING_UPDATE_MSG from" ...
And
"slurmdbd: update type: 1 user object type:" ...
"update type: 6 user object type:" ...
-Jason
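A quick way to check for the messages above is to grep both logs after reproducing. The sample lines in this demo are copied from this ticket; on a real system you would grep the actual slurmctld.log and slurmdbd.log paths from your configuration instead:

```shell
# Write two sample slurmdbd entries (taken from this ticket) to a scratch file
cat <<'EOF' > /tmp/slurmdbd.sample.log
[2019-04-23T16:25:33.139] slurmdbd: update type: 1 user object type: ghost
[2019-04-23T16:25:44.398] slurmdbd: update type: 6 user object type: ghost
EOF

# Count how many update messages the dbd logged; on a healthy system each
# should be matched by a "Processing RPC: ACCOUNTING_UPDATE_MSG" in slurmctld.log
grep -c "update type" /tmp/slurmdbd.sample.log
```

If the dbd side shows "update type" entries but the ctld side never logs the matching RPC, the update was lost in transit, which is exactly the failure mode this patch is meant to expose.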
We applied this patch yesterday and had the disconnect this morning. When making changes with sacctmgr, I see the "update type" messages in the slurmdbd logs, but not the "Processing RPC" messages in the slurmctld logs. After restarting slurmctld, I see both messages when I make changes.

---

Steve,

If you still have those logs, would you also check for an entry similar to the following just after the "slurmdbd: update type" message:

> [2019-04-30T13:04:29.654] debug: sending updates to cluster at 127.0.0.1(8817) ver 8704

If you are running with a log level lower than debug, I can send over another debug patch which lowers that entry to info. Your comment "...but not the 'Processing RPC' messages in the slurmctld logs" at least confirms that the ctld received no such packet from the dbd. I think we are losing the packet somehow between the dbd and the ctld.

Side questions:

1) Is selinux enabled on this system?
2) Do you use a Linux antivirus program on this server?
3) Any tools that modify iptables throughout the day?

-Jason

---

Jason,

I will increase the log level to debug for now.

> 1) Is selinux enabled on this system?

No

> 2) Do you use a Linux antivirus program on this server?

No

> 3) Any tools that modify iptables throughout the day?

No

I'll update you when we see the issue again.

Thanks,
Steve

---

We saw the issue again this morning. I see messages in the dbd logs complaining about loss of connection at 6:18am.

---

Created attachment 10086 [details]
slurmdbd log 05/01
Hi Steve,

Can you send the output of 'dmesg' and the output of 'ip route get 192.168.0.102'?

-Jason

---

# ip route get 192.168.0.102
local 192.168.0.102 dev lo src 192.168.0.102
cache <local>
Created attachment 10103 [details]
dmesg
Hi Steve,

Thank you for sending in that output. All of the information I have seen so far suggests that the dbd/ctld disconnect is a symptom of other things happening on the system. As shown below, it looks like the OOM killer is invoked several times:

> [Wed May 1 12:42:48 2019] Out of memory: Kill process 25949 (mysqld) score 398 or sacrifice child
> [Wed May 1 12:42:48 2019] Killed process 25949 (mysqld) total-vm:24656840kB, anon-rss:13336180kB, file-rss:0kB, shmem-rss:0kB

This is a non-cgroup OOM, and that node should be rebooted. Not related, but this fills dmesg with a lot of errors:

> blk_update_request: I/O error, dev fd0, sector 0

It looks like this node has a broken floppy drive. It is also concerning that OOM is taking down the slurmdbd:

> [Mon Apr 15 14:18:34 2019] Out of memory: Kill process 26694 (slurmdbd) score 479 or sacrifice child
> [Mon Apr 15 14:18:34 2019] Killed process 26694 (slurmdbd) total-vm:17707696kB, anon-rss:15894836kB, file-rss:0kB, shmem-rss:0kB

Here are my suggestions for this system:

1) Short term: increase memory.
2) Long term: look at moving the dbd off of the same system as the ctld and onto its own system.

-Jason

---

Jason,

I'll increase the available memory on the box. I'll let you know if we see the disconnect again.

Thanks,
Steve

---

We are still seeing the disconnect. I decreased the amount of memory in use by MariaDB and disabled memory overcommit on the server. This server now has plenty of free memory, and no additional OOM events have occurred.

---

Jason,

We have not seen a disconnect since the one we saw on Friday. Since we were previously seeing it every day, it looks like the memory tuning helped. I'll let you know if it happens again.

Best,
Steve

---

Steve - Thanks for the update, and I will wait to hear more from you. Also, that node looks like it could use a reboot at some point.

---

Hello Jason,

It happened again. For the three days when the disconnect did not occur, our database backups were not running. Once we fixed the backups, the disconnect came back, so it seems that our database backups are triggering this issue. It currently takes four hours to back up our database, which totals 23GB. Do you have any recommendations on how we can make our backup process less disruptive to the ctld/dbd connection?

Thanks,
Steve

---

Steve,

What command do you use to back up the database? I am curious to know whether you use --lock-tables=false.

https://stackoverflow.com/questions/104612/run-mysqldump-without-locking-tables

Also, do you have any procedure such as stopping the slurmdbd before running the backup, and do you do any replication?

---

Jason,

We are dumping the database with `mysqldump slurm`. Judging from the link you shared, I think we need to add '--single-transaction' to prevent the database dump from locking tables and to get a consistent database state. We'll add this option and see if the issue recurs. We do not currently stop the slurmdbd daemon when we do these dumps, nor do we have MySQL replication configured.

Thanks,
Steve

---

Hi Steve - just following up with you on this case to see if those changes made a difference for your site.

---

Hello Jason,

Thank you for following up. Since we added the '--single-transaction' option, the disconnect issue has not resurfaced. Thanks for all your help.

Best,
Steve

---

Glad that the issue has been resolved. Resolving this out.
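The fix the reporter settled on can be written out as a backup command. This is a sketch based on this ticket: the database name `slurm` comes from Steve's comment, while the output path is illustrative:

```
# Dump the accounting DB inside a single transaction so InnoDB tables are
# read from a consistent snapshot instead of being locked for the whole dump.
mysqldump --single-transaction slurm > /backup/slurm_$(date +%F).sql
```

--single-transaction avoids the global read lock that a plain mysqldump takes (visible earlier in this ticket as "Waiting for global read lock" in SHOW PROCESSLIST), which is what was stalling slurmdbd writes during the four-hour backup window.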