Ticket 6603

Summary: "Invalid account or partition", but account and partition are present
Product: Slurm
Reporter: Steve Ford <fordste5>
Component: slurmctld
Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
CC: nate
Version: 18.08.5
Hardware: Linux
OS: Linux
Site: MSU
Attachments: Slurmctld log
slurmdbd log
slurm.conf
sdiag output
Slurmctld log
slurmdbd log
slurmdbd.log debug2
slurmctld log debug2
Slurmctld Core Dump
Logging patch
slurmdbd log 05/01
dmesg

Description Steve Ford 2019-02-27 13:00:16 MST
Hello SchedMD,

One of our users was getting the following error when submitting jobs using the 'merzjrke' account:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified


The SLURM server logs showed the problem was the 'general-short' partition and the 'merzjrke' account:

[2019-02-27T12:45:54.589] _job_create: invalid account or partition for user 885046, account 'merzjrke', and partition 'general-short'


What's puzzling is that both this account and the partition exist:

$ sacctmgr show account merzjrke
   Account                Descr                  Org 
---------- -------------------- -------------------- 
  merzjrke             merzjrke                buyin 

$ scontrol show partition general-short
PartitionName=general-short
   AllowGroups=ALL AllowAccounts=ALL AllowQos=normal,admin,debug,plzbuyin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:01:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cne-000,csm-[001-022],csn-[001-039],csp-[006-007,016-027],css-[001-024,026-085,087-103,106-127],lac-[000-225,228-247,250-261,276-277,286-369,372,374-445],nvl-[000-006],qml-[001-005],skl-[000-167],vim-[000-002]
   PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=23156 TotalNodes=799 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=750 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=0.152


What's more puzzling is that, after restarting the SLURM server, this user can submit jobs without issue. Have you seen this behavior before?

Thanks,
Steve
Comment 1 Jason Booth 2019-02-27 14:57:46 MST
Hi Steve,

 Would you attach your slurmctld.log and slurmdbd.log? I suspect that some changes made to the slurmdbd were not propagated to the slurmctld. I will look over the logs to see if I notice anything that may clue us in to the cause. Would you also answer these questions for me:

Does the slurmdbd sit on a different system than the slurmctld, and are there any firewalls between the two?

Is this a fairly new account credential that was added?

Do you use partition names in your associations?

e.g.

sacctmgr show assoc
Comment 2 Steve Ford 2019-03-07 12:08:38 MST
Hello Jason,

Our slurmctld and slurmdbd are on the same server.

This was not a new user or account.

We do not use partitions in our associations.

Unfortunately the logs from that time are no longer available. However, we are seeing another issue that looks like a slurmctld/slurmdbd disconnect.

Our job submit plugin adds new users to our accounting database when they first submit a job. After adding the user, we verify they were added by checking that job_desc.default_account is set. Lately, this check is failing, even though we can see from sacctmgr that the user was added.

At the same time, we deleted a user using sacctmgr. When jobs were submitted by that user, SLURM behaved as though they were still in the accounting database. 

I'll attach the logs from when we saw this other error.

Best,
Steve
Comment 3 Steve Ford 2019-03-07 12:14:31 MST
Created attachment 9474 [details]
Slurmctld log
Comment 4 Steve Ford 2019-03-07 12:15:11 MST
Created attachment 9475 [details]
slurmdbd log
Comment 5 Steve Ford 2019-03-07 12:19:24 MST
The disconnect disappears for a while when slurmctld is restarted.
Comment 7 Jason Booth 2019-03-08 12:01:16 MST
Hi Steve,

 Can you attach your slurm.conf? I did notice the following from the logs, but the first two entries are not as concerning to me as the timeouts.

>slurmdbd.log:[2019-03-07T13:40:05.410] error: unable to re-connect to as_mysql database
>slurmdbd.log:[2019-03-07T13:40:05.410] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
>slurmdbd.log:[2019-03-07T13:40:05.410] error: unable to re-connect to as_mysql database

Then on the server I see dns resolve issues for a few nodes:
>[2019-03-07T03:28:04.996] error: Unable to resolve "csm-006": Unknown host
>[2019-03-07T03:28:04.997] error: _set_slurmd_addr: failure on csm-006

I also see many timeouts for send/recv operation.

>[2019-03-07T11:56:03.406] error: slurm_receive_msgs: Socket timed out on send/recv operation
>[2019-03-07T11:58:05.370] error: slurm_receive_msgs: Socket timed out on send/recv operation

Would you also let me know the number of jobs your site pushes through in a given day and how long jobs run for? It would also be good to give me a general idea of how large these jobs are and what type of workload they run.
Comment 8 Steve Ford 2019-03-13 13:12:35 MDT
Hello Jason,

The mysql connection errors around 13:40 are from when we restarted the database server. 

I'm not worried about the DNS errors. These are nodes that are decommissioned and pending removal from our configuration. DNS lookup fails because they were removed from our DNS server already.

We see around 60,000 jobs completing every day. Most are single node jobs. I'll get back to you about how long they run.

Could our workload size be affecting the ctld's ability to stay up to date with the dbd?

Best,
Steve
Comment 9 Jason Booth 2019-03-13 13:32:51 MDT
Steve,

> We see around 60,000 jobs completing every day. Most are single node jobs. I'll get back to you about how long they run.

60,000 is not that large a workload, so there may be some tuning that needs to be done on your cluster. Would you attach your slurm.conf so that I can review your settings?

> Could our workload size be affecting the ctld's ability to stay up to date with the dbd?

The database should be able to cope with that workload, assuming the dbd has enough RAM and fast storage for your 60k jobs a day. We can look at the output of sdiag to get a better understanding of where the time is being spent, so please attach that output along with the slurm.conf.

My impressions overall of this issue are that your slurmctld should not be losing associations. It is still unclear to me how you get into this state. Let's start with looking over your configuration and sdiag output and take steps to tune the server while we investigate this issue further.

-Jason
Comment 10 Steve Ford 2019-03-13 13:36:09 MDT
Created attachment 9563 [details]
slurm.conf
Comment 11 Steve Ford 2019-03-13 13:36:57 MDT
Created attachment 9564 [details]
sdiag output
Comment 12 Jason Booth 2019-03-13 16:00:08 MDT
Hi Steve,

 Thanks for uploading your sdiag output and slurm.conf. I reviewed both, and I would like to make a few suggestions for your next maintenance cycle. If you would not mind, please look these over and make a few changes. If the 'disconnect' happens again, send in slurmctld.log and slurmdbd.log after the new timeout options are in place.


> MessageTimeout=120
Time permitted for a round-trip communication to complete in seconds. The default value is 10 seconds. For systems with shared nodes, the slurmd daemon could be paged out and necessitate higher values.

> SlurmdTimeout= <Set high enough so that all of the nodes are able to check in after you restart. This will help avoid canceling jobs due to longer slurmd restarts>
> [2019-03-07T13:13:18.022] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased


> bf_resolution=600
 No greater than 600 is recommended.

> TCPTimeout=10
You may need to tweak this value to see what works best for your site, but 10 is a good starting point. Most of our sites have a value set between 10 and 60.


You may also want to tune your dbd as well. Here are a few options to consider:

MaxQueryTimeRange (300 but can be tuned depending on if queries are a concern or not)
MessageTimeout (sites range from 20-300 with the most common being 300)

We also have a number of innodb_* variables on our accounting page that are recommended.

https://slurm.schedmd.com/accounting.html
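
Collected into a slurm.conf fragment, the controller-side suggestions above might look like the following. The values are the starting points discussed here, not prescriptions; SlurmdTimeout in particular should be sized to your own restart window, and SchedulerParameters may carry other site-specific options.

```
# slurm.conf -- starting points from the suggestions above
MessageTimeout=120                     # round-trip message timeout; default is 10s
SlurmdTimeout=600                      # illustrative; high enough for all nodes to check in after a restart
TCPTimeout=10                          # common site values range from 10 to 60
SchedulerParameters=bf_resolution=600  # no greater than 600 is recommended
```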
Comment 13 Steve Ford 2019-03-29 08:10:01 MDT
We've set all these parameters, including the innodb settings, but we still see a disconnect after the ctld has been running for a while. We're now seeing cases where changing MaxJobs for an association does not take effect until slurmctld is restarted. I'll attach the latest logs.
Comment 14 Steve Ford 2019-03-29 08:11:10 MDT
Created attachment 9747 [details]
Slurmctld log
Comment 15 Steve Ford 2019-03-29 08:12:10 MDT
Created attachment 9748 [details]
slurmdbd log
Comment 16 Jason Booth 2019-03-29 14:05:51 MDT
Hi Steve,

 Thank you for sending in the logs. I looked over them but have not been able to determine a root cause yet. Something I will note in this ticket is that the slurmctld logs will have an entry like the following when you are at debug2.

[2019-03-29T13:57:26.371] debug2: Processing RPC: ACCOUNTING_UPDATE_MSG from uid=1000

I do see the slurmdbd sending updates but it is not clear if they are coming through to the scheduler.

For example:


grep "sending updates to msuhpcc at" slurmdbd.log
[2019-03-29T05:41:29.959] debug:  sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T05:41:30.540] debug:  sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T08:22:03.333] debug:  sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T08:22:03.792] debug:  sending updates to msuhpcc at 192.168.0.102(6817) ver 8448
[2019-03-29T08:22:13.153] debug:  sending updates to msuhpcc at 192.168.0.102(6817) ver 8448


If you can raise your log level to debug2 on the scheduler and then update a limit, this may help me understand where the failure is.

"scontrol setdebug debug2"
Update your limit in sacctmgr
"scontrol setdebug <original debug level>"
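
Those three steps, as a consolidated shell sketch (the account name and limit are placeholders, and the commands require a live cluster, so treat this as illustrative):

```
# raise the controller log level, trigger an accounting update, then restore it
scontrol setdebug debug2
sacctmgr -i modify account example_acct set MaxJobs=100   # placeholder account/limit
# wait for the update to reach slurmctld, then restore the original level
scontrol setdebug info
```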

-Jason
Comment 17 Steve Ford 2019-04-03 10:44:35 MDT
Hello Jason,

I've not yet been able to gather the logs at the increased log level while the error state is present. However, I noticed the following messages in the dbd log that might be relevant:

[2019-04-02T05:36:33.914] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
insert into "msuhpcc_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (14144648, UNIX_TIMESTAMP(), 14144648, 4294967294, 0, 4294967294, 98, 1, 885315, 2255, 'None assigned', 0, 239, 1553870627, 1553870627, 0, 'validate', 0, 0, 243087, 1, 0, 98304, 'hsu', 'general-short,general-long,hsu-16', '0x3E', 0, 5, '1=1,2=98304,4=1,5=14942', '/mnt/home/lellolou/cosmeticTraits/balding') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=98, id_user=885315, id_group=2255, nodelist='None assigned', id_resv=0, timelimit=239, time_submit=1553870627, time_eligible=1553870627, time_start=0, mod_time=UNIX_TIMESTAMP(), job_name='validate', track_steps=0, id_qos=1, state=greatest(state, 0), priority=243087, cpus_req=1, nodes_alloc=0, mem_req=98304, id_array_job=14144648, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, account='hsu', `partition`='general-short,general-long,hsu-16', array_task_str='0x3E', array_max_tasks=0, array_task_pending=5, tres_req='1=1,2=98304,4=1,5=14942', work_dir='/mnt/home/lellolou/cosmeticTraits/balding'
[2019-04-02T05:36:33.914] error: It looks like the storage has gone away trying to reconnect
[2019-04-03T05:54:07.529] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
update cluster_table set control_host='192.168.0.102', control_port=6817 where name='msuhpcc';
[2019-04-03T05:54:07.583] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
update cluster_table set control_host='192.168.0.102', control_port=6817 where name='msuhpcc';
[2019-04-03T05:54:07.673] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction
update cluster_table set control_host='192.168.0.102', control_port=6817 where name='msuhpcc';
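
The "try restarting transaction" hint in that error is actionable from the client side: re-run the deadlocked transaction. A minimal, generic sketch of that pattern (DeadlockError and the callables here are illustrative stand-ins, not slurmdbd code):

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for MySQL error 1213 (deadlock found when trying to get lock)."""

def run_with_retry(txn, attempts=3, base_delay=0.0):
    """Re-run a transaction callable when it deadlocks, as the MySQL error suggests."""
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts - 1:
                raise
            # small randomized backoff so competing transactions interleave differently
            time.sleep(base_delay + random.uniform(0, 0.01))

# demo: a transaction that deadlocks twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky_update():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("1213 Deadlock found when trying to get lock")
    return "committed"

print(run_with_retry(flaky_update))  # -> committed
```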


Also, is there a way to increase the log level for the dbd live, similar to 'scontrol setdebug'?
Comment 18 Jason Booth 2019-04-03 16:00:40 MDT
Steve,

 You can change the loglevel in the slurmdbd and then run "sacctmgr reconfigure".

In regards to the deadlocks, can you tell me what version of MySQL/MariaDB you are running and on what Linux distribution e.g. cent6/7? 

In another comment, you mentioned that the dbd is on the same server. Does the database have dedicated storage and does the server have enough memory to deal with the controller and dbd?

-Jason
Comment 19 Steve Ford 2019-04-05 11:18:00 MDT
We are running mariadb 5.5.56 on CentOS 7.5. /var hosts the database and is on a separate partition.

The server has 32GB of memory and usually has ~13GB free.
Comment 20 Steve Ford 2019-04-05 11:27:33 MDT
I am increasing this to sev 3 since we're seeing this issue a lot when setting and releasing user holds.
Comment 21 Steve Ford 2019-04-05 11:28:30 MDT
Created attachment 9802 [details]
slurmdbd.log debug2
Comment 22 Steve Ford 2019-04-05 11:29:22 MDT
Created attachment 9803 [details]
slurmctld log debug2
Comment 23 Jason Booth 2019-04-05 16:51:06 MDT
Steve,

I am having a difficult time duplicating this behavior internally, and since you are the only site reporting this, I think we may be dealing with a database issue.

I looked through the logs, and it is not entirely clear to me why there is a deadlock in MySQL or why there is a disconnect between the slurmctld and the dbd. Would you reproduce the issue and then run a MySQL "SHOW PROCESSLIST"?

Example:

mysql> SHOW PROCESSLIST;
+-----+------+-----------+--------------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id  | User | Host      | db           | Command | Time | State                        | Info                                                                                                 |
+-----+------+-----------+--------------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 355 | root | localhost | slurm_master | Sleep   | 1177 |                              | NULL                                                                                                 |
| 358 | root | localhost | slurm_master | Query   |   26 | Waiting for global read lock | update "cluster_step_table" set time_end=1554503658, state=3, kill_requid=-1, exit_code=0, user_sec= |
| 359 | root | localhost | NULL         | Query   |    0 | starting                     | SHOW PROCESSLIST                                                                                     |
+-----+------+-----------+--------------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)



I would also be interested to know if the connection from the slurmctld to the slurmdbd is still active on the head node when the problem occurs.

Would you please also run the following command? An example is below, which shows my slurmctld connected to the slurmdbd.

sudo netstat -nap | grep slurm | grep ESTABLISHED
tcp        0      0 127.0.0.1:7920          127.0.0.1:59732         ESTABLISHED 20962/slurmdbd      
tcp        0      0 127.0.0.1:59732         127.0.0.1:7920          ESTABLISHED 21876/slurmctld  


These two commands should give me an idea if there is a MySQL lock or hang and if the connection is still active between the controller and dbd.

-Jason
Comment 24 Steve Ford 2019-04-08 07:04:51 MDT
MariaDB [(none)]> show processlist;
+-------+-------+-----------+-------+---------+------+-------+------------------+----------+
| Id    | User  | Host      | db    | Command | Time | State | Info             | Progress |
+-------+-------+-----------+-------+---------+------+-------+------------------+----------+
| 66871 | slurm | localhost | slurm | Sleep   |  131 |       | NULL             |    0.000 |
| 77810 | slurm | localhost | slurm | Sleep   |    0 |       | NULL             |    0.000 |
| 77893 | root  | localhost | NULL  | Query   |    0 | NULL  | show processlist |    0.000 |
+-------+-------+-----------+-------+---------+------+-------+------------------+----------+
3 rows in set (0.00 sec)

netstat -nap | grep slurm | grep ESTAB
tcp        0      0 192.168.0.102:6817      192.168.7.164:36108     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6819      192.168.0.102:42942     ESTABLISHED 13074/slurmdbd      
tcp        0      0 192.168.0.102:6817      192.168.7.156:35004     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.161:52796     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:42942     192.168.0.102:6819      ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.155:47650     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.163:54252     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.162:39706     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.160:34202     ESTABLISHED 29223/slurmctld
Comment 25 Steve Ford 2019-04-11 12:33:05 MDT
Jason,

If it helps, this is our current my.cnf:

[client]
port = 3306
socket = /var/lib/mysql/mysql.sock

[isamchk]
key_buffer_size = 16M

[mysqld]
innodb_max_dirty_pages_pct = 0
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
innodb_buffer_pool_size = 12G
innodb_flush_method = O_DIRECT
basedir = /usr
bind-address = 127.0.0.1
datadir = /var/lib/mysql
expire_logs_days = 10
join_buffer_size = 1M
key_buffer_size = 1G
log-error = /var/log/mariadb/mariadb.log
max_allowed_packet = 16M
max_binlog_size = 100M
max_connections = 151
pid-file = /var/run/mariadb/mariadb.pid
port = 3306
query_cache_type = 1
query_cache_limit = 4G
query_cache_size = 8G
skip-external-locking
socket = /var/lib/mysql/mysql.sock
ssl = false
ssl-ca = /etc/mysql/cacert.pem
ssl-cert = /etc/mysql/server-cert.pem
ssl-key = /etc/mysql/server-key.pem
table_open_cache = 4000
tmp_table_size = 1G
thread_cache_size = 8
thread_stack = 256K
tmpdir = /tmp
user = mysql

general-log
general-log-file=/var/log/mariadb/queries.log

[mysqld-5.0]
myisam-recover = BACKUP

[mysqld-5.1]
myisam-recover = BACKUP

[mysqld-5.5]
myisam-recover = BACKUP

[mysqld-5.6]
myisam-recover-options = BACKUP

[mysqld-5.7]
myisam-recover-options = BACKUP

[mysqld_safe]
log-error = /var/log/mariadb/mariadb.log
nice = 0
socket = /var/lib/mysql/mysql.sock

[mysqldump]
max_allowed_packet = 16M
quick
quote-names
Comment 26 Jason Booth 2019-04-11 14:47:00 MDT
Hi Steve,

The output you sent is a bit perplexing.

I would expect to see only one slurmctld connection and not 8. 

netstat -nap | grep slurm | grep ESTAB
tcp        0      0 192.168.0.102:6817      192.168.7.164:36108     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6819      192.168.0.102:42942     ESTABLISHED 13074/slurmdbd      
tcp        0      0 192.168.0.102:6817      192.168.7.156:35004     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.161:52796     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:42942     192.168.0.102:6819      ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.155:47650     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.163:54252     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.162:39706     ESTABLISHED 29223/slurmctld     
tcp        0      0 192.168.0.102:6817      192.168.7.160:34202     ESTABLISHED 29223/slurmctld


If you restart mysqld, does it show one or two slurmctld reconnects in the netstat output? The processlist you attached definitely does not show these eight other persistent connections, so I suspect something broke down in the communications between the two services at some point.

Would you also check to see if you have multiple slurmctlds running?

ps aux | grep slurmctld

I would also be curious to see the output of the following:
ls -l /proc/`pgrep slurmctld`/fd


> If it helps, this is our current my.cnf:

 Thank you. I will run some tests with a modified version of this config and see if I can duplicate this behavior.

-Jason
Comment 27 Jason Booth 2019-04-11 16:49:11 MDT
Hi Steve,

 I set up a system similar to the one you mentioned in this ticket, with CentOS 7.5 and mariadb 5.5, but I did not see the same behavior. So, in addition to my last email, would you watch the system to see if those ESTABLISHED slurmctld connections to the dbd eventually clean up? I would also be interested to know whether restarting the slurmctld leaves these stale connections around.

Here is an example of what I would expect to see on the system (trimmed output).

ss -pl -t -a state established 
Recv-Q Send-Q                       Local Address:Port                                        Peer Address:Port                
0      0                                127.0.0.1:7920                                           127.0.0.1:39324                 users:(("slurmdbd",pid=3635,fd=8))                
0      0                                127.0.0.1:39324                                          127.0.0.1:7920                  users:(("slurmctld",pid=3659,fd=5))

-Jason
Comment 28 Steve Ford 2019-04-12 08:12:23 MDT
Based on the IP addresses of these other connections, I suspect these are nodes connecting to slurmctld. I only see one connection between slurmdbd and slurmctld using port 6819:
   
tcp        0      0 192.168.0.102:6819      192.168.0.102:42942     ESTABLISHED 13074/slurmdbd          
tcp        0      0 192.168.0.102:42942     192.168.0.102:6819      ESTABLISHED 29223/slurmctld


There is only one slurmctld process running on this machine:
#ps aux | grep slurmctld | grep -v grep
slurm    33542 75.1  6.0 2845048 1987700 ?     Sl   Apr11 872:36 /usr/sbin/slurmctld


Here are slurmctld's open file descriptors:
# ls -l /proc/$(pgrep slurmctld)/fd
total 0
lrwx------ 1 slurm slurm 64 Apr 12 09:46 0 -> /dev/null
lrwx------ 1 slurm slurm 64 Apr 12 09:46 1 -> /dev/null
lrwx------ 1 slurm slurm 64 Apr 12 09:46 10 -> socket:[1723279776]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 2 -> /dev/null
l-wx------ 1 slurm slurm 64 Apr 12 09:46 3 -> /run/slurmctld.pid
lrwx------ 1 slurm slurm 64 Apr 12 09:46 4 -> socket:[1707919994]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 5 -> socket:[1723273930]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 6 -> socket:[1723276055]
lrwx------ 1 slurm slurm 64 Apr 12 09:46 7 -> socket:[1681556143]
l-wx------ 1 slurm slurm 64 Apr 12 09:46 8 -> /var/log/slurm/slurmctld.log
lrwx------ 1 slurm slurm 64 Apr 12 09:46 9 -> socket:[1723277284]
Comment 29 Steve Ford 2019-04-15 06:59:51 MDT
Jason,

For what it's worth, I see network traffic going back and forth between the slurmctld and slurmdbd (tcpdump -nn -i lo src host 192.168.0.102 and dst host 192.168.0.102 and port 6819) while this disconnect is present.

I forced a core dump of slurmctld during the disconnect. Maybe it can help?

Thanks,
Steve
Comment 30 Steve Ford 2019-04-15 07:00:56 MDT
Created attachment 9909 [details]
Slurmctld Core Dump
Comment 31 Jason Booth 2019-04-15 10:35:26 MDT
Hi Steve,

I am looking into adding some additional logging. In the meantime, would you send me a backtrace from the core you generated? Core files are rarely useful to us, since our system libraries differ from your site's.

$ gdb slurmctld core.53827

(gdb) thread apply all bt

-Jason
Comment 32 Steve Ford 2019-04-15 10:45:25 MDT
Jason,

Here is the backtrace:

Thread 2 (Thread 0x7fe8eb1f8700 (LWP 53841)):
#0  0x00007fe8f2740d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe8eb1fcfa8 in _my_sleep (usec=60000000) at backfill.c:597
#2  0x00007fe8eb203aeb in backfill_agent (args=<optimized out>) at backfill.c:956
#3  0x00007fe8f273ce25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe8f2466bad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe8f3131740 (LWP 53827)):
#0  0x00007fe8f239e277 in raise () from /lib64/libc.so.6
#1  0x00007fe8f239f968 in abort () from /lib64/libc.so.6
#2  0x000000000042d8e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:926

I don't think this backtrace itself is significant since I forced the core dump with 'kill -6'. Let me know if you need any other information from the coredump.

Best,
Steve
Comment 33 Jason Booth 2019-04-23 16:33:03 MDT
Created attachment 10007 [details]
Logging patch

Hi Steve,

 Thank you for your patience while I have been investigating this issue. After examining the source, I know the following:

1) There is a persistent connection between the ctld and your dbd which does not seem to get interrupted at all unless you restart the ctld or the dbd.

2) Although I cannot see the root cause, I think we will be able to pinpoint where the failure is happening by changing the loglevel in "_slurm_rpc_accounting_update_msg" to info when logging the dbd updates about new users created.

From:
debug2("Processing RPC: ACCOUNTING_UPDATE_MSG from uid=%d", uid);

To:
info("Processing RPC: ACCOUNTING_UPDATE_MSG from uid=%d", uid);


If there were any errors, we would see entries like the following, but we never see this in your logging.

                error("assoc_mgr_update gave error: %s",
                      slurm_strerror(rc));

I have also added debug logging that prints the type of RPC update for that user.

For example: 
1 == create
2 == delete
 
[2019-04-23T16:25:33.139] slurmdbd: update type: 1 user object type: ghost
[2019-04-23T16:25:44.398] slurmdbd: update type: 6 user object type: ghost
 This patch will appear shortly on the ticket.


This patch should hopefully tell me whether these updates are coming through and whether the ctld is processing them. Once the problem happens again, try to create and delete a test user, then gather the logs and send them in. You are also welcome to look for the response messages as well:

You should see:
"Processing RPC: ACCOUNTING_UPDATE_MSG from" ...
And
"slurmdbd: update type: 1 user object type:" ...
"update type: 6 user object type:" ...

-Jason
Comment 34 Steve Ford 2019-04-30 10:19:20 MDT
We applied this patch yesterday and had the disconnect this morning. When making changes with sacctmgr, I see the "update type" messages in the slurmdbd logs, but not the "Processing RPC" messages in the slurmctld logs.

After restarting slurmctld, I see both messages when I make changes.
Comment 35 Jason Booth 2019-04-30 13:14:09 MDT
Steve,

 If you still have those logs, would you also check for an update similar to the following just after the "slurmdbd: update type" entry:

[2019-04-30T13:04:29.654] debug:  sending updates to cluster at 127.0.0.1(8817) ver 8704

If you are running with a log level lower than debug, I can send over another debug patch which lowers that entry to info. Your comment "...but not the 'Processing RPC' messages in the slurmctld logs" at least confirms that the ctld received no such packet from the dbd.

I think we are losing the packet somehow between the dbd and the ctld.  

Side questions:
 1) Is selinux enabled on this system?
 2) Do you use a Linux antivirus program on this server?
 3) Any tools that modify iptables throughout the day?

-Jason
Comment 36 Steve Ford 2019-04-30 13:25:24 MDT
Jason,

I will increase the log level to debug for now.

> 1) Is selinux enabled on this system?
No
> 2) Do you use a Linux antivirus program on this server? 
No
> 3) Any tools that modify iptables throughout the day? 
No


I'll update you when we see the issue again.

Thanks,
Steve
Comment 37 Steve Ford 2019-05-01 06:46:11 MDT
We saw the issue again this morning. I see messages in the DBD logs complaining about loss of connection at 6:18am.
Comment 38 Steve Ford 2019-05-01 06:46:57 MDT
Created attachment 10086 [details]
slurmdbd log 05/01
Comment 39 Jason Booth 2019-05-01 14:01:15 MDT
Hi Steve,

 Can you send the output of 'dmesg' and the output of 'ip route get 192.168.0.102'?

-Jason
Comment 40 Steve Ford 2019-05-02 07:47:03 MDT
# ip route get 192.168.0.102
local 192.168.0.102 dev lo src 192.168.0.102 
    cache <local>
Comment 41 Steve Ford 2019-05-02 07:48:22 MDT
Created attachment 10103 [details]
dmesg
Comment 42 Jason Booth 2019-05-02 11:00:34 MDT
Hi Steve,

Thank you for sending in that output. All of the information that I have seen so far suggests that the dbd/ctld disconnect is a symptom of other things happening on the system. As shown below, it looks like the OOM killer is invoked several times.

>[Wed May  1 12:42:48 2019] Out of memory: Kill process 25949 (mysqld) score 398 or sacrifice child
>[Wed May  1 12:42:48 2019] Killed process 25949 (mysqld) total-vm:24656840kB, anon-rss:13336180kB, file-rss:0kB, shmem-rss:0kB

This is a non-cgroup OOM and that node should be rebooted.

Not related, but it does fill dmesg with a lot of errors:
> blk_update_request: I/O error, dev fd0, sector 0
It looks like this node has a broken floppy drive.


It is also concerning that OOM is taking down the slurmdbd.
>[Mon Apr 15 14:18:34 2019] Out of memory: Kill process 26694 (slurmdbd) score 479 or sacrifice child
>[Mon Apr 15 14:18:34 2019] Killed process 26694 (slurmdbd) total-vm:17707696kB, anon-rss:15894836kB, file-rss:0kB, shmem-rss:0kB

Here are my suggestions for this system. 
1) Short term: increase memory
2) Long term: look at moving the dbd off the same system as the ctld and onto its own machine.


-Jason
Comment 43 Steve Ford 2019-05-02 11:07:04 MDT
Jason,

I'll increase the available memory on the box. I'll let you know if we see the disconnect again.

Thanks,
Steve
Comment 44 Steve Ford 2019-05-03 06:59:40 MDT
We are still seeing the disconnect. I decreased the amount of memory in use by mariadb and disabled memory overcommit on the server. This server now has plenty of free memory and no additional OOM events have occurred.
Comment 45 Steve Ford 2019-05-06 10:34:37 MDT
Jason,

We have not seen a disconnect since the one we saw on Friday. Since we were previously seeing it every day, it looks like the memory tuning helped. I'll let you know if it happens again.

Best,
Steve
Comment 46 Jason Booth 2019-05-06 10:37:51 MDT
Steve - Thanks for the update, and I will wait to hear more from you. Also, that node looks like it could use a reboot at some point.
Comment 47 Steve Ford 2019-05-07 08:45:50 MDT
Hello Jason,

It happened again. For the three days when the disconnect did not occur, our database backups were not running. Once we fixed the backups, the disconnect came back. It seems that our database backups are triggering this issue.

It currently takes four hours to back up our database, which totals 23GB.

Do you have any recommendations on how we can make our backup process less disruptive to the ctld/dbd connection?

Thanks,
Steve
Comment 48 Jason Booth 2019-05-07 10:09:38 MDT
Steve,

What command do you use to back up the database?

I am curious to know if you use the --lock-tables=false. 

https://stackoverflow.com/questions/104612/run-mysqldump-without-locking-tables

Also, do you have any procedure such as stopping the slurmdbd before running the backup, and do you do any replication?
Comment 49 Steve Ford 2019-05-07 10:33:48 MDT
Jason,

We are dumping the database with `mysqldump slurm`. Judging from the link you shared, I think we need to add '--single-transaction' to prevent the database dump from locking tables and to get a consistent database state. We'll add this option and see if the issue recurs.
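
For reference, the adjusted dump invocation would look something like this (the database name comes from the command above; the output path is illustrative):

```
# dump with a consistent InnoDB snapshot instead of holding table locks
mysqldump --single-transaction slurm > /backup/slurm_$(date +%F).sql
```

Note that --single-transaction gives a consistent snapshot only for transactional (InnoDB) tables; any MyISAM tables would still be dumped without that guarantee.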

We do not currently stop the slurmdbd daemon when we do these dumps nor do we have MySQL replication configured.

Thanks,
Steve
Comment 50 Jason Booth 2019-05-28 15:59:01 MDT
Hi Steve - just following up with you on this case to see if those changes made a difference for your site.
Comment 51 Steve Ford 2019-05-29 07:37:40 MDT
Hello Jason,

Thank you for following up. Since we added the '--single-transaction' option, the disconnect issue has not resurfaced. Thanks for all your help.

Best,
Steve
Comment 52 Jason Booth 2019-05-29 09:28:28 MDT
Glad that the issue has been resolved. Closing this out.