| Summary: | slurmdbd error message after upgrade | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | mengxing cheng <mxcheng> |
| Component: | slurmdbd | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | tim |
| Version: | 17.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Chicago | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
mengxing cheng
2018-02-05 09:13:03 MST
The error indicates that you are trying to insert too much data into InnoDB too quickly and the InnoDB log is filling up before the data can be flushed into the main data files. What underlying storage are you using for the database and for the innodb log files? It is also possible that your MySQL innodb_log_file_size is way too small, increasing that could help. I cannot confirm that but it is also possible that the transaction have not ended correctly and the migration of job table is done partially. If you still have the old database, you could compare that all the jobs until the update are in the new database. You can also check for orphaned jobs in mysql: 'sacctmgr show lostjobs' Send me the output of this command. -------- In order to tune the database: Step 1) Check the following parameters in /etc/my.cnf and increase them depending on your current values: [mysqld] innodb_log_buffer_size = xxM -> up to 3 times your value innodb_buffer_pool_size = yyG innodb_log_file_size = zzM -> up to 6 times your value It is worth to read more about this parameters, I am not a DBA expert and this is only an idea. Step 2) You must stop mysql in a clean way, very important: mysql -uroot -p -e"SET GLOBAL innodb_fast_shutdown = 0;" Step 3) systemctl stop mysql Step 4) mv /var/lib/mysql/ib_logfile* /some/tmp_backup_dir/ Step 5) systemctl start mysql innodb_fast_shutdown = 0 forces InnoDB to completely purge transactional changes from all InnoDB moving parts, including the transactional logs (ib_logfile0, ib_logfile1). Step 6) Check everything is working fine, sacct, jobs, etc. Step 7) You can remove the backup rm -rf /some/tmp_backup_dir/ ---- Tell me how everything goes. +Extra info: https://dba.stackexchange.com/questions/16510/mysql-innodb-innodb-error-the-age-of-the-last-checkpoint-is-innodb-which-exc https://www.percona.com/blog/2008/11/21/how-to-calculate-a-good-innodb-log-file-size/ Felip, thank you for help! We use mysqld. I don't see lostjobs. # sacctmgr show lostjobs Runaway Jobs: No runaway jobs found Partially updated database is a problem I worry about. Slurmdbd reported a lot following errors on Feb 05th though it is fine now. Could you check if those errors are related to the upgrade or some other problem. 2018-02-05T10:17:49.454] error: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx' at line 1 insert into "midway_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (42890296, UNIX_TIMESTAMP(), 42890295, 0, 0, 4294967294, 36730, 172, 425015474, 425015474, 'midway2-0190', 0, 2160, 1517847464, 1517847464, 1517847464, 'angle', 0, 1, 131946, 1, 1, 9223372036854783808, 'pi-dinner', 'broadwl', '361', NULL, 0, '1=1,2=8000,3=18446744073709551614,4=1,5=1', '1=1,2=8000,4=1', '/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=36730, id_user=425015474, id_group=425015474, nodelist='midway2-0190', id_resv=0, timelimit=2160, time_submit=1517847464, time_eligible=1517847464, time_start=1517847464, mod_time=UNIX_TIMESTAMP(), job_name='angle', track_steps=0, id_qos=172, state=greatest(state, 1), priority=131946, cpus_req=1, nodes_alloc=1, mem_req=9223372036854783808, id_array_job=42890295, id_array_task=0, pack_job_id=0, pack_job_offset=4294967294, account='pi-dinner', `partition`='broadwl', node_inx='361', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=8000,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=8000,4=1', work_dir='/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions' [2018-02-05T10:17:49.455] error: It looks like the storage has gone away trying to reconnect [2018-02-05T10:17:49.455] error: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx' at line 1 insert into "midway_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (42890296, UNIX_TIMESTAMP(), 42890295, 0, 0, 4294967294, 36730, 172, 425015474, 425015474, 'midway2-0190', 0, 2160, 1517847464, 1517847464, 1517847464, 'angle', 0, 1, 131946, 1, 1, 9223372036854783808, 'pi-dinner', 'broadwl', '361', NULL, 0, '1=1,2=8000,3=18446744073709551614,4=1,5=1', '1=1,2=8000,4=1', '/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=36730, id_user=425015474, id_group=425015474, nodelist='midway2-0190', id_resv=0, timelimit=2160, time_submit=1517847464, time_eligible=1517847464, time_start=1517847464, mod_time=UNIX_TIMESTAMP(), job_name='angle', track_steps=0, id_qos=172, state=greatest(state, 1), priority=131946, cpus_req=1, nodes_alloc=1, mem_req=9223372036854783808, id_array_job=42890295, id_array_task=0, pack_job_id=0, pack_job_offset=4294967294, account='pi-dinner', `partition`='broadwl', node_inx='361', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=8000,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=8000,4=1', work_dir='/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions' [2018-02-05T10:17:49.456] error: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx' at line 1 insert into "midway_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (42890297, UNIX_TIMESTAMP(), 42890295, 1, 0, 4294967294, 36730, 172, 425015474, 425015474, 'midway2-0221', 0, 2160, 1517847464, 1517847464, 1517847464, 'angle', 0, 1, 131946, 1, 1, 9223372036854783808, 'pi-dinner', 'broadwl', '380', NULL, 0, '1=1,2=8000,3=18446744073709551614,4=1,5=1', '1=1,2=8000,4=1', '/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=36730, id_user=425015474, id_group=425015474, nodelist='midway2-0221', id_resv=0, timelimit=2160, time_submit=1517847464, time_eligible=1517847464, time_start=1517847464, mod_time=UNIX_TIMESTAMP(), job_name='angle', track_steps=0, id_qos=172, state=greatest(state, 1), priority=131946, cpus_req=1, nodes_alloc=1, mem_req=9223372036854783808, id_array_job=42890295, id_array_task=1, pack_job_id=0, pack_job_offset=4294967294, account='pi-dinner', `partition`='broadwl', node_inx='380', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=8000,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=8000,4=1', work_dir='/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions' [2018-02-05T10:17:49.456] error: It looks like the storage has gone away trying to reconnect I don't see error in mysqld.log but there are many many following logs in mysql-slow.log # Query_time: 0.000180 Lock_time: 0.000059 Rows_sent: 0 Rows_examined: 0 SET timestamp=1517868102; select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && (t1.user='abourassa') order by wckey_name, user; # User@Host: slurm[slurm] @ midway-mgt.rcc.uchicago.edu [128.135.112.70] # Query_time: 0.000179 Lock_time: 0.000049 Rows_sent: 0 Rows_examined: 0 SET timestamp=1517868104; select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && (t1.user='abrax') order by wckey_name, user; Hi Mengxing, (In reply to mengxing cheng from comment #3) > Felip, thank you for help! We use mysqld. I don't see lostjobs. > > # sacctmgr show lostjobs > Runaway Jobs: No runaway jobs found > This is good, if you didn't see the database to crash probably the update was fine. > Partially updated database is a problem I worry about. You can just check what I said in first comment, to compare the list of jobs from the old and the new database, you will be 100% sure then that all data is migrated, but I am confident that it is. > Slurmdbd reported a > lot following errors on Feb 05th though it is fine now. Could you check if > those errors are related to the upgrade or some other problem. > The errors you report are coming from a bug. Your user have a directory with "'" character and we are not escaping this correctly: "/project/dinner/lhong/pT432_pE318_CIIADP/SMD_HETK_preOpen/SMD_HETK/umbrella/analysis/A'-loop interactions" I will start digging into it and providing a fix for this issue. In the meantime I recommend you to look in the filesystem for weird paths like this. They can give problems not only with Slurm but other programs and utilities. > > I don't see error in mysqld.log but there are many many following logs in > mysql-slow.log > # Query_time: 0.000180 Lock_time: 0.000059 Rows_sent: 0 Rows_examined: 0 > SET timestamp=1517868102; > select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from > "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && > (t1.user='abourassa') order by wckey_name, user; > # User@Host: slurm[slurm] @ midway-mgt.rcc.uchicago.edu [128.135.112.70] > # Query_time: 0.000179 Lock_time: 0.000049 Rows_sent: 0 Rows_examined: 0 > SET timestamp=1517868104; > select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from > "midway_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && > (t1.user='abrax') order by wckey_name, user; This indicates your mysql is slow. Check comment #2 and tell me which are your values in my.cnf, also tell me which underlying filesystem type and hardware have for the database. You definitively need some tuning in MySQL. Is your database used for anything else than Slurm? Felip, The problem of directory path including special ' character has been fixed. Thanks for pointing it out. I am checking database but glad to know that you are confident in the upgrade success. The filesystem is ext4 on top of drbd (Distributed Replicated Block Device) HA replication constructed from mechanic HDDs. I agree that they are not fast backend storage system for database. The database server hosts a slurmdbd of another very small cluster of 150 nodes. Here is my.cnf # cat /etc/my.cnf [mysqld] #port = 8881 #skip_grant_tables=1 #innodb_force_recovery=1 #innodb_purge_threads=0 datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 log-bin=/var/lib/mysql/mysql-bin.log relay-log=/var/lib/mysql/mysql-relay-bin.log server-id=10 query_cache_type=1 key_buffer = 256M expire_logs_days = 3 innodb_file_per_table innodb_buffer_pool_size=4G innodb_log_file_size=64M #innodb_log_file_size=128M #innodb_lock_wait_timeout=400 innodb_lock_wait_timeout=900 thread_cache_size=4 table_open_cache=512 query_cache_size = 64M query_cache_limit = 32M join_buffer_size = 32M tmp_table_size = 1G max_heap_table_size = 1G slow-query-log = 1 slow-query-log-file = /var/log/mysql-slow.log long_query_time = 1 log-queries-not-using-indexes [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/mysqld/mysqld.pid (In reply to mengxing cheng from comment #7) > Felip, > > The problem of directory path including special ' character has been fixed. > Thanks for pointing it out. I am checking database but glad to know that you > are confident in the upgrade success. > > > The filesystem is ext4 on top of drbd (Distributed Replicated Block Device) > HA replication constructed from mechanic HDDs. I agree that they are not > fast backend storage system for database. The database server hosts a > slurmdbd of another very small cluster of 150 nodes. > > Here is my.cnf > # cat /etc/my.cnf > [mysqld] > > #port = 8881 > #skip_grant_tables=1 > #innodb_force_recovery=1 > #innodb_purge_threads=0 > > datadir=/var/lib/mysql > socket=/var/lib/mysql/mysql.sock > user=mysql > # Disabling symbolic-links is recommended to prevent assorted security risks > symbolic-links=0 > > log-bin=/var/lib/mysql/mysql-bin.log > relay-log=/var/lib/mysql/mysql-relay-bin.log > server-id=10 > > query_cache_type=1 > > key_buffer = 256M > > expire_logs_days = 3 > > innodb_file_per_table > innodb_buffer_pool_size=4G > innodb_log_file_size=64M > #innodb_log_file_size=128M > #innodb_lock_wait_timeout=400 > innodb_lock_wait_timeout=900 > thread_cache_size=4 > table_open_cache=512 > > query_cache_size = 64M > query_cache_limit = 32M > join_buffer_size = 32M > > tmp_table_size = 1G > max_heap_table_size = 1G > > slow-query-log = 1 > slow-query-log-file = /var/log/mysql-slow.log > long_query_time = 1 > log-queries-not-using-indexes > > [mysqld_safe] > log-error=/var/log/mysqld.log > pid-file=/var/run/mysqld/mysqld.pid Given this information I think that innodb_log_file_size could be increased. I see a commented value of 128MB, did you make this change in the past? DRBD over mechanical disks is not the best for a MySQL, but probably tuning this values will help. So, I recommend you to do: 1. Tune the DB values in my.conf as noted in previous comments 2. Do the job queries on old and new databases 3. I have the patch ready and pending to review&commit, once done I'll let you know the commit # just in case you want to apply it. Regards, Felip Felip, Thank you very much for help. I can wait for new release for the special character issue because it is not urgent. You can close this ticket if you like. Mengxing |