Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dtle ha - kill or stop dest dtle at full repl stage cause src MySQL Waiting for table metadata lock when execute ddl #904

Closed
asiroliu opened this issue Mar 7, 2022 · 4 comments
Assignees
Labels
VERIFIED verified issue
Milestone

Comments

@asiroliu
Copy link
Collaborator

asiroliu commented Mar 7, 2022

Description

dtle ha - kill or stop dest dtle at full repl stage cause src MySQL Waiting for table metadata lock when execute ddl

Steps to reproduce the issue

  1. create dtle job2.
job.json
{
  "job_id": "kill_dest_dtle_full_repl",
  "is_password_encrypted": true,
  "task_step_name": "all",
  "failover": true,
  "retry": 2,
  "src_task": {
    "task_name": "src",
    "node_id": "9ec8e26d-81f1-cbdb-ca7d-34a955aef901",
    "mysql_src_task_config": {
      "gtid": "",
      "binlog_relay": false
    },
    "drop_table_if_exists": true,
    "skip_create_db_table": false,
    "repl_chan_buffer_size": 120,
    "chunk_size": 1,
    "group_max_size": 1,
    "group_timeout": 100,
    "connection_config": {
      "database_type": "MySQL",
      "host": "172.100.9.2",
      "port": 3306,
      "user": "test_src",
      "password": "wg3VHFuOiamoLXWi63NRxrlwLIc2QhWumQoloqXJr2sdlcCCiQr0+zMji2L5wWeLsUUJyBjjfR7u4jVdfSJo6WjDzqVr93k+9L/bo1glxpeYEfifkeThQmoFtKu992LIgr2bpmEyuBF+NyvQnCCvWZYlYZrhdY1wJQBghoNznF0="
    },
    "replicate_do_db": [
      {
        "table_schema": "action_db_1",
        "tables": [
          {
            "table_name": "sbtest1"
          }
        ]
      }
    ]
  },
  "dest_task": {
    "task_name": "dest",
    "node_id": "d6bf03c4-5716-8656-68c3-6c0aa6201422",
    "mysql_dest_task_config": {
      "use_my_sql_dependency": false,
      "dependency_history_size": 2500,
      "parallel_workers": 32
    },
    "connection_config": {
      "database_type": "MySQL",
      "host": "172.100.9.1",
      "port": 3306,
      "user": "test_dest",
      "password": "L/bz/aGlaPYb1Ohn5js7ad/vGEb/9jTzjvHpYCIiJ4Kuuo8XAPAR0TMMXBALXKwOo5NWnHT+mP94LQIOngFsModPSKCMtOPTYbQ2Z9lgjCsBL4f2YyqM7AruY69x9L0c7tZ9T9OkWuGAUbny7WTAyJfwUWB+6kThpQNvHVp9Yj0="
    }
  }
}
  1. kill or stop dest dtle when job is full repl stage
shell> pgrep nomad | xargs -r -n 1 kill -9
shell> pgrep dtle | xargs -r -n 1 kill -9

shell> systemctl stop dtle-nomad
  1. wait for dtle ha finished
  2. src MySQL execute ddl and it's blocked
SQL> DROP DATABASE IF EXISTS `action_db_1`;
  1. open other MySQL client
mysql> show processlist;
+----+----------+---------------------+------+------------------+------+---------------------------------------------------------------+---------------------------------------+
| Id | User     | Host                | db   | Command          | Time | State                                                         | Info                                  |
+----+----------+---------------------+------+------------------+------+---------------------------------------------------------------+---------------------------------------+
| 17 | test_src | 172.100.9.12:43784  | NULL | Sleep            |  111 |                                                               | NULL                                  |
| 18 | test_src | 172.100.9.12:52076  | NULL | Sleep            |   93 |                                                               | NULL                                  |
| 19 | test_src | 172.100.9.12:52086  | NULL | Sleep            |   84 |                                                               | NULL                                  |
| 22 | test_src | 172.100.9.12:60062  | NULL | Binlog Dump GTID |   84 | Master has sent all binlog to slave; waiting for more updates | NULL                                  |
| 28 | test     | 172.100.9.253:59784 | NULL | Query            |   26 | Waiting for table metadata lock                               | DROP DATABASE IF EXISTS `action_db_1` |
| 29 | test     | localhost           | NULL | Query            |    0 | starting                                                      | show processlist                      |
+----+----------+---------------------+------+------------------+------+---------------------------------------------------------------+---------------------------------------+

Output of ./dtle version:**

9.9.9.9-master-8ae751b
@asiroliu
Copy link
Collaborator Author

asiroliu commented Mar 7, 2022

src MySQL general log after kill dest dtle

2022-03-07T05:13:38.404220Z	   22 Query	SELECT * FROM `action_db_1`.`sbtest1` where (((`id` > '4999'))) and (true) order by `id` asc LIMIT 1
2022-03-07T05:13:38.405216Z	   22 Query	SELECT * FROM `action_db_1`.`sbtest1` where (((`id` > '5000'))) and (true) order by `id` asc LIMIT 1
2022-03-07T05:13:38.443356Z	   22 Query	COMMIT
2022-03-07T05:13:38.443708Z	   22 Quit
2022-03-07T05:13:38.443712Z	   21 Quit

# kill dest dtle

2022-03-07T05:13:39.456301Z	   20 Query	SELECT @@SERVER_UUID /*dtle*/
2022-03-07T05:13:39.458437Z	   25 Connect	[email protected] on  using TCP/IP
2022-03-07T05:13:39.458792Z	   25 Query	SHOW GLOBAL VARIABLES LIKE 'BINLOG_CHECKSUM'
2022-03-07T05:13:39.462365Z	   25 Query	SET @master_binlog_checksum='NONE'
2022-03-07T05:13:39.462527Z	   25 Query	SET @master_heartbeat_period=3000000000
2022-03-07T05:13:39.462735Z	   25 Binlog Dump GTID	Log: '' Pos: 4 GTIDs: '00003306-0000-0000-0000-000000003306:1-16,
38697ae2-9dd5-11ec-8c09-0242ac640903:1-11'
2022-03-07T05:13:39.463920Z	   20 Query	SELECT @@SERVER_UUID /*dtle*/
2022-03-07T05:13:51.611325Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.612574Z	   24 Query	SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
            WHERE SCHEMA_NAME = 'action_db_1'
2022-03-07T05:13:51.616357Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.618352Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.619145Z	   24 Query	SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
            WHERE SCHEMA_NAME = 'action_db_1'
2022-03-07T05:13:51.620038Z	   24 Query	SHOW CREATE DATABASE `action_db_1`
2022-03-07T05:13:51.624452Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.624975Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.625501Z	   24 Query	SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
            WHERE SCHEMA_NAME = 'action_db_1'
2022-03-07T05:13:51.626373Z	   24 Query	SELECT DISTINCT TABLES.TABLE_NAME

            FROM INFORMATION_SCHEMA.TABLES JOIN INFORMATION_SCHEMA.COLUMNS ON
                TABLES.TABLE_SCHEMA = COLUMNS.TABLE_SCHEMA AND
                TABLES.TABLE_NAME = COLUMNS.TABLE_NAME
            LEFT JOIN INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS ON
                TABLES.TABLE_SCHEMA = REFERENTIAL_CONSTRAINTS.CONSTRAINT_SCHEMA
                AND
                TABLES.TABLE_NAME = REFERENTIAL_CONSTRAINTS.TABLE_NAME
            LEFT JOIN (
                  SELECT CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME,
                         GROUP_CONCAT(COLUMN_NAME ORDER BY ORDINAL_POSITION)
                         AS COLUMN_NAME, REFERENCED_TABLE_SCHEMA,
                         GROUP_CONCAT(REFERENCED_COLUMN_NAME ORDER BY
                         ORDINAL_POSITION) AS REFERENCED_COLUMN_NAME
                  FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
                  GROUP BY CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME,
                           REFERENCED_TABLE_SCHEMA
            ) AS KEY_COLUMN_USAGE ON
                TABLES.TABLE_SCHEMA = KEY_COLUMN_USAGE.CONSTRAINT_SCHEMA
                AND
                TABLES.TABLE_NAME = KEY_COLUMN_USAGE.TABLE_NAME
            WHERE TABLES.TABLE_SCHEMA = 'action_db_1' AND TABLE_TYPE <> 'VIEW'

            ORDER BY TABLES.TABLE_NAME
2022-03-07T05:13:51.705678Z	   24 Query	SELECT NAME

            FROM mysql.proc
            WHERE DB = 'action_db_1' AND TYPE = 'FUNCTION'
2022-03-07T05:13:51.706463Z	   24 Query	SELECT NAME

            FROM mysql.proc
            WHERE DB = 'action_db_1' AND TYPE = 'PROCEDURE'
2022-03-07T05:13:51.707075Z	   24 Query	SELECT TABLE_NAME

            FROM INFORMATION_SCHEMA.VIEWS
            WHERE TABLE_SCHEMA = 'action_db_1'
2022-03-07T05:13:51.707649Z	   24 Query	SELECT TRIGGER_NAME

            FROM INFORMATION_SCHEMA.TRIGGERS
            WHERE TRIGGER_SCHEMA = 'action_db_1'
2022-03-07T05:13:51.712731Z	   24 Query	SELECT NAME

            FROM mysql.event
            WHERE DB = 'action_db_1'
2022-03-07T05:13:51.798245Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.799131Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.799565Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.800009Z	   24 Query	SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
            WHERE SCHEMA_NAME = 'action_db_1'
2022-03-07T05:13:51.800514Z	   24 Query	USE `action_db_1`
2022-03-07T05:13:51.800823Z	   24 Query	SHOW CREATE TABLE `sbtest1`
2022-03-07T05:13:51.803945Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.805024Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.805892Z	   24 Query	SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE, AUTO_INCREMENT, AVG_ROW_LENGTH, CHECKSUM, TABLE_COLLATION, TABLE_COMMENT, ROW_FORMAT, CREATE_OPTIONS
  FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'action_db_1' AND TABLE_NAME = 'sbtest1'
2022-03-07T05:13:51.806717Z	   24 Query	SELECT ORDINAL_POSITION, COLUMN_NAME, COLUMN_TYPE, IS_NULLABLE,
         COLUMN_DEFAULT, EXTRA, COLUMN_COMMENT, COLUMN_KEY
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_SCHEMA = 'action_db_1' AND TABLE_NAME = 'sbtest1'
2022-03-07T05:13:51.807690Z	   24 Query	SELECT PARTITION_NAME, SUBPARTITION_NAME, PARTITION_ORDINAL_POSITION,
         SUBPARTITION_ORDINAL_POSITION, PARTITION_METHOD, SUBPARTITION_METHOD,
         PARTITION_EXPRESSION, SUBPARTITION_EXPRESSION, PARTITION_DESCRIPTION
  FROM INFORMATION_SCHEMA.PARTITIONS
  WHERE TABLE_SCHEMA = 'action_db_1' AND TABLE_NAME = 'sbtest1'
2022-03-07T05:13:51.810913Z	   24 Query	SHOW CREATE TABLE `action_db_1`.`sbtest1`
2022-03-07T05:13:51.811824Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.812221Z	   24 Query	SELECT COUNT(*) FROM `action_db_1`.`sbtest1`
2022-03-07T05:13:51.815593Z	   24 Query	SELECT @@SQL_MODE
2022-03-07T05:13:51.816090Z	   24 Query	SELECT @@session.max_allowed_packet
2022-03-07T05:13:51.817692Z	   24 Query	CHECKSUM TABLE `action_db_1`.`sbtest1`
2022-03-07T05:13:51.842114Z	   26 Connect	[email protected] on  using TCP/IP
2022-03-07T05:13:51.843053Z	   26 Query	SELECT SCHEMA_NAME FROM information_schema.SCHEMATA
2022-03-07T05:13:51.845332Z	   26 Query	DROP DATABASE IF EXISTS `action_db_1`

@asiroliu asiroliu changed the title dtle ha - kill dest dtle cause src MySQL Waiting for table metadata lock when execute ddl dtle ha - kill dest dtle at full repl stage cause src MySQL Waiting for table metadata lock when execute ddl Mar 8, 2022
@asiroliu
Copy link
Collaborator Author

asiroliu commented Mar 8, 2022

kill dest dtle at incr repl stage can't cause this issue

@asiroliu asiroliu changed the title dtle ha - kill dest dtle at full repl stage cause src MySQL Waiting for table metadata lock when execute ddl dtle ha - kill or stop dest dtle at full repl stage cause src MySQL Waiting for table metadata lock when execute ddl Mar 8, 2022
@asiroliu
Copy link
Collaborator Author

asiroliu commented Mar 9, 2022

通过更多的测试发现,只要源端的task在全量复制阶段重建并且是还是原来的dtle被分配,就会触发执行ddl 时 Waiting for table metadata lock
kill 目标端dtle、stop目标端dtle、暂停job都可以复现

复现关键字:

  1. 源端task重建
  2. 仍由原dtle执行该源端task
  3. 全量阶段

@asiroliu asiroliu added this to the 4.22.03.0 milestone Mar 22, 2022
@LordofAvernus
Copy link
Collaborator

LordofAvernus commented Mar 23, 2022

通过上述流程复现成功:

  •  DROP DATABASE  的线程状态为 Waiting for table metadata lock
    查询相关文档,发现该状态场景为 当前库有其他的线程正在操作(读/写)  
    
  • 查看当前是否有事务未结束 参考
    select * from information_schema.innodb_trx;
    image

  • 使用SQL查询当前未提交的事务最新的SQL命令 参考
    SELECT ps.id ,trx.trx_id ,trx_started ,trx_state ,trx_isolation_level ,esh.event_id ,esh.timer_wait ,esh.event_name ,esh.sql_text ,esh.returned_sqlstate ,esh.mysql_errno ,esh.message_text ,esh.errors ,esh.warnings FROM information_schema.innodb_trx trx JOIN information_schema.processlist ps ON trx.trx_mysql_thread_id = ps.id LEFT JOIN performance_schema.threads th ON th.processlist_id = trx.trx_mysql_thread_id LEFT JOIN performance_schema.events_statements_history esh ON esh.thread_id = th.thread_id WHERE trx.trx_started < current_time - INTERVAL 2 second AND ps.user != 'SYSTEM_USER' ORDER BY esh.event_id;

    从结果看出,当前未提交事务中有查询SQL,并且是全量同步过程中使用的SQL
    image

  • 以上基本可以确定为源端任务重启有部分连接MySQL的协程未正常关闭,通过查看DTLE提供的pprof接口查看未释放的协程 http://10.186.61.75:8190/debug/pprof/goroutine?debug=1

image

最终结论:
当extractor重启时,dumper正在全量读取数据时,没有正确关闭channel,导致出现协程无法关闭,读取事务未结束。DDL语句夯死

@asiroliu asiroliu added the VERIFIED verified issue label Mar 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
VERIFIED verified issue
Projects
None yet
Development

No branches or pull requests

2 participants