after filtered some DDL event and manually fix downstream, tracker can't track table structure #5272

Closed
lance6716 opened this issue Apr 26, 2022 · 7 comments
Labels: affects-5.3, affects-5.4, affects-6.0, area/dm, severity/moderate, type/bug


What did you do?

Task configuration:

...
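block-allow-list:        # block-allow-list filter rule set for tables matched on the upstream database instance; if the DM version is <= v2.0.0-beta.2, use black-white-list instead
  bw-rule-1:             # name of the block-allow-list rule
    do-dbs: ["test"] # which databases to migrate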
...
filters:
  filter-rule-1:
    schema-pattern: "test"
    table-pattern: "test1"
    events: ["all ddl"]
    action: Ignore

1. Create test.test1 in upstream.
2. Run start-task --remove-meta.
3. In upstream: alter table test1 add column c4 int;
4. In upstream: create table test2 (c int primary key); or wait 30s, so that checkpoints are flushed.
5. Insert into test1. The task now reports an error because downstream doesn't have column c4.
6. In downstream: alter table test1 add column c4 int;
7. Run resume-task.

What did you expect to see?

The task continues running.

What did you see instead?

gen insert sqls failed, sourceTable: test.test1, targetTable: test.test1: Column count doesn't match value count: 3 (columns) vs 4 (values)
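
To make the error concrete, here is a minimal Python sketch of how a column-count mismatch of this kind can arise when an INSERT is generated from a stale tracked structure. It is not DM's genSQL code: the helper `gen_insert` and the column names `c1` to `c3` are assumptions for illustration (the issue only names `c4`).

```python
# Illustrative sketch only; gen_insert is a hypothetical helper, not DM's genSQL.

def gen_insert(table, tracked_columns, row_values):
    """Build a parameterized INSERT from the tracked column list."""
    if len(tracked_columns) != len(row_values):
        raise ValueError(
            "Column count doesn't match value count: "
            f"{len(tracked_columns)} (columns) vs {len(row_values)} (values)"
        )
    cols = ", ".join(f"`{c}`" for c in tracked_columns)
    placeholders = ", ".join("?" for _ in row_values)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", list(row_values)


# Assumed column names: the tracked structure still lacks c4.
stale_columns = ["c1", "c2", "c3"]
# The upstream row already carries a value for c4.
row = (1, 2, 3, 4)

try:
    gen_insert("`test`.`test1`", stale_columns, row)
    message = None
except ValueError as err:
    message = str(err)

# With a structure that includes c4, generation succeeds.
sql, args = gen_insert("`test`.`test1`", stale_columns + ["c4"], row)
```

In this sketch the failure depends only on the tracked column list, which matches the symptom reported here: the error persists after the downstream table has been fixed.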

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

at least v5.4.0


lance6716 commented Apr 26, 2022

Root cause (I only read the source code and didn't verify the actual behaviour):

The first time the error happens, it is the downstream error "Error 1054: Unknown column...". In this case genSQL succeeds and the DML job is added to the queue, and the TableInfo in the in-memory table checkpoint is filled with the downstream table structure. After the error happens, checkpoint.Rollback rolls the in-memory checkpoint back to the flushed checkpoint, which has a nil TableInfo, and the schema tracker resets the table structure.

When the task is resumed (or auto-resumed), neither the table checkpoint nor the schema tracker contains the TableInfo, so we use the downstream table structure. This time, the schema tracker first loads the downstream table structure, and soon afterwards genSQL fails with the error "Column count doesn't match value count". Note that this time we didn't save the TableInfo to the in-memory table checkpoint, but the table checkpoint still exists because it was created when the first error happened and has not been dropped by a DROP TABLE. Then, in checkpoint.Rollback, because the in-memory table checkpoint has a nil TableInfo, the schema tracker is not reset. In the logic that follows, the schema tracker doesn't drop the table either, since the in-memory table checkpoint exists.

To me, this is caused by the TableInfo in the schema tracker being inconsistent with the in-memory table checkpoint. We can fix it when we refine the code.
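
The sequence described above can be sketched as a toy Python model. This is only my reading of the comment, not DM's actual code: `rollback`, the dictionaries, and the column names `c1` to `c3` are all assumptions for illustration. A `None` value stands for a nil TableInfo.

```python
# Toy model only: names and logic are illustrative assumptions, not DM's code.

def rollback(tracker, mem_checkpoint, flushed_checkpoint):
    """Roll the in-memory table checkpoints back to the flushed ones."""
    for table in list(tracker):
        if table not in mem_checkpoint:
            # No in-memory table checkpoint: the tracker drops the table,
            # so its structure would be fetched from downstream again.
            del tracker[table]
        elif mem_checkpoint[table] is not None:
            # In-memory checkpoint holds a TableInfo: reset the tracker
            # to the flushed structure (or drop it if none was flushed).
            flushed_info = flushed_checkpoint.get(table)
            if flushed_info is None:
                del tracker[table]
            else:
                tracker[table] = list(flushed_info)
        # Otherwise the entry exists with a nil TableInfo and the tracker
        # is left untouched. This is the inconsistent case.
    for table in mem_checkpoint:
        mem_checkpoint[table] = flushed_checkpoint.get(table)


three_cols = ["c1", "c2", "c3"]  # assumed names; downstream lacks c4
flushed = {}                     # flushed checkpoint has no TableInfo

# First error: TableInfo was saved in memory, so rollback resets the tracker.
tracker = {"test.test1": list(three_cols)}
mem = {"test.test1": list(three_cols)}
rollback(tracker, mem, flushed)
reset_after_first_error = "test.test1" not in tracker

# Auto-resume: tracker reloads the 3-column structure, genSQL fails before
# TableInfo is saved, and the in-memory entry still exists with None.
tracker["test.test1"] = list(three_cols)
rollback(tracker, mem, flushed)
stale_structure = tracker.get("test.test1")
```

In this model the tracker still holds the three-column structure after the second rollback, so fixing the downstream table and resuming does not help: nothing triggers a reload.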


D3Hunter commented Apr 26, 2022

Cannot reproduce on current master. There is another bug: after auto-resume on the first error, the DML is skipped too, and the global checkpoint is larger than the table checkpoint:

+------------+-----------+----------+------------------+------------+
| id         | cp_schema | cp_table | binlog_name      | binlog_pos |
+------------+-----------+----------+------------------+------------+
| mysql-3306 |           |          | mysql-bin.000001 |        766 |
| mysql-3306 | test      | test2    | mysql-bin.000001 |        505 |
| mysql-3306 | test      | test1    | mysql-bin.000001 |        735 |
+------------+-----------+----------+------------------+------------+

dm-worker1.log
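
As a rough illustration of the comparison above, the following Python sketch lists the table checkpoints that sit behind the global checkpoint. The rows are copied from the table in this comment. The function name `lagging_tables` is hypothetical, and treating the row with an empty cp_schema and cp_table as the global checkpoint is an assumption based on how the table is read in this thread.

```python
# Sketch under stated assumptions; lagging_tables is a hypothetical helper.

rows = [
    # (id, cp_schema, cp_table, binlog_name, binlog_pos)
    ("mysql-3306", "", "", "mysql-bin.000001", 766),
    ("mysql-3306", "test", "test2", "mysql-bin.000001", 505),
    ("mysql-3306", "test", "test1", "mysql-bin.000001", 735),
]


def lagging_tables(rows):
    """Return tables whose checkpoint is behind the global checkpoint."""
    # Assumption: the row with empty schema and table is the global point.
    global_row = next(r for r in rows if r[1] == "" and r[2] == "")
    # Comparing file names as strings assumes equal-width numeric suffixes.
    global_point = (global_row[3], global_row[4])
    return [
        f"{schema}.{table}"
        for _, schema, table, name, pos in rows
        if (schema or table) and (name, pos) < global_point
    ]


lagging = lagging_tables(rows)
```

A lagging table checkpoint is not necessarily a bug on its own, so a check like this only narrows down where to look in the log.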


lance6716 commented Apr 26, 2022

> Cannot reproduce on current master. There is another bug: after auto-resume on the first error, the DML is skipped too, and the global checkpoint is larger than the table checkpoint:
>
> +------------+-----------+----------+------------------+------------+
> | id         | cp_schema | cp_table | binlog_name      | binlog_pos |
> +------------+-----------+----------+------------------+------------+
> | mysql-3306 |           |          | mysql-bin.000001 |        766 |
> | mysql-3306 | test      | test2    | mysql-bin.000001 |        505 |
> | mysql-3306 | test      | test1    | mysql-bin.000001 |        735 |
> +------------+-----------+----------+------------------+------------+

If you check out the test part of my PR, it's expected to fail.

Also, please upload the log for the above case. If the table is skipped, its table checkpoint may not be updated, but the DML should not be lost.


D3Hunter commented Apr 26, 2022

5.3 doesn't have this issue and can auto-recover, because keepalive fails and another syncer is recreated.


niubell commented Apr 27, 2022

/assign gmhdbjd


niubell commented Apr 27, 2022

/unassign lance6716

@lance6716

fixed by #5273
