pessimistic shard DDL can't handle new DDL when the old one has not been resolved #628

csuzhangxc · 2020-04-23T06:43:43Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

What did you do? If possible, provide a recipe for reproducing the error.

Execute multiple DDL statements in quick succession in pessimistic shard DDL mode.

What did you expect to see?

These DDLs are replicated normally.

What did you see instead?

DDL coordination is blocked.

All tables in the DM-workers (as returned by query-status) are synced, but some sources in DM-master (as returned by show-ddl-locks) are unsynced.

This appears to be caused by the error `fail to try sync shard DDL lock, sharding ddls in ddl lock *** is different with ***`: the DDL info received from DM-worker is then ignored (and DM-worker does not resend it).

For now, we can run pause-task and then resume-task so that DM-worker resends the DDL info, which resolves the block manually.

Versions of the cluster
- DM version (run dmctl -V or dm-worker -V or dm-master -V):
```
DM master before git commit hash b2a8fb85e0e93540382518b487d934cd2e392066
```

lichunzhu · 2020-05-14T13:00:37Z

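Problem description

The pessimistic shard DDL flow in DM-HA is roughly as follows:

1. dm-worker receives a DDL and puts the DDL info into etcd.
2. dm-master watches etcd until it has the DDL from every source of the task. The first worker to send DDL info becomes the lock owner.
3. Once dm-master has collected all the info, it sends an exec operation to the lock owner through etcd.
4. dm-worker receives the exec operation, executes the DDL, and reports through an etcd done operation that the DDL has been synced. **The worker then continues with its own task.**
5. dm-master receives the done message from the owner and sends a skip operation to the non-owners.
6. Each non-owner dm-worker skips the DDL and reports through an etcd done operation, **then continues with its own task**. Every dm-worker deletes its info and marks its operation as done in the same etcd txn.
7. dm-master receives the done operations from all dm-workers, first deletes all operations from etcd, and then deletes the lock information it holds.

In steps 4 and 6, once a dm-worker has finished its exec/skip and continues with its own task, it puts new DDL info into etcd whenever it meets a new DDL, in an attempt to build a new DDL lock. If the master has not yet collected all operations of the previous lock at that point, it has not yet removed the lock associated with this worker. The master then treats the new DDL as one that does not follow the pessimistic rules and ignores it, which blocks replication.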
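The check that causes the new info to be dropped can be sketched as a toy model. This is only an illustration under assumptions: `Lock`, `Master`, `on_info`, and `on_done` are hypothetical names, etcd is left out entirely, and the real DM-master code is not reproduced here. Only the idea of a `try_sync` that rejects a differing DDL comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Lock:
    """Hypothetical stand-in for the lock DM-master keeps for a sharded table."""
    owner: str
    ddls: list
    sources: set                                # every source covered by the task
    ready: set = field(default_factory=set)     # sources whose DDL info has arrived
    done: set = field(default_factory=set)      # sources whose operation is done

    def try_sync(self, source, ddls):
        # A DDL that differs from the one held by the lock is rejected.
        if ddls != self.ddls:
            raise ValueError(
                f"sharding ddls in ddl lock {self.ddls} is different with {ddls}"
            )
        self.ready.add(source)
        return self.ready == self.sources


class Master:
    """Toy DM-master that tracks a single lock (one table, one task)."""

    def __init__(self, sources):
        self.sources = set(sources)
        self.lock = None
        self.ignored = []                       # infos dropped after a try_sync error

    def on_info(self, source, ddls):
        if self.lock is None:
            # The first source to report becomes the lock owner (step 2).
            self.lock = Lock(owner=source, ddls=ddls, sources=self.sources)
        try:
            return self.lock.try_sync(source, ddls)
        except ValueError:
            self.ignored.append((source, ddls))  # nothing triggers a later retry
            return False

    def on_done(self, source):
        self.lock.done.add(source)
        if self.lock.done == self.sources:
            self.lock = None                    # step 7: the lock is removed


master = Master(["worker1", "worker2"])
master.on_info("worker1", ["ddl1"])
master.on_info("worker2", ["ddl1"])
master.on_done("worker1")
master.on_info("worker1", ["ddl2"])             # arrives before the ddl1 lock is removed
print(master.ignored)
```

In this model the `ddl2` info from `worker1` ends up in `ignored` because the `ddl1` lock still exists when it arrives.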

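Error example

The task involves 2 dm-workers, and each worker executes ddl1 and then ddl2. Assume that worker1 is the owner of ddl1 and that ddl1 and ddl2 operate on the same table.

A possible timeline is:

1. worker1 receives ddl1 and puts ddl1 info into etcd. dm-master observes the info, creates the ddl1 lock, and sets worker1 as the ddl1 owner.
2. worker2 receives ddl1 and puts ddl1 info into etcd. dm-master marks the ddl1 lock as synced and puts a ddl1 exec operation under worker1's key in etcd.
3. worker1 receives the ddl1 exec operation and executes the DDL. On success, it puts a ddl1 exec done operation and deletes the ddl1 info in a single etcd txn. worker1 then continues scanning the subsequent binlog.
4. worker1 scans ddl2 and puts ddl2 info into etcd. dm-master still holds the lock for this table, so TrySync finds the table's ddl1 lock, detects that ddl1 and ddl2 differ, returns an error, and the info is ignored.
5. dm-master receives worker1's exec done -> puts a ddl1 ignore operation for worker2 -> worker2 ignores the DDL, puts a ddl1 ignore done operation, and deletes the ddl1 info -> dm-master has received all operations -> dm-master removes the lock.
6. dm-worker2 receives ddl2 and puts ddl2 info into etcd. dm-master observes the info, creates the ddl2 lock, and sets worker2 as the ddl2 owner. However, the put event for dm-worker1's ddl2 info was already ignored in step 4, so the lock cannot be synced automatically, and both dm-worker1 and dm-worker2 are stuck here.
7. After manual intervention (pause-task followed by resume-task), the worker puts the DDL info again, which triggers a new event, so the lock is built normally and replication continues.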
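The timeline above can be replayed against a small event-driven model. This is a sketch under assumptions: the `(kind, source, ddl)` event tuples and the `replay` function are invented for illustration, exec/skip operations are collapsed into a single `done` event per worker, and only one table is modelled.

```python
def replay(events, sources=("worker1", "worker2")):
    """Replay (kind, source, ddl) events against a toy DM-master (hypothetical model)."""
    lock = None          # {"ddl": ..., "owner": ..., "ready": set(), "done": set()}
    ignored = []         # infos dropped because they did not match the current lock
    resolved = []        # DDLs whose lock was fully resolved and removed
    for kind, source, ddl in events:
        if kind == "info":
            if lock is None:
                lock = {"ddl": ddl, "owner": source, "ready": set(), "done": set()}
            if lock["ddl"] != ddl:
                ignored.append((source, ddl))   # TrySync error: the info is dropped
                continue
            lock["ready"].add(source)
        elif kind == "done" and lock is not None:
            lock["done"].add(source)
            if lock["done"] == set(sources):
                resolved.append(lock["ddl"])
                lock = None
    # Report which sources the remaining lock (if any) is still waiting for.
    pending = None
    if lock is not None:
        pending = (lock["ddl"], sorted(set(sources) - lock["ready"]))
    return resolved, ignored, pending


timeline = [
    ("info", "worker1", "ddl1"),   # step 1
    ("info", "worker2", "ddl1"),   # step 2
    ("done", "worker1", "ddl1"),   # step 3
    ("info", "worker1", "ddl2"),   # step 4: ignored, the ddl1 lock still exists
    ("done", "worker2", "ddl1"),   # step 5: the ddl1 lock is removed
    ("info", "worker2", "ddl2"),   # step 6: the ddl2 lock waits for worker1 forever
]

print(replay(timeline))
# Step 7: pause-task + resume-task makes worker1 put its ddl2 info again.
print(replay(timeline + [("info", "worker1", "ddl2")]))
```

In the first replay the `ddl2` lock is left waiting for `worker1`; once the info is put again, no source is missing.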

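Solutions

Option 1

When DM-worker encounters a DDL while DM-master is still coordinating a lock, it does not send new DDL info for the related table to DM-master.
One way to check this is to inspect the etcd operation for this worker and table: if the worker finds that its own operation in etcd is done but has not yet been deleted, it waits until the operation is deleted before continuing. In any other case it reports an error directly.
This option slightly reduces efficiency, but overall it is minimally invasive to the existing logic, the performance impact is small, and no logic needs to be rebuilt, so I personally prefer option 1.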
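A minimal sketch of this wait-before-put check, using an in-memory dictionary in place of etcd. The key layout (`operation/<worker>/<table>`, `info/<worker>/<table>`), the polling loop, and the `timeout` parameter are all assumptions made for this illustration. A real implementation would more likely watch the key than poll it.

```python
import threading
import time


class FakeEtcd(dict):
    """In-memory stand-in for etcd. The key layout is an assumption of this sketch."""


def put_info_when_clear(etcd, worker, table, ddl, timeout=1.0, poll=0.01):
    """Put new DDL info only after the worker's previous done operation is gone."""
    op_key = f"operation/{worker}/{table}"
    deadline = time.monotonic() + timeout
    while (op := etcd.get(op_key)) is not None:
        if not op["done"]:
            # Anything other than a lingering done operation is treated as an error.
            raise RuntimeError(f"unexpected unfinished operation at {op_key}")
        if time.monotonic() >= deadline:
            # The timeout is an addition of this sketch, not part of the proposal.
            raise TimeoutError(f"operation at {op_key} was not deleted in time")
        time.sleep(poll)
    etcd[f"info/{worker}/{table}"] = ddl


etcd = FakeEtcd({"operation/worker1/t1": {"ddl": "ddl1", "done": True}})
# Simulate DM-master deleting the done operation shortly afterwards.
threading.Timer(0.05, etcd.pop, args=("operation/worker1/t1",)).start()
put_info_when_clear(etcd, "worker1", "t1", "ddl2")
print(etcd)
```

The worker blocks only while its own done operation still exists, which matches the small efficiency cost mentioned above.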

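Option 2

DM-worker supports resending DDL info. If DM-worker waits for a while without receiving an exec/skip operation, it uploads the DDL info again to trigger DM-master to reprocess the lock. This requires DM-master to be able to handle duplicate DDL info, and the restart path must also handle the case where both an operation and an info exist.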
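The resend idea can be sketched as follows. This is a simplification under assumptions: `IdempotentMaster` and `send_with_retry` are hypothetical names, the "ignored" return value stands in for "no exec/skip operation arrived within the wait", and the `between` callback stands in for the waiting period.

```python
class IdempotentMaster:
    """Toy DM-master whose on_info tolerates duplicate infos (hypothetical model)."""

    def __init__(self, sources):
        self.sources = set(sources)
        self.lock = None

    def on_info(self, source, ddl):
        if self.lock is None:
            self.lock = {"ddl": ddl, "ready": set()}
        if self.lock["ddl"] != ddl:
            return "ignored"
        self.lock["ready"].add(source)          # adding to a set makes duplicates harmless
        return "synced" if self.lock["ready"] == self.sources else "waiting"

    def resolve(self):
        self.lock = None                        # the previous lock is finally removed


def send_with_retry(master, source, ddl, attempts, between=lambda: None):
    """Resend the info until the master accepts it; return the attempt count or None."""
    for attempt in range(1, attempts + 1):
        if master.on_info(source, ddl) != "ignored":
            return attempt
        between()                               # stand-in for "wait a while"
    return None


master = IdempotentMaster(["worker1", "worker2"])
master.on_info("worker1", "ddl1")               # the ddl1 lock still exists
# While worker1 waits between attempts, the master finishes the ddl1 lock.
print(send_with_retry(master, "worker1", "ddl2", attempts=3, between=master.resolve))
print(master.on_info("worker1", "ddl2"))        # a duplicate info does not change the state
```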

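Option 3

DM-master supports re-consuming DDL info that it previously discarded, using a cache data structure to handle it specially. As with option 2, the restart path must handle the case where both an operation and an info exist.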
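One possible shape for such a cache is a queue of dropped infos that is replayed when the lock is removed. This is a sketch under assumptions: `CachingMaster` and its methods are hypothetical, only one table is modelled, and the restart handling mentioned above is not covered.

```python
from collections import deque


class CachingMaster:
    """Toy DM-master that caches dropped infos and replays them (hypothetical model)."""

    def __init__(self, sources):
        self.sources = set(sources)
        self.lock = None
        self.dropped = deque()                  # infos that did not match the current lock

    def on_info(self, source, ddl):
        if self.lock is None:
            self.lock = {"ddl": ddl, "ready": set(), "done": set()}
        if self.lock["ddl"] != ddl:
            self.dropped.append((source, ddl))  # keep the info instead of losing it
            return
        self.lock["ready"].add(source)

    def on_done(self, source):
        self.lock["done"].add(source)
        if self.lock["done"] == self.sources:
            self.lock = None
            # Re-consume the cached infos now that the old lock is gone.
            pending, self.dropped = self.dropped, deque()
            for cached_source, cached_ddl in pending:
                self.on_info(cached_source, cached_ddl)


master = CachingMaster(["worker1", "worker2"])
master.on_info("worker1", "ddl1")
master.on_info("worker2", "ddl1")
master.on_done("worker1")
master.on_info("worker1", "ddl2")               # cached, not lost
master.on_done("worker2")                       # ddl1 lock removed, ddl2 info replayed
master.on_info("worker2", "ddl2")
print(master.lock)
```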

Discussion

After discussion, we decided to go with option 1 because it is simpler to maintain.

csuzhangxc added the type/bug label Apr 23, 2020

csuzhangxc mentioned this issue Apr 23, 2020

*: rollback schema in the tracker; fix save table checkpoint in optimistic mode #625

Merged

csuzhangxc added this to the v2.0.0 beta.2 milestone Apr 30, 2020

csuzhangxc self-assigned this Jun 1, 2020

csuzhangxc mentioned this issue Jun 8, 2020

*: fix new shard DDL handle when the previous one has not complete #725

Merged

csuzhangxc closed this as completed in #725 Jun 12, 2020

sre-bot added the severity/moderate label Aug 10, 2020
