This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

shardddl: fix resume conflict ddl. #739

Merged

Merged 4 commits into release-1.0 on Jun 16, 2020

Conversation

@GMHDBJD (Collaborator) commented Jun 15, 2020

Manually cherry-pick #722 and fix #740.

What problem does this PR solve?

What is changed and how it works?

reset shardmeta after start or resume task
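The essence of the change ("reset shardmeta after start or resume task") can be sketched as follows. This is an illustrative sketch only: the type and method names (`ShardMeta`, `Syncer`, `Resume`) are hypothetical placeholders, not DM's actual identifiers.

```go
package main

import "fmt"

// ShardMeta is a hypothetical stand-in for the per-source shard DDL state
// that DM tracks during an unfinished sharding DDL sequence.
type ShardMeta struct {
	pendingDDLs []string // DDLs recorded for an in-flight shard DDL sequence
}

// Syncer is a hypothetical stand-in for the binlog replication unit.
type Syncer struct {
	meta ShardMeta
}

// resetShardMeta drops stale shard DDL state so a restarted task rebuilds
// it from the checkpointed binlog position instead of conflicting with
// leftover state from the previous run.
func (s *Syncer) resetShardMeta() {
	s.meta = ShardMeta{}
}

// Resume restarts replication; resetting shard meta first is the gist of
// this PR's fix for the resume-conflict-DDL problem.
func (s *Syncer) Resume() {
	s.resetShardMeta()
	// ... continue syncing from the saved checkpoint ...
}

func main() {
	s := &Syncer{meta: ShardMeta{pendingDDLs: []string{"ALTER TABLE t ADD COLUMN c INT"}}}
	s.Resume()
	fmt.Println(len(s.meta.pendingDDLs)) // 0
}
```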

Tests

  • Unit test
  • Integration test

@GMHDBJD added the priority/normal, status/PTAL, and type/cherry-pick labels on Jun 15, 2020
@csuzhangxc (Member) left a comment

Why do you not cherry-pick the code about UnresumableErrCodes?

@@ -4,7 +4,6 @@ task-mode: all
is-sharding: true
meta-schema: "dm_meta"
remove-meta: false
enable-heartbeat: true
@csuzhangxc (Member):

Why do you remove this line? For the status checking?

@GMHDBJD (Collaborator, Author):

It throws the error `heartbeat config is different from previous used: serverID not equal` when restarting the task. 🤔

@csuzhangxc (Member):

Maybe our test case had a problem before? I ran it locally but got another error:

"msg": "[code=38008:class=dm-master:scope=internal:level=high] grpc request error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8262: connect: connection refused\"\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/pkg/terror/terror.go:267\ngithub.com/pingcap/dm/dm/master/workerrpc.callRPC\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/dm/master/workerrpc/rawgrpc.go:124\ngithub.com/pingcap/dm/dm/master/workerrpc.(*GRPCClient).SendRequest\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/dm/master/workerrpc/rawgrpc.go:64\ngithub.com/pingcap/dm/dm/master.(*Server).getStatusFromWorkers.func2\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/dm/master/server.go:1135\ngithub.com/pingcap/dm/dm/master.(*AgentPool).Emit\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/dm/master/agent_pool.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
[2020/06/15 21:27:04.989 +08:00] [INFO] [mode.go:99] ["change count"] [task=sequence_sharding] [unit="binlog replication"] ["previous count"=1] ["new count"=2]
[2020/06/15 21:27:04.989 +08:00] [INFO] [syncer.go:1836] ["save table checkpoint for source"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [source=`sharding_seq`.`t1`] ["start position"="(mysql-bin|000001.000001, 12499)"] ["end position"="(mysql-bin|000001.000001, 12617)"]
[2020/06/15 21:27:04.989 +08:00] [INFO] [syncer.go:1839] ["source shard group is not synced"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [source=`sharding_seq`.`t1`] ["start position"="(mysql-bin|000001.000001, 12499)"] ["end position"="(mysql-bin|000001.000001, 12617)"]
[2020/06/15 21:27:04.990 +08:00] [INFO] [syncer.go:1618] [task=sequence_sharding] [unit="binlog replication"] [event=query] [statement="alter table t2 drop column f"] [schema=sharding_seq] ["last position"="(mysql-bin|000001.000001, 12617)"] [position="(mysql-bin|000001.000001, 12800)"] ["gtid set"=NULL]
[2020/06/15 21:27:04.990 +08:00] [INFO] [syncer.go:1634] ["resolve sql"] [task=sequence_sharding] [unit="binlog replication"] [event=query] ["raw statement"="alter table t2 drop column f"] [statements="[\"ALTER TABLE `sharding_seq`.`t2` DROP COLUMN `f`\"]"] [schema=sharding_seq] ["last position"="(mysql-bin|000001.000001, 12800)"] [position="(mysql-bin|000001.000001, 12800)"] ["gtid set"=NULL]
[2020/06/15 21:27:04.991 +08:00] [INFO] [syncer.go:1721] ["prepare to handle ddls"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [ddls="[\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\"]"] ["raw statement"="alter table t2 drop column f"] [position="(mysql-bin|000001.000001, 12800)"]
[2020/06/15 21:27:04.991 +08:00] [INFO] [syncer.go:2157] ["flush all jobs"] [task=sequence_sharding] [unit="binlog replication"] ["global checkpoint"="(mysql-bin|000001.000001, 12434)(flushed (mysql-bin|000001.000001, 12434))"]
[2020/06/15 21:27:04.992 +08:00] [INFO] [syncer.go:875] ["flush checkpoints except for these tables"] [task=sequence_sharding] [unit="binlog replication"] [tables="[[\"sharding_seq\",\"t1\"],[\"sharding_seq\",\"t2\"]]"]
[2020/06/15 21:27:04.992 +08:00] [INFO] [syncer.go:878] ["prepare flush sqls"] [task=sequence_sharding] [unit="binlog replication"] ["shard meta sqls"="[]"] ["shard meta arguments"="[]"]
[2020/06/15 21:27:04.993 +08:00] [INFO] [syncer.go:887] ["flushed checkpoint"] [task=sequence_sharding] [unit="binlog replication"] [checkpoint="(mysql-bin|000001.000001, 12434)(flushed (mysql-bin|000001.000001, 12434))"]
[2020/06/15 21:27:04.993 +08:00] [INFO] [relay.go:113] ["current earliest active relay log"] [task=sequence_sharding] [unit="binlog replication"] ["active relay log"=661372f8-af0a-11ea-926b-0242ac120003.000001/mysql-bin.000001]
[2020/06/15 21:27:04.993 +08:00] [INFO] [syncer.go:1823] ["try to sync table in shard group"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [source=`sharding_seq`.`t2`] [ddls="[\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\"]"] ["raw statement"="alter table t2 drop column f"] [in-sharding=true] ["start position"="(mysql-bin|000001.000001, 12682)"] [is-synced=true] [unsynced=0]
[2020/06/15 21:27:04.993 +08:00] [INFO] [syncer.go:1836] ["save table checkpoint for source"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [source=`sharding_seq`.`t2`] ["start position"="(mysql-bin|000001.000001, 12682)"] ["end position"="(mysql-bin|000001.000001, 12800)"]
[2020/06/15 21:27:04.993 +08:00] [INFO] [syncer.go:1843] ["source shard group is synced"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [source=`sharding_seq`.`t2`] ["start position"="(mysql-bin|000001.000001, 12682)"] ["end position"="(mysql-bin|000001.000001, 12800)"]
[2020/06/15 21:27:04.993 +08:00] [INFO] [mode.go:99] ["change count"] [task=sequence_sharding] [unit="binlog replication"] ["previous count"=2] ["new count"=1]
[2020/06/15 21:27:06.894 +08:00] [INFO] [server.go:252] [request=QueryStatus] [payload=]
[2020/06/15 21:27:06.961 +08:00] [INFO] [printer.go:54] ["Welcome to dm-worker"] ["Release Version"=None] ["Git Commit Hash"=None] ["Git Branch"=None] ["UTC Build Time"=None] ["Go Version"=None]
[2020/06/15 21:27:06.961 +08:00] [INFO] [main.go:58] ["dm-worker config"="{\"log-level\":\"info\",\"log-file\":\"/tmp/dm_test/sequence_sharding/worker1/log/dm-worker.log\",\"log-rotate\":\"\",\"worker-addr\":\":8262\",\"enable-gtid\":false,\"auto-fix-gtid\":false,\"relay-dir\":\"/tmp/dm_test/sequence_sharding/worker1/relay_log\",\"meta-dir\":\"./dm_worker_meta\",\"server-id\":429595703,\"flavor\":\"mysql\",\"charset\":\"\",\"relay-binlog-name\":\"\",\"relay-binlog-gtid\":\"\",\"source-id\":\"mysql-replica-01\",\"from\":{\"host\":\"127.0.0.1\",\"port\":3306,\"user\":\"root\",\"max-allowed-packet\":67108864,\"session\":null},\"purge\":{\"interval\":3600,\"expires\":0,\"remain-space\":15},\"checker\":{\"check-enable\":true,\"backoff-rollback\":{\"duration\":\"5m0s\"},\"backoff-max\":{\"duration\":\"5m0s\"}},\"tracer\":{\"enable\":false,\"source\":\"mysql-replica-01\",\"tracer-addr\":\"\",\"batch-size\":20,\"checksum\":false},\"config-file\":\"/home/zhangxc/gopath/src/github.com/pingcap/dm/tests/sequence_sharding/conf/dm-worker1.toml\"}"]
[2020/06/15 21:27:06.963 +08:00] [ERROR] [main.go:78] ["fail to start dm-worker"] [error="[code=40048:class=dm-worker:scope=internal:level=high] start server: listen tcp :8262: bind: address already in use"] [errorVerbose="[code=40048:class=dm-worker:scope=internal:level=high] start server: listen tcp :8262: bind: address already in use\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/pkg/terror/terror.go:267\ngithub.com/pingcap/dm/dm/worker.(*Server).Start\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/dm/worker/server.go:75\ngithub.com/pingcap/dm/cmd/dm-worker.main\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/cmd/dm-worker/main.go:76\ngithub.com/pingcap/dm/cmd/dm-worker.TestRunMain.func3\n\t/home/zhangxc/gopath/src/github.com/pingcap/dm/cmd/dm-worker/main_test.go:64\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/06/15 21:27:06.964 +08:00] [INFO] [main.go:81] ["dm-worker exit"]
[2020/06/15 21:27:10.807 +08:00] [INFO] [server.go:284] [request=FetchDDLInfo]
[2020/06/15 21:27:10.808 +08:00] [INFO] [worker.go:435] ["save DDLInfo into subTasks"] [component="worker controller"]
[2020/06/15 21:27:10.808 +08:00] [INFO] [server.go:292] [request=FetchDDLInfo] ["ddl info"="task:\"sequence_sharding\" schema:\"sharding_target2\" table:\"t_target\" DDLs:\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\" "]
[2020/06/15 21:27:10.812 +08:00] [INFO] [server.go:309] ["receive DDLLockInfo"] [request=FetchDDLInfo] ["ddl lock info"="task:\"sequence_sharding\" ID:\"sequence_sharding-`sharding_target2`.`t_target`\" "]
[2020/06/15 21:27:10.851 +08:00] [INFO] [server.go:324] [request=ExecuteDDL] [payload="task:\"sequence_sharding\" lockID:\"sequence_sharding-`sharding_target2`.`t_target`\" traceGID:\"resolveDDLLock.7\" DDLs:\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\" "]
[2020/06/15 21:27:10.853 +08:00] [INFO] [syncer.go:1923] ["ignore DDL job"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [source=`sharding_seq`.`t2`] [ddls="[\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\"]"] ["raw statement"="alter table t2 drop column f"] ["start position"="(mysql-bin|000001.000001, 12682)"] ["end position"="(mysql-bin|000001.000001, 12800)"] [request="{\"task\":\"sequence_sharding\",\"lockID\":\"sequence_sharding-`sharding_target2`.`t_target`\",\"traceGID\":\"resolveDDLLock.7\",\"DDLs\":[\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\"]}"]
[2020/06/15 21:27:10.854 +08:00] [INFO] [syncer.go:1927] ["start to handle ddls in shard mode"] [task=sequence_sharding] [unit="binlog replication"] [event=query] [ddls="[\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\"]"] ["raw statement"="alter table t2 drop column f"] ["start position"="(mysql-bin|000001.000001, 12682)"] ["end position"="(mysql-bin|000001.000001, 12800)"]
[2020/06/15 21:27:10.855 +08:00] [INFO] [syncer.go:909] ["ignore sharding DDLs"] [task=sequence_sharding] [unit="binlog replication"] [ddls="[\"ALTER TABLE `sharding_target2`.`t_target` DROP COLUMN `f`\"]"]
[2020/06/15 21:27:10.855 +08:00] [INFO] [worker.go:475] ["ExecuteDDL remove cacheDDLInfo"] [component="worker controller"]
[2020/06/15 21:27:10.855 +08:00] [WARN] [syncer.go:799] ["exit triggered"] [task=sequence_sharding] [unit="binlog replication"] [failpoint=ExitAfterDDLBeforeFlush]
[2020/06/15 21:27:10.856 +08:00] [INFO] [main_test.go:56] ["os exits"] ["exit code"=1]

It seems the DM-worker is trying to start before the previous one exited.

@csuzhangxc (Member):

The above error may be a problem in my local env; I'll debug it tomorrow.

@GMHDBJD (Collaborator, Author):

I bet you have no Python in your env. 😂

@csuzhangxc (Member) commented Jun 15, 2020:

You are right... I'm on another machine. We may need to check that Python exists when running test cases.

@GMHDBJD (Collaborator, Author):

We start-task, restart the worker (its server-id changes), and then restart the task; the task's heartbeat server-id is updated to the new worker's server-id, so the heartbeat config differs from the saved one.
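The failure sequence described above can be illustrated with a minimal sketch. The struct and function names below are hypothetical, not DM's actual code; only the error message text comes from the conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// heartbeatConfig is a hypothetical stand-in for the heartbeat settings a
// task records when it first starts.
type heartbeatConfig struct {
	serverID uint32
}

// checkConfig mimics comparing the previously saved heartbeat config with
// the one rebuilt after a restart: a restarted worker comes back with a
// new server-id, so the comparison fails with the error seen above.
func checkConfig(saved, current heartbeatConfig) error {
	if saved.serverID != current.serverID {
		return errors.New("heartbeat config is different from previous used: serverID not equal")
	}
	return nil
}

func main() {
	saved := heartbeatConfig{serverID: 429595703}   // recorded at first start-task
	current := heartbeatConfig{serverID: 429595704} // worker restarted with a new server-id
	fmt.Println(checkConfig(saved, current))
}
```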

@csuzhangxc (Member) commented Jun 15, 2020:

Is it because we saved the previous task's config (and the server_id) in dm_worker_meta? If so, we may need to fix it.

@csuzhangxc (Member):

I opened issue #740 and will try to fix it in v1.0.6.

@csuzhangxc (Member):

I added a commit (e477779) to this PR; @GMHDBJD PTAL.

@GMHDBJD (Collaborator, Author) commented Jun 15, 2020

> Why do you not cherry-pick the code about UnresumableErrCodes?

The original PR only marks the optimistic DDL conflict as unresumable. Maybe I need to add both?

@csuzhangxc (Member)

> Why do you not cherry-pick the code about UnresumableErrCodes?
>
> The original PR only marks the optimistic DDL conflict as unresumable. Maybe I need to add both?

I think we should make the DDL conflict unresumable for both optimistic and pessimistic modes.
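The idea of an unresumable-error-code set can be sketched as below. This is illustrative only: the error codes and names are placeholders, not DM's real terror codes.

```go
package main

import "fmt"

// Placeholder error codes; DM's real codes differ. The point is that both
// shard DDL conflict variants belong in the unresumable set.
const (
	codeOptimisticDDLConflict  = 1001 // placeholder value
	codePessimisticDDLConflict = 1002 // placeholder value
)

// unresumableErrCodes holds error codes for which a failed task must not
// be auto-resumed, since retrying cannot resolve the conflict.
var unresumableErrCodes = map[int]struct{}{
	codeOptimisticDDLConflict:  {},
	codePessimisticDDLConflict: {},
}

// canAutoResume reports whether a task that failed with this error code
// may be resumed automatically; shard DDL conflicts need manual handling.
func canAutoResume(code int) bool {
	_, blocked := unresumableErrCodes[code]
	return !blocked
}

func main() {
	fmt.Println(canAutoResume(codeOptimisticDDLConflict)) // false
	fmt.Println(canAutoResume(9999))                      // true
}
```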

@codecov bot commented Jun 15, 2020

Codecov Report

Merging #739 into release-1.0 will increase coverage by 0.0226%.
The diff coverage is 43.7500%.

@@                 Coverage Diff                 @@
##           release-1.0       #739        +/-   ##
===================================================
+ Coverage      57.8399%   57.8625%   +0.0226%     
===================================================
  Files              166        166                
  Lines            16990      16992         +2     
===================================================
+ Hits              9827       9832         +5     
+ Misses            6207       6203         -4     
- Partials           956        957         +1     

@GMHDBJD added the status/WIP and status/PTAL labels and removed the status/PTAL and status/WIP labels on Jun 15, 2020
@csuzhangxc (Member) left a comment

LGTM

@csuzhangxc added the status/LGT1 label and removed the status/PTAL label on Jun 16, 2020
@csuzhangxc csuzhangxc requested a review from WangXiangUSTC June 16, 2020 02:52
@csuzhangxc (Member)

@WangXiangUSTC PTAL

@GMHDBJD PTAL too because I also updated this PR.

@csuzhangxc (Member)

> Why do you not cherry-pick the code about UnresumableErrCodes?
>
> The original PR only marks the optimistic DDL conflict as unresumable. Maybe I need to add both?

I opened another PR, #741, for the pessimistic shard DDL conflict.

@GMHDBJD (Collaborator, Author) commented Jun 16, 2020

PTAL

@WangXiangUSTC (Contributor) left a comment

LGTM

@csuzhangxc csuzhangxc added this to the v1.0.6 milestone Jun 16, 2020
@csuzhangxc added the status/LGT2 label and removed the status/LGT1 label on Jun 16, 2020
@csuzhangxc csuzhangxc merged commit 32da300 into pingcap:release-1.0 Jun 16, 2020