poor: resume task if sync unit exits with `invalid connection` error #66

IANTHEREAL · 2019-03-05T09:26:31Z

What problem does this PR solve?

Try to resume the task if the sync unit exits with an `invalid connection` error.

What is changed and how it works?

This is a poor and very rough retry feature. The main reason is that the concurrency control of the subtask module is very confusing and needs to be optimized. After improving its state transitions and concurrency control, I will optimize the implementation of the retry feature.

Check List

Tests

Manual test (add detailed scripts or steps below)
- set up a DM cluster
- start a task
- interfere with the network so that connections time out
- check whether dm-worker tries to resume the task

amyangfei · 2019-03-06T10:02:50Z

dm/worker/subtask.go

 		if len(result.Errors) == 0 {
 			if result.IsCanceled {
 				stage = pb.Stage_Stopped // canceled by user
 			} else {
 				stage = pb.Stage_Finished // process finished with no error
 			}
 		} else {
+			/* TODO
+			it's a poor and very rough retry feature, the main reason is that
+			the concurrency control of the sub task module is very confusing and needs to be optimized.


Does the concurrency control optimization cover the following scenario?
All of DM-worker's connections to the downstream TiDB are suddenly reset by the downstream (caused by a TiDB restart, a network cut, etc.). We must then recover each connection in syncer, and if syncer's worker-count config is N, we will wait (N + 1) * 10 seconds at most.

First of all, it is concurrent, so I don't understand why we would need to wait (N + 1) * 10 seconds at most.
The refactoring is more about restructuring, making the modules and functions more reasonable. Of course, we can also increase the backoff property.

I mean that for each subtask, the (N + 1) * 10 seconds may not be continuous, but each DB connection must be recovered. So in the worst case the waits sum up to (N + 1) * 10 seconds.

But it's really concurrent, right?

It's the logic of the unit; we can refine it while refactoring the sync unit.

We can refactor the units one by one and then the outer state transitions later.

csuzhangxc · 2019-03-07T07:02:43Z

dm/worker/subtask.go

+	switch current.Type() {
+	case pb.UnitType_Sync:
+		for _, err := range errors {
+			if strings.Contains(err.Msg, "invalid connection") {


Should we close all DB connections and re-open them in the sync unit?
Some users reported that they needed to run resume-task multiple times when "invalid connection" occurred.
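For readers without the full diff, the substring check shown above can be sketched as a standalone helper. This is a minimal sketch under assumptions, not the code merged in this PR: `ProcessError` is a hypothetical stand-in for the error type in the diff (only the `Msg` field is taken from it), and treating a single matching error as enough to retry is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ProcessError is a hypothetical stand-in for the error type in the diff;
// only the Msg field mirrors what the diff shows.
type ProcessError struct {
	Msg string
}

// retryableFragments lists message substrings treated as transient.
// Only "invalid connection" comes from the PR.
var retryableFragments = []string{"invalid connection"}

// isRetryable reports whether any error message contains a retryable fragment.
func isRetryable(errs []*ProcessError) bool {
	for _, err := range errs {
		if err == nil {
			continue
		}
		for _, frag := range retryableFragments {
			if strings.Contains(err.Msg, frag) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(isRetryable([]*ProcessError{{Msg: "driver: invalid connection"}})) // true
	fmt.Println(isRetryable([]*ProcessError{{Msg: "duplicate entry"}}))            // false
}
```

Matching on message text is fragile, which fits the "poor and very rough" label in the description; it does not by itself re-open the DB connections asked about in the comment above.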

csuzhangxc · 2019-03-07T08:04:42Z

LGTM

IANTHEREAL added 4 commits March 5, 2019 15:36

*: add poor retry

103b912

refine code

7eb04ce

*: correct code

12bd313

*: leave a todo comment

c6ae341

IANTHEREAL added status/PTAL This PR is ready for review. Add this label back after committing new changes type/enhancement Performance improvement or refactoring labels Mar 5, 2019

amyangfei reviewed Mar 6, 2019

Update dm/worker/subtask.go

4754c9c

Co-Authored-By: GregoryIan <[email protected]>

csuzhangxc reviewed Mar 7, 2019

csuzhangxc added status/LGT1 One reviewer already commented LGTM and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Mar 7, 2019

amyangfei approved these changes Mar 7, 2019

amyangfei added status/LGT2 Two reviewers already commented LGTM, ready for merge and removed status/LGT1 One reviewer already commented LGTM labels Mar 7, 2019

IANTHEREAL merged commit 1120003 into pingcap:master Mar 7, 2019

IANTHEREAL deleted the ian/retry branch March 7, 2019 10:02

csuzhangxc mentioned this pull request Mar 14, 2019

task paused frequently because of "invalid connection" #46

Closed

lichunzhu pushed a commit to lichunzhu/dm that referenced this pull request Apr 6, 2020

poor: resume task if sync unit exits with invalid connection error (p…

f80e6fb

…ingcap#66)
