-
Notifications
You must be signed in to change notification settings - Fork 188
poor: resume task if sync unit exits with invalid connection
error
#66
Conversation
if len(result.Errors) == 0 { | ||
if result.IsCanceled { | ||
stage = pb.Stage_Stopped // canceled by user | ||
} else { | ||
stage = pb.Stage_Finished // process finished with no error | ||
} | ||
} else { | ||
/* TODO | ||
it's a poor and very rough retry feature, the main reason is that | ||
the concurrency control of the sub task module is very confusing and needs to be optimized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the concurrency control optimization
including the following scenario?
all of DM-worker's connections to downstream TiDB suddenly reset by downstream(caused by TiDB restart, network cutting etc.), then we must recover each connection in syncer and if the worker-count config of syncer is N, we will wait (N + 1) * 10 seconds at most.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first of all it is concurrent, I don't understand why we need to wait (N + 1) * 10 seconds at most.
For the refactoring, more is the reconstruction of the the structure, making the moudle and funcion more reasone, of course we can increase the backoff property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean for each subtask, (N + 1) * 10 seconds
may be not continuous, but each DB connection must be recovered. So the worst case is (N + 1) * 10 seconds
sums up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it's real concurrent, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the logic of unit, we can refine while refactoring sync unit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can refactor units one by one and then the outer state transition later.
Co-Authored-By: GregoryIan <[email protected]>
switch current.Type() { | ||
case pb.UnitType_Sync: | ||
for _, err := range errors { | ||
if strings.Contains(err.Msg, "invalid connection") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we need to close all DB connections and re-open them in sync unit?
some users reported that they need to resume-task
multi times when "invalid connection" occurred.
LGTM |
What problem does this PR solve?
try to resume task if sync unit exits with
invalid connection
errorWhat is changed and how it works?
it's a poor and very rough retry feature, the main reason is that the concurrency control of the sub task module is very confusing and needs to be optimized. After improve its state transition and concurrency control, I will optimize the implementation of retry feature.
Check List
Tests