
BR should tolerate a small amount of tikv crash #980

Closed
shuijing198799 opened this issue Apr 6, 2021 · 5 comments · Fixed by #997
Labels: severity/critical, type/bug

Comments

@shuijing198799

Feature Request

Describe the problem related to your feature request:

In a cluster with 9 TiKV nodes, when one TiKV crashes and cannot come back up, BR stops working and reports the following log:

error: cluster tidb1373933076652000599/test-330031, wait pipe message failed, errMsg [2021/04/01 03:29:14.541 +00:00] [ERROR] [push.go:54] ["fail to connect store"] [StoreID=70467] [stack="github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup\n\tgithub.com/pingcap/br@/pkg/backup/push.go:54\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:524\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\tgithub.com/pingcap/br@/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2021/04/01 03:29:14.541 +00:00] [ERROR] [push.go:54] ["fail to connect store"] [StoreID=5417812] [stack="github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup\n\tgithub.com/pingcap/br@/pkg/backup/push.go:54\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:524\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\tgithub.com/pingcap/br@/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57"]

Describe the feature you'd like:

When a small number of TiKV nodes have problems but the cluster can still work normally, should BR continue to work instead of failing directly?

@shuijing198799 added the type/feature-request label on Apr 6, 2021
@shuijing198799 changed the title from "br should tolerate a small amount of tikv crash" to "BR should tolerate a small amount of tikv crash" on Apr 6, 2021
@SunRunAway
Contributor

Should we treat it as a bug?
@overvenus

@overvenus added the type/bug label and removed the type/feature-request label on Apr 12, 2021
@YuJuncen
Collaborator

YuJuncen commented Apr 12, 2021

Analysis: During the BackupRange procedure, a ‘snapshot’ of the store status is used. Briefly, the procedure is (a sketch in Go follows the list):

  1. Get all store info via conn.GetAllTiKVStores.
  2. For each store,
    • if the store is disconnected, skip it.
    • if the store is up, try to send a backup request to it. Ignore all retryable errors.
  3. For each range not backed up yet,
    • find the leader of the region that contains the start key of this range. (This seems buggy because the range can cross regions; a TODO is left there.)
    • send a backup request to the leader.
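
A minimal, self-contained sketch of that flow, using hypothetical Store/Range types and a pushBackup stand-in for the push-down request in pkg/backup/push.go (not BR's real API):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for illustration; not BR's real types or API.
type Store struct {
	ID          uint64
	State       string // "Up" or "Disconnected" in this sketch
	Unreachable bool   // the connection fails when we actually push to it
}

type Range struct{ StartKey, EndKey string }

// pushBackup stands in for the push-down backup request (pkg/backup/push.go).
func pushBackup(s Store, ranges []Range) error {
	if s.Unreachable {
		return errors.New("fail to connect store")
	}
	return nil
}

// backupRanges mirrors steps 1-3 above for a pre-fetched store list.
func backupRanges(stores []Store, ranges []Range) error {
	for _, s := range stores {
		if s.State == "Disconnected" {
			continue // step 2: stores already marked disconnected are skipped
		}
		// Step 2: push to every "Up" store. Today a connection failure here
		// is returned as-is and aborts the whole backup, even though the
		// fine-grained pass in step 3 could cover the missing ranges.
		if err := pushBackup(s, ranges); err != nil {
			return fmt.Errorf("store %d: %w", s.ID, err)
		}
	}
	// Step 3: fine-grained retry of ranges not yet backed up (omitted).
	return nil
}

func main() {
	stores := []Store{
		{ID: 1, State: "Up"},
		{ID: 2, State: "Up", Unreachable: true}, // crashed after the store list snapshot
		{ID: 3, State: "Up"},
	}
	fmt.Println(backupRanges(stores, nil)) // store 2: fail to connect store
}
```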

During step 2, any store that cannot be connected to terminates the whole backup procedure: "fail to connect store" isn't treated as a retryable error.

Solution: In theory, we can ignore all ‘fail to connect store’ errors during step 2, because those ranges can eventually be retried in step 3.
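
Continuing the hypothetical sketch above (same Store, Range, and pushBackup stand-ins), the change would look roughly like this:

```go
// backupRangesTolerant sketches the proposed fix: treat "fail to connect
// store" as skippable in step 2, because the ranges the skipped store would
// have covered are retried by step 3's fine-grained pass anyway.
func backupRangesTolerant(stores []Store, ranges []Range) error {
	for _, s := range stores {
		if s.State == "Disconnected" {
			continue
		}
		if err := pushBackup(s, ranges); err != nil {
			// Log and move on instead of aborting the whole backup.
			fmt.Printf("skipping store %d: %v\n", s.ID, err)
			continue
		}
	}
	// Step 3: fine-grained retry of ranges not yet backed up (omitted).
	return nil
}
```

Whether every range can then be backed up still depends on the surviving replicas being able to serve a leader, which is the limitation @cyliu0 notes below.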

@cyliu0

cyliu0 commented Apr 16, 2021

To be clear, after the above PR, the current implementation tolerates a single TiKV node being down in a cluster with 3 replicas. The TiKV design cannot tolerate 2 replicas of the same region being down.

@SunRunAway
Contributor

Could you cherry-pick the fix for this bug to 4.0.X?

@glorv
Collaborator

glorv commented Apr 26, 2021

@YuJuncen Please add this issue and PR to the v4.0 bug triage doc.
