
BR should tolerate a small amount of tikv crash #980

Closed
shuijing198799 opened this issue Apr 6, 2021 · 5 comments · Fixed by #997
Labels: severity/critical, type/bug

Comments

@shuijing198799

Feature Request

Describe the problem related to your feature request:

In a cluster with 9 TiKV nodes, when one TiKV crashes and cannot come back up, BR stops working and reports the following log:

error: cluster tidb1373933076652000599/test-330031, wait pipe message failed, errMsg [2021/04/01 03:29:14.541 +00:00] [ERROR] [push.go:54] ["fail to connect store"] [StoreID=70467] [stack="github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup\n\tgithub.com/pingcap/br@/pkg/backup/push.go:54\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:524\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\tgithub.com/pingcap/br@/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57"]
[2021/04/01 03:29:14.541 +00:00] [ERROR] [push.go:54] ["fail to connect store"] [StoreID=5417812] [stack="github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup\n\tgithub.com/pingcap/br@/pkg/backup/push.go:54\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:524\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\tgithub.com/pingcap/br@/pkg/backup/client.go:459\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57"]

Describe the feature you'd like:

When a small number of TiKV nodes have problems but the cluster can still work normally, should BR continue to work instead of failing directly?

@shuijing198799 added the type/feature-request label on Apr 6, 2021
@shuijing198799 changed the title from "br should tolerate a small amount of tikv crash" to "BR should tolerate a small amount of tikv crash" on Apr 6, 2021
@SunRunAway
Contributor

Should we treat it as a bug?
@overvenus

@overvenus added the type/bug label and removed the type/feature-request label on Apr 12, 2021
@YuJuncen
Collaborator

YuJuncen commented Apr 12, 2021

Analysis: During the BackupRange procedure, a ‘snapshot’ of the store status is used. Briefly, the procedure is (a sketch in Go follows the list):

  1. Get all store info via conn.GetAllTiKVStores.
  2. For each store,
    • if the store is disconnected, skip it.
    • if the store is up, try to send a backup request to it. Ignore all retryable errors.
  3. For each range not backed up yet,
    • find the leader of the region that contains the start key of this range. (This seems buggy because the range can cross regions; a TODO is left there.)
    • send a backup request to the leader.
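
A minimal, self-contained sketch of that flow, using hypothetical Store/Range types and a pushBackup stand-in for the push-down request in pkg/backup/push.go (not BR's real API):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for illustration; not BR's real types or API.
type Store struct {
	ID          uint64
	State       string // "Up" or "Disconnected" in this sketch
	Unreachable bool   // the connection fails when we actually push to it
}

type Range struct{ StartKey, EndKey string }

// pushBackup stands in for the push-down backup request (pkg/backup/push.go).
func pushBackup(s Store, ranges []Range) error {
	if s.Unreachable {
		return errors.New("fail to connect store")
	}
	return nil
}

// backupRanges mirrors steps 1-3 above for a pre-fetched store list.
func backupRanges(stores []Store, ranges []Range) error {
	for _, s := range stores {
		if s.State == "Disconnected" {
			continue // step 2: stores already marked disconnected are skipped
		}
		// Step 2: push to every "Up" store. Today a connection failure here
		// is returned as-is and aborts the whole backup, even though the
		// fine-grained pass in step 3 could cover the missing ranges.
		if err := pushBackup(s, ranges); err != nil {
			return fmt.Errorf("store %d: %w", s.ID, err)
		}
	}
	// Step 3: fine-grained retry of ranges not yet backed up (omitted).
	return nil
}

func main() {
	stores := []Store{
		{ID: 1, State: "Up"},
		{ID: 2, State: "Up", Unreachable: true}, // crashed after the store list snapshot
		{ID: 3, State: "Up"},
	}
	fmt.Println(backupRanges(stores, nil)) // store 2: fail to connect store
}
```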

During step 2, any store that cannot be connected to terminates the whole backup procedure: "fail to connect store" isn't treated as a retryable error.

Solution: In theory, we can ignore all ‘fail to connect store’ errors during step 2, because those ranges can eventually be retried in step 3.
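
Continuing the hypothetical sketch above (same Store, Range, and pushBackup stand-ins), the change would look roughly like this:

```go
// backupRangesTolerant sketches the proposed fix: treat "fail to connect
// store" as skippable in step 2, because the ranges the skipped store would
// have covered are retried by step 3's fine-grained pass anyway.
func backupRangesTolerant(stores []Store, ranges []Range) error {
	for _, s := range stores {
		if s.State == "Disconnected" {
			continue
		}
		if err := pushBackup(s, ranges); err != nil {
			// Log and move on instead of aborting the whole backup.
			fmt.Printf("skipping store %d: %v\n", s.ID, err)
			continue
		}
	}
	// Step 3: fine-grained retry of ranges not yet backed up (omitted).
	return nil
}
```

Whether every range can then be backed up still depends on the surviving replicas being able to serve a leader, which is the limitation @cyliu0 notes below.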

@cyliu0

cyliu0 commented Apr 16, 2021

To be clear, after the above PR, the current implementation tolerates a single TiKV node being down in a cluster with 3 replicas. The TiKV design cannot tolerate 2 replicas of the same region being down.

@SunRunAway
Contributor

Could you cherry-pick the fix for this bug to 4.0.X?

@glorv
Collaborator

glorv commented Apr 26, 2021

@YuJuncen Please add this issue and PR to the v4.0 bug triage doc.
