Too many orphan peers cannot be removed #6573

Closed
nolouch opened this issue Jun 8, 2023 · 0 comments · Fixed by #6574
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. severity/critical type/bug The issue is confirmed as a bug.

nolouch commented Jun 8, 2023

Bug Report

What is your problem?

I see a region that has many learners (3 voters and 13 learners):

{
  "id": 353747615,
  "start_key": "7480000000000000FFC85F698000000000FF000003016D656469FF6173746FFF726568FF6F7573652EFF636FFF6D2E61750000FD01FF5330363739316633FFFF62323364363462FF64FF633837656235FF3737FF6238333930FF396561FF36000000FF00000000F8000000FC",
  "end_key": "7480000000000000FFC85F698000000000FF000003016D656469FF6173746FFF726568FF6F7573652EFF636FFF6D2E61750000FD01FF8537323965353338FFFF65626131363466FF34FF323966393161FF6339FF3635353364FF626138FF33000000FF00000000F8000000FC",
  "epoch": {
    "conf_ver": 368,
    "version": 32492
  },
  "peers": [
    {
      "id": 353747616,
      "store_id": 350571036,
      "role_name": "Voter"
    },
    {
      "id": 353747617,
      "store_id": 350571030,
      "role_name": "Voter"
    },
    {
      "id": 353747618,
      "store_id": 350571032,
      "role_name": "Voter"
    },
    {
      "id": 391098694,
      "store_id": 350571034,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 433406127,
      "store_id": 350571134,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 522453028,
      "store_id": 350571031,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 544524983,
      "store_id": 350571762,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 556827866,
      "store_id": 111263,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 573463879,
      "store_id": 350614150,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 588797740,
      "store_id": 350571033,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 603872787,
      "store_id": 350571029,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 613045451,
      "store_id": 350571028,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 641149356,
      "store_id": 297948,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 647251283,
      "store_id": 111264,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 649352674,
      "store_id": 350571035,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 652969200,
      "store_id": 355469333,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 353747616,
    "store_id": 350571036,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 52959,
      "peer": {
        "id": 353747617,
        "store_id": 350571030,
        "role_name": "Voter"
      }
    }
  ],
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 95,
  "approximate_keys": 762600
}
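
A minimal sketch for summarizing such a region by role, assuming the JSON above has been saved or fetched as text. The helper name `count_roles` and the abbreviated `sample` are illustrative only and are not part of PD; the sample keeps the three voters and the first two learners from the region above.

```python
import json
from collections import Counter


def count_roles(region: dict) -> Counter:
    """Count the peers of a region object by their role_name field."""
    return Counter(
        peer.get("role_name", "Unknown") for peer in region.get("peers", [])
    )


# Abbreviated copy of the region above: 3 voters and the first 2 learners.
sample = json.loads("""
{
  "id": 353747615,
  "peers": [
    {"id": 353747616, "store_id": 350571036, "role_name": "Voter"},
    {"id": 353747617, "store_id": 350571030, "role_name": "Voter"},
    {"id": 353747618, "store_id": 350571032, "role_name": "Voter"},
    {"id": 391098694, "store_id": 350571034, "role": 1,
     "role_name": "Learner", "is_learner": true},
    {"id": 433406127, "store_id": 350571134, "role": 1,
     "role_name": "Learner", "is_learner": true}
  ]
}
""")

roles = count_roles(sample)
print(dict(roles))
```

Run against the full region above, the same function would report 3 voters and 13 learners.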

It is caused by two things:

  • replace-rule-down-peer timeout: the logs contain many entries like the ones below, and every operator timed out on step 0 (add learner).
./pd-2023-06-03T13-33-30.129.log:{"level":"INFO","time":"2023/06/03 11:46:08.170 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m3.047234387s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571762]} (kind:replica,region, region:353747615(32492, 349), createAt:2023-06-03 11:29:05.117760266 +0000 UTC m=+2046279.324124606, startAt:2023-06-03 11:29:05.123047381 +0000 UTC m=+2046279.329411721, currentStep:0, size:95, steps:[add learner peer 453738139 on store 350571762, use joint consensus, promote learner peer 453738139 on store 350571762 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 453738139 on store 350571762 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T13-33-30.129.log:{"level":"INFO","time":"2023/06/03 12:25:36.574 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m3.243330583s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571762]} (kind:replica,region, region:353747615(32492, 351), createAt:2023-06-03 12:08:33.327519197 +0000 UTC m=+2048647.533883538, startAt:2023-06-03 12:08:33.330778266 +0000 UTC m=+2048647.537142606, currentStep:0, size:95, steps:[add learner peer 480407187 on store 350571762, use joint consensus, promote learner peer 480407187 on store 350571762 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 480407187 on store 350571762 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T15-53-31.123.log:{"level":"INFO","time":"2023/06/03 13:36:04.587 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m3.98191797s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571031]} (kind:replica,region, region:353747615(32492, 353), createAt:2023-06-03 13:19:00.605264956 +0000 UTC m=+2052874.811629296, startAt:2023-06-03 13:19:00.605430889 +0000 UTC m=+2052874.811795230, currentStep:0, size:95, steps:[add learner peer 522453028 on store 350571031, use joint consensus, promote learner peer 522453028 on store 350571031 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 522453028 on store 350571031 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T15-53-31.123.log:{"level":"INFO","time":"2023/06/03 14:19:56.074 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m1.988264062s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571762]} (kind:replica,region, region:353747615(32492, 354), createAt:2023-06-03 14:02:54.086604586 +0000 UTC m=+2055508.292968988, startAt:2023-06-03 14:02:54.086651416 +0000 UTC m=+2055508.293015757, currentStep:0, size:95, steps:[add learner peer 544524983 on store 350571762, use joint consensus, promote learner peer 544524983 on store 350571762 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 544524983 on store 350571762 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T15-53-31.123.log:{"level":"INFO","time":"2023/06/03 14:46:00.079 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m4.49687753s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [111263]} (kind:replica,region, region:353747615(32492, 355), createAt:2023-06-03 14:28:55.582456545 +0000 UTC m=+2057069.788820953, startAt:2023-06-03 14:28:55.582911436 +0000 UTC m=+2057069.789275777, currentStep:0, size:95, steps:[add learner peer 556827866 on store 111263, use joint consensus, promote learner peer 556827866 on store 111263 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 556827866 on store 111263 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T15-53-31.123.log:{"level":"INFO","time":"2023/06/03 15:21:44.601 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m0.020277218s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350614150]} (kind:replica,region, region:353747615(32492, 356), createAt:2023-06-03 15:04:44.580993411 +0000 UTC m=+2059218.787357753, startAt:2023-06-03 15:04:44.581352103 +0000 UTC m=+2059218.787716443, currentStep:0, size:95, steps:[add learner peer 573463879 on store 350614150, use joint consensus, promote learner peer 573463879 on store 350614150 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 573463879 on store 350614150 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T18-22-35.616.log:{"level":"INFO","time":"2023/06/03 16:04:34.593 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m1.974285645s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571033]} (kind:replica,region, region:353747615(32492, 357), createAt:2023-06-03 15:47:32.619204107 +0000 UTC m=+2061786.825568447, startAt:2023-06-03 15:47:32.619700475 +0000 UTC m=+2061786.826064819, currentStep:0, size:95, steps:[add learner peer 588797740 on store 350571033, use joint consensus, promote learner peer 588797740 on store 350571033 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 588797740 on store 350571033 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T18-22-35.616.log:{"level":"INFO","time":"2023/06/03 16:48:14.576 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m4.469064112s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571029]} (kind:replica,region, region:353747615(32492, 358), createAt:2023-06-03 16:31:10.107144037 +0000 UTC m=+2064404.313508378, startAt:2023-06-03 16:31:10.107644245 +0000 UTC m=+2064404.314008585, currentStep:0, size:95, steps:[add learner peer 603872787 on store 350571029, use joint consensus, promote learner peer 603872787 on store 350571029 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 603872787 on store 350571029 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T18-22-35.616.log:{"level":"INFO","time":"2023/06/03 17:14:51.086 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m3.004712149s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571028]} (kind:replica,region, region:353747615(32492, 359), createAt:2023-06-03 16:57:48.081489116 +0000 UTC m=+2066002.287853456, startAt:2023-06-03 16:57:48.082262125 +0000 UTC m=+2066002.288626465, currentStep:0, size:95, steps:[add learner peer 613045451 on store 350571028, use joint consensus, promote learner peer 613045451 on store 350571028 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 613045451 on store 350571028 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T18-22-35.616.log:{"level":"INFO","time":"2023/06/03 17:42:22.115 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m5.034091194s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [111264]} (kind:replica,region, region:353747615(32492, 360), createAt:2023-06-03 17:25:17.080564494 +0000 UTC m=+2067651.286928850, startAt:2023-06-03 17:25:17.080974622 +0000 UTC m=+2067651.287338965, currentStep:0, size:95, steps:[add learner peer 621603839 on store 111264, use joint consensus, promote learner peer 621603839 on store 111264 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 621603839 on store 111264 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T18-22-35.616.log:{"level":"INFO","time":"2023/06/03 18:21:58.585 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m0.004244195s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [355469333]} (kind:replica,region, region:353747615(32492, 362), createAt:2023-06-03 18:04:58.580695787 +0000 UTC m=+2070032.787060127, startAt:2023-06-03 18:04:58.581184059 +0000 UTC m=+2070032.787548399, currentStep:0, size:95, steps:[add learner peer 633222196 on store 355469333, use joint consensus, promote learner peer 633222196 on store 355469333 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 633222196 on store 355469333 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T20-56-41.085.log:{"level":"INFO","time":"2023/06/03 18:57:55.107 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m4.03169867s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [297948]} (kind:replica,region, region:353747615(32492, 364), createAt:2023-06-03 18:40:51.07547951 +0000 UTC m=+2072185.281843839, startAt:2023-06-03 18:40:51.075713517 +0000 UTC m=+2072185.282077857, currentStep:0, size:95, steps:[add learner peer 641149356 on store 297948, use joint consensus, promote learner peer 641149356 on store 297948 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 641149356 on store 297948 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T20-56-41.085.log:{"level":"INFO","time":"2023/06/03 19:26:51.969 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m0.855072912s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [111264]} (kind:replica,region, region:353747615(32492, 365), createAt:2023-06-03 19:09:51.114284394 +0000 UTC m=+2073925.320648722, startAt:2023-06-03 19:09:51.114561245 +0000 UTC m=+2073925.320925585, currentStep:0, size:95, steps:[add learner peer 647251283 on store 111264, use joint consensus, promote learner peer 647251283 on store 111264 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 647251283 on store 111264 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T20-56-41.085.log:{"level":"INFO","time":"2023/06/03 20:09:43.086 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m4.501608048s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571035]} (kind:replica,region, region:353747615(32492, 366), createAt:2023-06-03 19:52:38.585094575 +0000 UTC m=+2076492.791458915, startAt:2023-06-03 19:52:38.585220846 +0000 UTC m=+2076492.791585186, currentStep:0, size:95, steps:[add learner peer 649352674 on store 350571035, use joint consensus, promote learner peer 649352674 on store 350571035 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 649352674 on store 350571035 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
./pd-2023-06-03T20-56-41.085.log:{"level":"INFO","time":"2023/06/03 20:43:40.073 +00:00","caller":"operator_controller.go:589","message":"operator timeout","region-id":353747615,"takes":"17m3.949959334s","operator":"replace-rule-down-peer {mv peer: store [350571030] to [355469333]} (kind:replica,region, region:353747615(32492, 367), createAt:2023-06-03 20:26:36.120668917 +0000 UTC m=+2078530.327033245, startAt:2023-06-03 20:26:36.123814362 +0000 UTC m=+2078530.330178702, currentStep:0, size:95, steps:[add learner peer 652969200 on store 355469333, use joint consensus, promote learner peer 652969200 on store 355469333 to voter, demote voter peer 353747617 on store 350571030 to learner, leave joint state, promote learner peer 652969200 on store 355469333 to voter, demote voter peer 353747617 on store 350571030 to learner, remove peer on store 350571030],timeout:[17m0s]) timeout","additional-info":""}
  • PD does not remove the orphan peers.
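
To correlate the timeout logs from the first cause with the learners listed in the region, the learner peer ID and target store can be pulled out of each log line. This is a rough sketch, not PD tooling: the function name `parse_timeout` and the regular expressions are assumptions, and `sample_line` is an abbreviated copy of the 13:36:04 entry above (the createAt/startAt fields and most steps are trimmed).

```python
import json
import re

ADD_LEARNER = re.compile(r"add learner peer (\d+) on store (\d+)")
CURRENT_STEP = re.compile(r"currentStep:(\d+)")


def parse_timeout(line: str):
    """Return (region_id, current_step, learner_peer_id, store_id) or None."""
    # grep prefixes every line with "<file>:", so parse from the first "{".
    start = line.find("{")
    if start < 0:
        return None
    record = json.loads(line[start:])
    if record.get("message") != "operator timeout":
        return None
    operator = record["operator"]
    step = CURRENT_STEP.search(operator)
    learner = ADD_LEARNER.search(operator)
    if step is None or learner is None:
        return None
    return (
        record["region-id"],
        int(step.group(1)),
        int(learner.group(1)),
        int(learner.group(2)),
    )


# Abbreviated copy of one log entry from above.
sample_line = (
    './pd-2023-06-03T15-53-31.123.log:'
    '{"level":"INFO","time":"2023/06/03 13:36:04.587 +00:00",'
    '"caller":"operator_controller.go:589","message":"operator timeout",'
    '"region-id":353747615,"takes":"17m3.98191797s",'
    '"operator":"replace-rule-down-peer {mv peer: store [350571030] to [350571031]} '
    '(kind:replica,region, region:353747615(32492, 353), currentStep:0, size:95, '
    'steps:[add learner peer 522453028 on store 350571031, use joint consensus],'
    'timeout:[17m0s]) timeout","additional-info":""}'
)

parsed = parse_timeout(sample_line)
print(parsed)
```

Learner 522453028 on store 350571031 from this entry is still present in the region's peer list above, which matches the report that the learner added at step 0 is left behind after the operator times out.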

What did you expect to see?

The region should not accumulate so many learners.

What did you see instead?

The region has too many learners.

What version of PD are you using (pd-server -V)?

v7.1.0

@nolouch nolouch added the type/bug The issue is confirmed as a bug. label Jun 8, 2023
@nolouch nolouch added severity/critical affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. labels Jun 8, 2023
@ti-chi-bot ti-chi-bot bot closed this as completed in #6574 Jun 8, 2023
ti-chi-bot bot pushed a commit that referenced this issue Jun 8, 2023
close #6573

rule-checker: fix the too many orphan peers that cannot be removed
- let the health peer can be removed once there exist redundant

Signed-off-by: nolouch <[email protected]>
ti-chi-bot bot pushed a commit that referenced this issue Jun 9, 2023
close #6573, ref #6574

rule-checker: fix the too many orphan peers that cannot be removed
- let the health peer can be removed once there exist redundant

Signed-off-by: nolouch <[email protected]>

Co-authored-by: nolouch <[email protected]>
ti-chi-bot bot added a commit that referenced this issue Jun 26, 2023
close #6573, ref #6574

rule-checker: fix the too many orphan peers that cannot be removed
- let the health peer can be removed once there exist redundant

Signed-off-by: nolouch <[email protected]>

Co-authored-by: nolouch <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>