It takes 2 hours for tikv failover #362

zyguan · 2019-04-01T06:16:24Z

When a TiKV store goes down, its state in PD first turns to Disconnected, then becomes Down after 1 hour. According to the failover logic here, the operator then waits another hour, so failover takes 2 hours in total. Is this the expected behavior? It's a little misleading.

weekface · 2019-04-01T06:37:51Z

Currently, failover is triggered only after the store has been Down for 1 hour. The time spent in the Disconnected state was not considered.

@tennix @xiaojingchen PTAL

tennix · 2019-04-01T06:54:00Z

How does PD handle a failed TiKV? In particular, when does PD begin to schedule the data on the failed store? @nolouch

nolouch · 2019-04-01T06:58:44Z

@zyguan Can you show the PD config? In pd-ctl, run: config show all

zyguan · 2019-04-01T07:49:14Z

@nolouch here it is.

{
  "client-urls": "http://0.0.0.0:2379",
  "peer-urls": "http://0.0.0.0:2380",
  "advertise-client-urls": "http://demo-pd-1.demo-pd-peer.test-calico-ipip.svc:2379",
  "advertise-peer-urls": "http://demo-pd-1.demo-pd-peer.test-calico-ipip.svc:2380",
  "name": "demo-pd-1",
  "data-dir": "/var/lib/pd",
  "initial-cluster": "demo-pd-1=http://demo-pd-1.demo-pd-peer.test-calico-ipip.svc:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "",
      "log-rotate": true,
      "max-size": 0,
      "max-days": 0,
      "max-backups": 0
    }
  },
  "log-file": "",
  "log-level": "",
  "tso-save-interval": "3s",
  "metric": {
    "job": "demo-pd-1",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 0,
    "max-merge-region-keys": 0,
    "split-merge-interval": "1h0m0s",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "1h0m0s",
    "leader-schedule-limit": 4,
    "region-schedule-limit": 4,
    "replica-schedule-limit": 8,
    "merge-schedule-limit": 8,
    "tolerant-size-ratio": 5,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-replace-offline-replica": "false",
    "disable-make-up-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-location-replacement": "false",
    "disable-namespace-relocation": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false
      },
      {
        "type": "label",
        "args": null,
        "disable": false
      }
    ]
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "zone,rack,host"
  },
  "namespace": {},
  "cluster-version": "2.1.3",
  "quota-backend-bytes": "0 B",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": ""
  },
  "label-property": {},
  "WarningMsgs": null,
  "namespace-classifier": "table"
}
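The setting that matters for this issue is schedule.max-store-down-time, which is "1h0m0s" in the output above. As a minimal sketch (not part of pd-ctl or tidb-operator), the value could be read from that JSON and converted to a duration like this. The helper name parse_go_duration is hypothetical, and it only handles the h/m/s units that appear in this setting.

```python
import json
import re
from datetime import timedelta

# Seconds per unit; only the units used by "1h0m0s"-style values are supported.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_go_duration(text: str) -> timedelta:
    """Parse a simple Go-style duration such as '1h0m0s' (hypothetical helper)."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(h|m|s)", text)
    # Reject anything the pattern did not fully consume, e.g. '100ms'.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts))


# Excerpt of the config shown above, reduced to the relevant key.
config_excerpt = '{"schedule": {"max-store-down-time": "1h0m0s"}}'
max_store_down_time = parse_go_duration(
    json.loads(config_excerpt)["schedule"]["max-store-down-time"]
)
print(max_store_down_time)
```

With the default shown here, a store is reported as Down one hour after PD stops hearing from it.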

weekface · 2019-04-03T07:06:24Z

@nolouch PTAL

weekface · 2019-04-03T07:33:09Z

We should fail over as soon as the TiKV instance becomes Down. There is no need to wait another hour. @zyguan

zyguan · 2019-04-03T15:24:59Z

So failover should be triggered shortly after pd.maxStoreDownTime has elapsed, rather than after 2*pd.maxStoreDownTime.

weekface · 2019-04-04T02:20:52Z

Yes
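The timing discussed in this thread can be sketched as follows. This is a simplified model under stated assumptions, not the operator's actual code: the function names are hypothetical, and the extra wait before the fix is assumed to equal max-store-down-time, as implied by the 2*pd.maxStoreDownTime figure above.

```python
from datetime import datetime, timedelta

# Value of schedule.max-store-down-time from the PD config in this thread.
MAX_STORE_DOWN_TIME = timedelta(hours=1)


def pd_marks_down_at(last_heartbeat: datetime) -> datetime:
    """PD reports the store as Down once it has been silent for max-store-down-time."""
    return last_heartbeat + MAX_STORE_DOWN_TIME


def failover_before_fix(last_heartbeat: datetime) -> datetime:
    """Reported behavior (assumed model): wait a further hour after the store is Down."""
    return pd_marks_down_at(last_heartbeat) + MAX_STORE_DOWN_TIME


def failover_after_fix(last_heartbeat: datetime) -> datetime:
    """Agreed behavior: fail over as soon as the store is Down."""
    return pd_marks_down_at(last_heartbeat)


# Illustrative timestamp only; not taken from any log in this issue.
last_heartbeat = datetime(2019, 4, 1, 6, 0, 0)
delay_before = failover_before_fix(last_heartbeat) - last_heartbeat
delay_after = failover_after_fix(last_heartbeat) - last_heartbeat
print(delay_before, delay_after)
```

In this model the delay drops from twice max-store-down-time to max-store-down-time itself; in practice the operator would act on its next sync after the store turns Down, hence "shortly after".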

zyguan added the type/question Further information is requested label Apr 1, 2019

weekface added type/bug Something isn't working and removed type/question Further information is requested labels Apr 3, 2019

weekface added the test/stability stability tests label Apr 4, 2019

weekface mentioned this issue Apr 4, 2019

fix tikv failover #368

Merged

tennix closed this as completed in #368 Apr 9, 2019

zyguan mentioned this issue Apr 10, 2019

stability: add sst-file-corruption case #382

Merged
