checksum fails easily by tikv timeout when restore big table with local backend #365

glorv · 2020-08-06T06:36:32Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

What did you do? If possible, provide a recipe for reproducing the error.
When using Lightning to restore a big table with the local backend, I have several times seen the following error in the Lightning logs:

[2020/08/05 21:05:39.713 +08:00] [ERROR] [main.go:82] ["tidb lightning encountered error stack info"] [error="restore table `test`.`t` failed: compute remote checksum failed: Error 9002: TiKV server timeout"]

Furthermore, in some benchmarks, after the data load has finished, manually running select count(*) on a big table may fail with:

mysql> select count(*) from t;
ERROR 1105 (HY000): Execution terminated due to exceeding the deadline

After consulting @breeswish, the root cause is that TiKV took more than 1 minute to process a subtask in one region, either because the region is too large or because the task waited too long.

I think the primary cause is that, with the local backend, region key ranges are not split accurately at the 96M size, so some regions may be too big.
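To illustrate what "accurate" splitting could look like, here is a minimal sketch that walks sorted key-value pairs and closes a range once the next pair would push it past a size limit. This is not Lightning's actual code: `kvPair`, `keyRange`, and `splitBySize` are hypothetical names, and the demo uses a tiny limit in place of the 96M region size.

```go
package main

import "fmt"

// kvPair is a hypothetical stand-in for one encoded key-value pair.
type kvPair struct {
	Key   []byte
	Value []byte
}

// keyRange records the first and last key of a batch and its total byte size.
type keyRange struct {
	Start []byte // first key in the batch (inclusive)
	End   []byte // last key in the batch (inclusive)
	Size  int64  // sum of key and value lengths in the batch
}

// splitBySize assumes pairs are already sorted by key. It closes the current
// range as soon as adding the next pair would exceed limit, so every range
// stays at or below limit unless a single pair is itself larger than limit.
func splitBySize(pairs []kvPair, limit int64) []keyRange {
	var ranges []keyRange
	var cur *keyRange
	for _, p := range pairs {
		sz := int64(len(p.Key) + len(p.Value))
		if cur != nil && cur.Size+sz > limit {
			ranges = append(ranges, *cur)
			cur = nil
		}
		if cur == nil {
			cur = &keyRange{Start: p.Key}
		}
		cur.End = p.Key
		cur.Size += sz
	}
	if cur != nil {
		ranges = append(ranges, *cur)
	}
	return ranges
}

func main() {
	// Five pairs of 10 bytes each; a real caller would pass a limit
	// close to the region size (96M in this issue) instead of 25.
	var pairs []kvPair
	for i := 1; i <= 5; i++ {
		key := []byte(fmt.Sprintf("k%d", i))
		pairs = append(pairs, kvPair{Key: key, Value: make([]byte, 8)})
	}
	for _, r := range splitBySize(pairs, 25) {
		fmt.Printf("range [%s, %s] size=%d\n", r.Start, r.End, r.Size)
	}
}
```

The point of the sketch is only that split points are derived from accumulated byte size rather than from an estimate, which keeps any single region from growing far beyond the target.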

What did you expect to see?
The checksum should succeed without retry, and select count(*) should return successfully.
What did you see instead?
Versions of the cluster
Operation logs
Configuration of the cluster and the task
Screenshot/exported-PDF of Grafana dashboard or metrics' graph in Prometheus for TiDB-Lightning if possible

lance6716 · 2020-08-13T08:11:39Z

https://internal.pingcap.net/jira/browse/TIDB-4758

for your reference

glorv added the type/bug This issue is a bug report label Aug 6, 2020

kennytm added difficulty/3-hard Hard issue priority/P2 Medium priority issue labels Aug 6, 2020

kennytm added the status/WIP Work in progress label Aug 7, 2020

sre-bot added the severity/major label Aug 10, 2020

glorv mentioned this issue Aug 12, 2020

backend: split and ingest region size more precise #369

Merged

glorv closed this as completed in #369 Sep 3, 2020
