Data and loads are not well distributed in my benchmark #7349
Hi, @jaltabike. Thank you for your feedback.
PTAL @liukun4515
Can you provide a more detailed YCSB workload file and load script?
@liukun4515
And my load script is as follows:
*** My first comment contained incorrect information, so I've corrected it: I changed the "target" value of the YCSB config from 20000 to 32000.
@liukun4515 All the configs of TiDB and YCSB were identical to those of the above benchmark except:
As the result of this benchmark, the trends were similar to those of the above benchmark but more severe. (Maybe due to the above result) the performance of the TiDB benchmark was worse than expected, as follows: Please let me know why this is happening. Thank you.
This situation is caused by a hot region.
Thank you, @liukun4515
@jaltabike
Any update on your benchmark result?
@liukun4515 Can TiDB split the region automatically if the user changes SHARD_ROW_ID_BITS via ALTER TABLE?
The value of SHARD_ROW_ID_BITS doesn't affect region splitting. @dbjoa
Thank you for your answer, @liukun4515
@liukun4515 @XuHuaiyu I just set the value of SHARD_ROW_ID_BITS to 4 and performed the same benchmark as the first one mentioned above. As a result, it seemed that data and loads were well distributed, as follows:
Now, I'm performing the same benchmark as the second one mentioned above, which will take about a day. I'll share those results as soon as possible. By the way, why might setting a large value of SHARD_ROW_ID_BITS lead to a large number of RPC requests? And what does "scheduler pending commands" mean?
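The sharding change described above was presumably applied with a statement along these lines. This is only a minimal sketch, assuming the usertable schema from the issue description below, not the reporter's exact command:

-- Add shard bits to the hidden _tidb_rowid so that new inserts are scattered
-- across 2^4 shards instead of appending to a single hot region.
-- (Sketch only; SHARD_ROW_ID_BITS applies to tables without an integer primary key.)
ALTER TABLE usertable SHARD_ROW_ID_BITS = 4;

Note that this should only affect rows inserted after the change; existing rows keep their previously allocated row IDs.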
@jaltabike
Hi, this is an answer to your previous question:
TiKV handles write requests in a component called Scheduler. If there are many scheduler pending commands for one TiKV instance, it usually means that there are too many writes for that TiKV instance. If only one TiKV instance has many pending commands while the others do not, it means that the write requests are not balanced across all TiKVs, which usually means that there are hot write regions.
@winoros @breeswish
I see that you use zipfian for the request distribution; it will cause hot keys and, in turn, hot regions. A better approach may be to use uniform.
@siddontang
@liukun4515 @XuHuaiyu I just set the value of SHARD_ROW_ID_BITS to 4 and performed the same benchmark as the second one mentioned above. As a result, it seemed that data and loads were relatively well distributed, as follows:
Except for the above two exceptions, everything works well now!
For exception 1, can you check for an OOM message in dmesg? I guess the TiKV restarted at that time.
@siddontang
Oh, this is a known problem. /cc @breeswish
@siddontang, would you let us know which issue was filed?
Hmm, currently when our scheduler channel is full, TiKV will panic. You can adjust the configuration. The reason the scheduler channel becomes full is related to the large amount of log output from the scheduler when write operations take > 1s. This issue should have been solved in the master branch, but not in 2.0.x.
@siddontang @breeswish Again, I performed the same benchmark as my latest benchmark above. The only difference was the TiDB version; I used TiDB 2.1.0-rc.1, which is the latest version I can install using tidb-ansible now. As a result, it seemed that data and loads were relatively well distributed, and there was no error during the benchmark. However, there were two unusual things.
@jaltabike Sorry, I missed the notification. Would you like to share the TiKV metrics in the "Threads" section? We can help diagnose the issue. We do currently have some known edge cases that may cause what you are seeing, and the fix requires some configuration adjustments. However, we need to identify it first.
@jaltabike Hi, there should be a dashboard called "xxx-tikv" (by default,
Hi @jaltabike. The jitter is directly related to the panic issue you faced previously. Previously this kind of jitter would cause a panic; now it will not panic, but it will slow down writes for a short time. This situation is improved in the 2.1 (master) branch, because logging becomes async. Currently I have no idea about the active written leader issue. Would you like to share your logs? We may be able to investigate the cause from them. You could also try the latest master, which provides better region split functionality; it may help stabilize the write QPS.
@breeswish Please note that there are some errors such as "get snapshot failed". Thank you.
@jaltabike I have got your logs. I will keep you updated once there are some findings.
I am going to close this issue as stale. Please feel free to re-open it if you are still experiencing issues. Thank you!
Please answer these questions before submitting your issue. Thanks!
If possible, provide a recipe for reproducing the error.
I tried to insert tuples into TiDB by using YCSB (Yahoo! Cloud Serving Benchmark).
My TiDB settings were as follows:
. node#1~#3: 1 x TiDB & 1 x PD were deployed on each instance (i.e., 3 x TiDBs & 3 x PDs in total)
. node#4~#9: 1 x TiKV was deployed on each instance (i.e., 6 x TiKVs in total)
. prepared_plan_cache: enabled: true
. raftstore: sync-log: false
. rocksdb: bytes-per-sync: "1MB"
. rocksdb: wal-bytes-per-sync: "512KB"
. raftdb: bytes-per-sync: "1MB"
. raftdb: wal-bytes-per-sync: "512KB"
. storage: scheduler-concurrency: 1024000
. storage: scheduler-worker-pool-size: 6
. *: max-write-buffer-number: 10
CREATE TABLE usertable (
YCSB_KEY VARCHAR(255) PRIMARY KEY,
FIELD0 TEXT, FIELD1 TEXT,
FIELD2 TEXT, FIELD3 TEXT,
FIELD4 TEXT, FIELD5 TEXT,
FIELD6 TEXT, FIELD7 TEXT,
FIELD8 TEXT, FIELD9 TEXT
);
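For reference, the sharding setting discussed in the comments above can also be declared when the table is created. The following is only a hedged sketch: newer TiDB versions document SHARD_ROW_ID_BITS as a CREATE TABLE option, but whether v2.0.5 accepts it at creation time is an assumption here.

-- Sketch: same schema, with row-ID sharding set at creation time so that
-- sequential inserts are spread across regions from the start.
CREATE TABLE usertable (
  YCSB_KEY VARCHAR(255) PRIMARY KEY,
  FIELD0 TEXT, FIELD1 TEXT,
  FIELD2 TEXT, FIELD3 TEXT,
  FIELD4 TEXT, FIELD5 TEXT,
  FIELD6 TEXT, FIELD7 TEXT,
  FIELD8 TEXT, FIELD9 TEXT
) SHARD_ROW_ID_BITS = 4;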
And YCSB settings were as follows:
. maxexecutiontime: 3600 (i.e., insert tuples for 3600 seconds)
. target: 32000 (i.e., insert tuples at a rate of 32000 insertions per second)
. threads: 512 (i.e., use 512 client threads to insert tuples)
. db.batchsize: 100 & jdbc.autocommit: false (i.e., commit every 100 inserts)
What did you expect to see?
I expected the data and loads to be well distributed.
What did you see instead?
Data and loads were not well distributed. I think the performance (TPS and latency) was limited by this.
Data were not well distributed as follows:
Loads for TiKV were not well distributed as follows:
(Overall, tikv1 seemed to be overloaded.)
Questions
Why were data not well-distributed in my situation?
Why were loads (CPU, IO, Network) not well-distributed in my situation?
What is "scheduler pending commands"? Is it related to unbalanced load or data?
Can the region size exceed region-max-size? The following graph shows that the average region size was 20.3 GiB, which is too big! It seems that there was one very large region. If there was such a region, why did it not split?
What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?
v2.0.5