Data and loads are not well distributed in my benchmark #7349

Closed
jaltabike opened this issue Aug 10, 2018 · 33 comments
Labels
type/question The issue belongs to a question.

@jaltabike

jaltabike commented Aug 10, 2018

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

I tried to insert tuples into TiDB by using YCSB (Yahoo! Cloud Serving Benchmark).
My TiDB settings were as follows:

  • I used 9 AWS EC2 instances (each instance is i3.2xlarge: 8 vCPUs, 61GiB RAM, 1 x 1900GB NVMe disk)
    . node#1~#3: 1 x TiDB & 1 x PD were deployed on each instance (i.e., total 3 x TiDBs & 3 x PDs)
    . node#4~#9: 1 x TiKV was deployed on each instance (i.e., total 6 x TiKVs)
  • tidb_version: v2.0.5
  • I used the default TiDB config, with the following exception:
    . prepared_plan_cache: enabled: true
  • I used the default PD config.
  • I used the default TiKV config, with the following exceptions:
    . raftstore: sync-log: false
    . rocksdb: bytes-per-sync: "1MB"
    . rocksdb: wal-bytes-per-sync: "512KB"
    . raftdb: bytes-per-sync: "1MB"
    . raftdb: wal-bytes-per-sync: "512KB"
    . storage: scheduler-concurrency: 1024000
    . storage: scheduler-worker-pool-size: 6
    . *: max-write-buffer-number: 10
  • I created one table like follows:
    CREATE TABLE usertable (
    YCSB_KEY VARCHAR(255) PRIMARY KEY,
    FIELD0 TEXT, FIELD1 TEXT,
    FIELD2 TEXT, FIELD3 TEXT,
    FIELD4 TEXT, FIELD5 TEXT,
    FIELD6 TEXT, FIELD7 TEXT,
    FIELD8 TEXT, FIELD9 TEXT
    );

And YCSB settings were as follows:

  • I used YCSB "load" command to insert tuples into TiDB
  • I used the following config for YCSB "load"
    . maxexecutiontime: 3600 (i.e., insert tuples for 3600 seconds)
    . target: 32000 (i.e., insert tuples at a rate of 32000 insertions per sec)
    . threads: 512 (i.e., use 512 client threads to insert tuples)
    . db.batchsize: 100 & jdbc.autocommit: false (i.e., commit every 100 inserts)
  2. What did you expect to see?
    I expected the data and loads to be well distributed.

  3. What did you see instead?
    Data and loads were not well distributed. I think the performance (TPS and latency) was limited by this.

Data were not well distributed as follows:

  • Store size => the store size of tikv_1 and tikv_7 was 3 times larger than that of the others
    image
  • Leader and region distribution => It was unbalanced
    image
    image

Loads for TiKV were not well distributed as follows:
(Overall, tikv1 seemed to be overloaded.)

  • CPU => tikv_1 used about 200% more CPU than the other tikvs (tikv_7 about 100% more)
    image
  • Load => the load on tikv1(=tikv_1) was about twice that of the others
    image
  • IO Util => the IO util of tikv1(=tikv_1) was about six times that of the others (about twice for tikv5(=tikv_7))
    image
  • Network traffic Inbound => the inbound traffic of tikv1 and tikv5 was about twice that of the others
    image
  • Network traffic Outbound => the outbound traffic of tikv1 was much larger than that of the others
    image
  • Region average written keys => most of the keys seemed to be written to tikv_1
    image
  • Scheduler pending commands => most of the scheduler pending commands were on tikv_1
    image
  4. Questions
    Why were data not well-distributed in my situation?
    Why were loads (CPU, IO, Network) not well-distributed in my situation?
    What is "scheduler pending commands"? Is it related to unbalanced load or data?
    Can the region size exceed region-max-size? The following graph shows that the average region size was 20.3GiB, which is far too big! It seems that there was one very large region. If there was such a region, why didn't it split?
    image

  5. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?
    v2.0.5

@XuHuaiyu
Contributor

Hi, @jaltabike Thank you for your feedback.
We'll reply to you soon.

@XuHuaiyu XuHuaiyu added the type/question The issue belongs to a question. label Aug 10, 2018
@XuHuaiyu
Contributor

PTAL @liukun4515

@liukun4515
Contributor

Can you provide a more detailed YCSB workload file and your load script?
@jaltabike

@jaltabike
Author

@liukun4515
My YCSB properties(configs) are as follows:

  • maxexecutiontime=3600
  • recordcount=1000000000
  • operationcount=1000000000
  • workload=com.yahoo.ycsb.workloads.CoreWorkload
  • readallfields=true
  • readproportion=0
  • updateproportion=0
  • scanproportion=0
  • insertproportion=1
  • requestdistribution=zipfian
  • insertstart=0
  • fieldcount=10
  • fieldlength=100
  • reportlatencyforeacherror=true
  • db.driver=com.mysql.jdbc.Driver
  • db.url=jdbc:mysql://tidb-proxy:4001/ycsb?useSSL=false&useServerPrepStmts=true
  • db.user=???
  • db.passwd=???
  • db.batchsize=100
  • jdbc.batchupdateapi=false
  • jdbc.autocommit=false

And, my load script is as follows:
ycsb -jvm-args="${JVM_ARGS}" load jdbc -P ${PROPERTY_FILE} -s -target 32000 -threads 512
where

  • JVM_ARGS="-Xms8G -Xmx8G -XX:PermSize=512m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled"
  • PROPERTY_FILE contains properties I listed above.

*** My first comment contained incorrect information, so I've corrected the information: I've changed the "target" value of the YCSB config from 20000 to 32000.

@jaltabike
Author

@liukun4515
I performed another benchmark.

All the configs of TiDB and YCSB were identical to those of the above benchmark, except:

  • (for YCSB) maxexecutiontime = 0 (i.e., the benchmark finished when the predefined number of tuples (specified by recordcount (=1000000000)) had been inserted)
  • (for YCSB) target = 20000 (i.e., insert tuples at a rate of 20000 insertions per sec)

As a result, the trends were similar to those of the above benchmark, but more severe.

  • Data were not well distributed as follows:
    image
    image
    image

  • Loads for TiKV were not well distributed as follows:
    image
    image
    image
    image
    image
    image
    image

  • Approximate region size continued to increase; I suspect that region split did not work properly.
    image

(Maybe due to the above result,) the performance of the TiDB benchmark was worse than expected, as follows:
(*From 22:00, the performance was very unstable.)
image

Please let me know why this is happening.

Thank you.

@liukun4515
Contributor

liukun4515 commented Aug 16, 2018

This situation is caused by hot region.
You can refer to _tidb_rowid and SHARD_ROW_ID_BITS in this doc
@jaltabike

@jaltabike
Author

Thank you, @liukun4515
I will change the value of SHARD_ROW_ID_BITS and run the ycsb benchmark again.
By the way, can the region size exceed region-max-size?
The "approximate region size" graph I attached above shows that the average region size was 20.3GiB, which is far larger than region-max-size (default: 144MB).
Why aren't regions whose size is far larger than region-max-size split?

@liukun4515
Contributor

@jaltabike
The region size may exceed region-max-size.
Range-based splitting scans the data in the region, which takes some time.
When most of the data is loaded into a hot region, the splitting speed cannot keep up with the loading speed, so the region keeps growing.
You can set SHARD_ROW_ID_BITS so that the data is written into multiple different regions.
Splitting a region by keys speeds up the split because it does not need to scan the data.
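
For illustration, a minimal sketch against the usertable schema above (assuming a TiDB version that supports the SHARD_ROW_ID_BITS table option; 4 is just an example value):

    -- Scatter the implicit _tidb_rowid across 2^4 = 16 shards so that
    -- sequential inserts are spread over different regions.
    ALTER TABLE usertable SHARD_ROW_ID_BITS = 4;

As far as I understand, this only applies to tables without an integer primary key, since those are the tables that use the hidden _tidb_rowid.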

@liukun4515
Contributor

Any update on your benchmark results?

@dbjoa
Contributor

dbjoa commented Aug 17, 2018

@liukun4515 Can TiDB split the region automatically if the user changes SHARD_ROW_ID_BITS via ALTER TABLE?

@liukun4515
Contributor

liukun4515 commented Aug 17, 2018

The value of SHARD_ROW_ID_BITS doesn't affect the region split. @dbjoa

@jaltabike
Author

Thank you for your answer, @liukun4515
I cannot access my workspace until next Tuesday (Aug. 21st).
I will run the benchmark next Wednesday (Aug. 22nd), and I will share the results as soon as possible.

@jaltabike
Author

@liukun4515 @XuHuaiyu
I would like to share my new benchmark results.

I set the value of SHARD_ROW_ID_BITS to 4 and performed the same benchmark as the first one mentioned above.

As a result, it seemed that data and loads were well-distributed, as follows:

  • Data were well-distributed:
    image
    image
    image

  • Loads for TiKV were relatively well-distributed:
    image
    image
    image
    image
    image
    (*For outbound network traffic, tikv2 was twice as much as the others.)
    image
    (*For "region average written keys", tikv_4(=tikv2) was up to twice as much as the others.)
    image

  • Region size seemed normal:
    image

Now, I'm performing the same benchmark as the second one mentioned above, which will take about a day.

I'll also share the results as soon as possible.

By the way, why might "setting a large value of SHARD_ROW_ID_BITS" lead to a large number of RPC requests?

And what does "scheduler pending commands" mean?

@winoros
Member

winoros commented Aug 22, 2018

@jaltabike
When there is no integer primary key and you perform a range query on the table's primary key that needs to fetch not only the index key but also other columns, TiDB will first scan the index to get the corresponding _tidb_rowid, and then use the _tidb_rowid to seek the table data. You can EXPLAIN the SQL to see its execution plan.
SHARD_ROW_ID_BITS won't affect the process of scanning the index, but it will affect the process of scanning the table.
The table's data is stored in _tidb_rowid order. So when there is no SHARD_ROW_ID_BITS, the _tidb_rowid values corresponding to consecutive index values are more likely to be stored in the same region; with SHARD_ROW_ID_BITS, they will be stored in multiple regions to avoid hot spots.
One RPC task sent to TiKV covers only one region. So the more regions a query involves, the more RPC requests are sent.
Thus, the larger SHARD_ROW_ID_BITS is, the more RPC requests are sent.
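
As a rough illustration of that plan shape (a sketch only; the key values are made up and the plan operator names differ across TiDB versions):

    -- Range query on the non-integer primary key: the plan generally shows an
    -- index range scan on YCSB_KEY followed by a table lookup via _tidb_rowid.
    EXPLAIN SELECT * FROM usertable
    WHERE YCSB_KEY BETWEEN 'user100' AND 'user200';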

@breezewish
Member

Hi, this is an answer for your previous question:

What is "scheduler pending commands"? Is it related to unbalanced load or data?

TiKV handles write requests in a component called Scheduler. If there are many scheduler pending commands for one TiKV instance, it usually means that there are too many writes for this TiKV instance. If only one TiKV instance has many pending commands but others do not, it means that the write requests are not balanced to all TiKVs, which usually means that there are hot write regions.

@jaltabike
Author

@winoros @breeswish
Thank you for your answer~!

@siddontang
Member

@jaltabike

I see that you use zipfian for the request distribution; it will cause hot keys and therefore hot regions. A better approach may be to use uniform.
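
In the YCSB property file listed above, that change would be a one-line edit (a sketch of the standard CoreWorkload property):

    # Spread key accesses evenly instead of concentrating them on a few hot keys
    requestdistribution=uniform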

@jaltabike
Author

@siddontang
Thank you for your comment.
The "hot region" you mention may contain index entries (not tuples).
As I understand it,
because the table used in my benchmark has a non-integer PK,
TiDB uses an implicit auto-increment row ID to distribute the data.
Therefore, this row ID seems to have nothing to do with my benchmark configuration (zipfian),
and there may be no "hot region" of tuples caused by this config (zipfian) if I use a proper value for SHARD_ROW_ID_BITS.
Is what I understood correct?

@jaltabike
Author

@liukun4515 @XuHuaiyu
I would like to share my new benchmark results.

I set the value of SHARD_ROW_ID_BITS to 4 and performed the same benchmark as the second one mentioned above.

As a result, it seemed that data and loads were relatively well-distributed, as follows:

  • Data were relatively well-distributed:
    image
    image
    image

  • Loads for TiKV were not perfectly well-distributed, but there were no major differences or bottleneck nodes:
    image
    (*For "CPU", tikv_4(=tikv2) was about twice as high as tikv_6(=tikv4) over a certain period of time (12:00-16:00).)
    image
    (*For "Load", tikv2 was about 5 times higher than tikv4 over a certain period of time (12:00-16:00).)
    image
    image
    image
    ("IO Util" and "Network Traffic" were relatively well-distributed.)

  • Region size seemed normal:
    image

  • Exception#1
    Memory used by tikv_4 (and also tikv_5) dropped abruptly at 16:43 (20:45 for tikv_5).
    This caused a sudden, temporary performance (TPS) degradation.
    At around the same time, "flush messages per server" of tikv_4 (and tikv_5) increased abruptly.
    In addition, until around that time, "scheduler pending commands" of tikv_4 (and tikv_5) was considerably higher than on the others.
    Could you explain what happened at tikv_4 and tikv_5?
    image
    image
    image

  • Exception#2
    There were some peaks in "region average written keys" (at around 12:00, 14:00, and 20:00-22:00).
    "Active written leaders" seemed to be related to this symptom.
    Could you explain what happened at those times?
    image
    image

Except for the above two exceptions, everything worked well!
I wrote two questions about these exceptions above.
Could you explain why they happened?

@siddontang
Member

For exception 1, can you check for OOM messages in dmesg? I guess TiKV restarted at that time.

@jaltabike
Author

@siddontang
Because there was enough memory available (total memory: 61GiB; memory used by tikv_4: 15GiB at max), it is unlikely that OOM occurred.
I couldn't find any OOM message in dmesg (*command I used: dmesg | egrep -i 'killed process'), but tikv.log shows that TiKV restarted at that time, as you expected.
And there is an error that seems to be related to this restart:
2018/08/22 16:41:24.075 panic_hook.rs:97: [ERROR] thread 'apply worker' panicked 'schedule WriteFinished msg failed, cid=301417729, err:channel is full' at "src/storage/txn/scheduler.rs:305" stack backtrace: 0: 0x56532b8eea9e - backtrace::backtrace::libunwind::trace::h79c4af28be231e54 at /home/jenkins/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54 - backtrace::backtrace::trace::h56424d341baa1bce at /home/jenkins/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70 1: 0x56532b417937 - tikv::util::panic_hook::set_exit_hook::{{closure}}::h0b6c560535990cc2 at /home/jenkins/workspace/build_tikv_2.0/go/src/github.com/pingcap/tikv/target/release/build/backtrace-5e253aa535e0fe5b/out/capture.rs:79 2: 0x56532bac1005 - std::panicking::rust_panic_with_hook::h1be8ff44b2f85fd7 at libstd/panicking.rs:403 3: 0x56532bac0e9d - std::panicking::begin_panic_fmt::h9462a882b9175ee8 at libstd/panicking.rs:349 4: 0x56532b4ae536 - tikv::storage::txn::scheduler::make_engine_cb::{{closure}}::hfe27a1a97e21ecd1 at src/storage/txn/scheduler.rs:305 - <F as alloc::boxed::FnBox<A>>::call_box::h57ec56596bccc902 at /checkout/src/liballoc/boxed.rs:635 5: 0x56532b210cc1 - <alloc::boxed::Box<alloc::boxed::FnBox<A, Output$u3d$R$GT$$u20$$u2b$$u20$core..marker..Send$u20$$u2b$$u20$$u27$a$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h82cf0030dfdbbb3c at /checkout/src/liballoc/boxed.rs:655 - <tikv::storage::engine::raftkv::RaftKv<S> as tikv::storage::engine::Engine>::async_write::{{closure}}::ha91f34c2dcbcc53f at /home/jenkins/workspace/build_tikv_2.0/go/src/github.com/pingcap/tikv/src/storage/engine/raftkv.rs:351 - <F as alloc::boxed::FnBox<A>>::call_box::h1b7136ccb46725a4 at /checkout/src/liballoc/boxed.rs:635 6: 0x56532b2157e6 - <alloc::boxed::Box<alloc::boxed::FnBox<A, Output$u3d$R$GT$$u20$$u2b$$u20$core..marker..Send$u20$$u2b$$u20$$u27$a$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h2a1f8642cca6fa56 at /checkout/src/liballoc/boxed.rs:655 - <tikv::storage::engine::raftkv::RaftKv<S>>::exec_write_requests::{{closure}}::h814c4b4d2178d49f at /home/jenkins/workspace/build_tikv_2.0/go/src/github.com/pingcap/tikv/src/storage/engine/raftkv.rs:253 - <F as alloc::boxed::FnBox<A>>::call_box::hd11a74bf0727282b at /checkout/src/liballoc/boxed.rs:635 7: 0x56532b481e13 - <alloc::boxed::Box<alloc::boxed::FnBox<A, Output$u3d$R$GT$$u20$$u2b$$u20$core..marker..Send$u20$$u2b$$u20$$u27$a$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h89ecb0376eab96e6 at /checkout/src/liballoc/boxed.rs:655 - tikv::raftstore::store::msg::Callback::invoke_with_response::hd97635c8677b0280 at src/raftstore/store/msg.rs:74 8: 0x56532b639959 - tikv::raftstore::store::worker::apply::ApplyCallback::invoke_all::{{closure}}::hef317f5553c32ffa at src/raftstore/store/worker/apply.rs:210 - <core::option::Option<T>>::map::hd0180dda22da680f at /checkout/src/libcore/option.rs:404 - tikv::raftstore::store::worker::apply::ApplyCallback::invoke_all::h041788c5f74ac70a at src/raftstore/store/worker/apply.rs:210 - tikv::raftstore::store::worker::apply::ApplyContextCore::write_to_db::hc40d3a232f21a17d at src/raftstore/store/worker/apply.rs:332 9: 0x56532b63a69f - tikv::raftstore::store::worker::apply::ApplyContextCore::commit_opt::hcdee16c5f1789e29 at src/raftstore/store/worker/apply.rs:313 10: 0x56532b6375fc - tikv::raftstore::store::worker::apply::ApplyContextCore::commit::h0c1ba7daa9da77d5 at src/raftstore/store/worker/apply.rs:307 - 
tikv::raftstore::store::worker::apply::ApplyDelegate::handle_raft_entry_normal::h4985fdb9a600f05b at src/raftstore/store/worker/apply.rs:620 - tikv::raftstore::store::worker::apply::ApplyDelegate::handle_raft_committed_entries::hd1fb2174284af3f3 at src/raftstore/store/worker/apply.rs:572 11: 0x56532b632e72 - tikv::raftstore::store::worker::apply::Runner::handle_applies::hb6b17ca3a51a2254 at src/raftstore/store/worker/apply.rs:2104 - <tikv::raftstore::store::worker::apply::Runner as tikv::util::worker::Runnable<tikv::raftstore::store::worker::apply::Task>>::run::h6e962018c2147a26 at src/raftstore/store/worker/apply.rs:2218 12: 0x56532bb17adb - <tikv::util::worker::Worker<T>>::start_with_timer::{{closure}}::hb0a2fddaf412a89c at /home/jenkins/workspace/build_tikv_2.0/go/src/github.com/pingcap/tikv/src/util/worker/mod.rs:101 - std::sys_common::backtrace::__rust_begin_short_backtrace::hd6f18288a5a93ece at /checkout/src/libstd/sys_common/backtrace.rs:136 13: 0x56532b2163a4 - std::thread::Builder::spawn::{{closure}}::h6682623a4453c1b1 at /checkout/src/libstd/thread/mod.rs:406 - <F as alloc::boxed::FnBox<A>>::call_box::hfa61078dd51ed6db at /checkout/src/liballoc/boxed.rs:635 14: 0x56532bab7c87 - std::sys::unix::thread::Thread::new::thread_start::h0c3a0b0523777f20 at /checkout/src/liballoc/boxed.rs:645 15: 0x7f24397566b9 - start_thread 16: 0x7f243927641c - clone 17: 0x0 - <unknown>
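
(As a side note, a slightly broader check for OOM-killer activity than the grep above might look like the following; this is just a sketch and assumes a util-linux dmesg that supports -T for human-readable timestamps:)

    # Search for the oom-killer banner as well as the "Killed process" lines
    dmesg -T | egrep -i 'oom-killer|out of memory|killed process'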

@siddontang
Member

oh, this is a known problem.

/cc @breeswish

@dbjoa
Contributor

dbjoa commented Aug 24, 2018

@siddontang, would you let us know which issue was filed?

@breezewish
Member

breezewish commented Aug 24, 2018

Hmm, currently when the scheduler channel is full, TiKV will panic. You can adjust the configuration scheduler-notify-capacity to a larger value, e.g. 102400, so that it will be less likely to crash.

The reason the scheduler channel becomes full is related to the large amount of log output from the scheduler when write operations take > 1s. This issue should have been solved in the master branch, but not in 2.0.x.
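
In the TiKV configuration file, that would look roughly like the following (a sketch; please verify against the config template of your TiKV version, where I believe the option sits under the [storage] section):

    [storage]
    # Enlarge the scheduler channel so that bursts of pending write commands
    # are less likely to fill it and trigger the panic above
    scheduler-notify-capacity = 102400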

@jaltabike
Author

@siddontang @breeswish
Thank you for your comments.

Again, I performed the same benchmark as my latest benchmark above.

The only difference was the TiDB version; I used TiDB 2.1.0-rc.1, which is the latest version I can currently install using tidb-ansible.

As a result, it seemed that data and loads were relatively well-distributed, and there were no errors during the benchmark.

However, there were two unusual things.

  1. The average TPS stayed at the target TPS (20,000), but there were several temporary drops in TPS during the benchmark, as follows:
    image
    Note that the prior benchmark's result showed relatively stable TPS.
    Could you explain what caused this phenomenon?

  2. There were some peaks in "region average written keys" (at around 04:00-06:00 and 12:00-16:00).
    "Active written leaders" seemed to be related to this symptom.
    Could you explain what happened at those times?
    image

@breezewish
Member

@jaltabike Sorry, I missed the notification. Would you like to share the TiKV metrics in the "Threads" section? We can help diagnose the issue. We do have some known edge cases that may cause what you see, and the fix requires some configuration adjustments. However, we first need to identify the cause.

@jaltabike
Author

Thank you, @breeswish
TiKV metrics in the "Threads" section are as follows:
image
image

There are only 2 metrics in the "Threads" section.
Each metric contains various types of values, so it is not easy to read them.
If you specify which types you want to see, I will share the metrics containing only those values.

@breezewish
Member

@jaltabike Hi, there should be a dashboard called "xxx-tikv" (by default, IP:3000/dashboard/db/test-cluster-tikv), in which there is a panel called "Thread CPU".

@jaltabike
Author

@breeswish
TiKV metrics in the "Thread CPU" section are as follows:
image
image
image
image

I also attach again the metrics that seemed to have problems.
image

@breezewish
Member

Hi @jaltabike

The jitter is directly related to the panic issue you faced previously. Previously this kind of jitter would cause a panic; now it does not panic, but it slows down writes for a short time. This situation is alleviated in the 2.1 (master) branch, because logging has become asynchronous.

Currently I have no idea about the "active written leaders" issue. Would you like to share your logs? We may be able to investigate the cause from them. You could also try the latest master, which provides better region-split functionality and may help stabilize the write QPS.

@jaltabike
Author

jaltabike commented Sep 10, 2018

@breeswish
I attached my log files:
tikv-logaa.zip
tikv-logab.zip
tikv-logac.zip
tikv-logad.zip

Please note that there are some errors such as "get snapshot failed".

Thank you.

@breezewish
Member

@jaltabike I have got your logs. Will keep you updated once there are some findings.

@ghost

ghost commented Aug 3, 2020

I am going to close this issue as stale. Please feel free to re-open it if you are still experiencing issues. Thank you!

@ghost ghost closed this as completed Aug 3, 2020