BlockBasedTableReader: automatically adjust tail prefetch size #4156

Closed
wants to merge 3 commits

Conversation


@siying commented Jul 19, 2018

Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of SST files. However, this wastes reads for some use cases while being insufficient for others.
Introduce a way to adjust the prefetch size by tracking the sizes of the 32 most recent tail reads and picking a value for which the wasted read is less than 10%.

Test Plan: Add some unit tests for functional correctness. Run strace against db_bench to verify it works end to end.
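For illustration only, here is a minimal sketch of what the tracking side could look like, assuming a mutex-guarded ring of 32 entries; the class and method names (`RecentTailSizes`, `Record`, `SortedSnapshot`) are hypothetical, not RocksDB's actual API:

```
#include <algorithm>
#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: remember the effective tail read size of the 32
// most recent table opens in a fixed-size ring guarded by a mutex.
class RecentTailSizes {
 public:
  static constexpr size_t kNumTracked = 32;

  void Record(size_t effective_size) {
    std::lock_guard<std::mutex> l(mu_);
    records_[next_ % kNumTracked] = effective_size;
    ++next_;
  }

  // Snapshot of the tracked sizes, sorted ascending, for the selection
  // pass that picks a size whose wasted read stays under 10%.
  std::vector<size_t> SortedSnapshot() const {
    std::lock_guard<std::mutex> l(mu_);
    size_t n = std::min(next_, kNumTracked);
    std::vector<size_t> out(records_.begin(), records_.begin() + n);
    std::sort(out.begin(), out.end());
    return out;
  }

 private:
  mutable std::mutex mu_;
  std::array<size_t, kNumTracked> records_{};
  size_t next_ = 0;
};
```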

@facebook-github-bot left a comment

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@siying has updated the pull request.

@facebook-github-bot left a comment

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@sagar0 left a comment

lgtm.

db/db_test2.cc Outdated
called = false;

// Parallel loading SST files
options.max_file_opening_threads = 1;
Contributor:

I think you meant to set this value to something greater than 1.

Contributor (Author):

I should fix the comment. I meant to make it 1. With more than 1, I can't check the first and second reads, since there is a race condition between fetching the value and calling the sync point callback.

Contributor (Author):

Oh, you are right. I meant to make it more than 1!
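For readers outside this thread: RocksDB tests coordinate threads through SyncPoint callbacks, and with more than one file-opening thread two table opens can fire the same callback concurrently. A rough sketch of the pattern, with a hypothetical sync point name (the actual test's name and checked state may differ):

```
#include <atomic>

#include "util/sync_point.h"  // RocksDB in-tree test utility (2018-era path)

// Sketch of the racy setup discussed above; the sync point name is
// hypothetical, not the one used by the actual test.
void SetUpTailPrefetchSyncPoint(std::atomic<bool>* called) {
  rocksdb::SyncPoint::GetInstance()->SetCallBack(
      "BlockBasedTable::Open:TailPrefetch",
      [called](void* /*arg*/) {
        // With max_file_opening_threads > 1, two opens can reach this
        // callback at the same time, so asserting "first open saw X,
        // second open saw Y" here races with the other thread.
        called->store(true);
      });
  rocksdb::SyncPoint::GetInstance()->EnableProcessing();
}
```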

max_qualified_size = sorted[i];
}
prev_size = sorted[i];
}
Contributor:

Maybe it's just me, but this for loop took me a while to understand.

My thought process was: for read, the value is multiplied by the array size, but for wasted, you only multiply by the current index ... why not multiply by the array size there too, as with read? Eventually I realized that the wasted space for all subsequent reads should be counted as 0, and hence the algorithm works. (You did mention this in the comments, but somehow it wasn't immediately obvious even after reading them.)
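To restate that in code, here is a hedged sketch of the selection pass (illustrative, not the exact RocksDB implementation). For a candidate size `sorted[i]`, all `sorted.size()` opens would read `sorted[i]` bytes, so `read = sorted[i] * sorted.size()`; only the `i` smaller entries waste anything, since every entry at or above `sorted[i]` consumes the whole prefetch:

```
#include <cstddef>
#include <vector>

// `sorted` holds recent effective tail sizes in ascending order.
// Returns the largest candidate whose wasted read stays under 10%.
size_t PickPrefetchSize(const std::vector<size_t>& sorted) {
  if (sorted.empty()) {
    return 0;
  }
  size_t max_qualified_size = sorted[0];
  size_t prev_size = sorted[0];
  size_t wasted = 0;
  for (size_t i = 1; i < sorted.size(); i++) {
    // If we prefetched sorted[i] everywhere, every open reads sorted[i].
    size_t read = sorted[i] * sorted.size();
    // Only the i entries smaller than sorted[i] waste bytes; reads at
    // index i and beyond use the whole prefetch, so their waste is 0.
    wasted += (sorted[i] - prev_size) * i;
    if (wasted * 10 < read) {  // waste below 10% of total read
      max_qualified_size = sorted[i];
    }
    prev_size = sorted[i];
  }
  return max_qualified_size;
}
```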

@@ -64,6 +80,7 @@ class BlockBasedTableFactory : public TableFactory {

private:
BlockBasedTableOptions table_options_;
mutable TailPrefetchStats tail_prefetch_stats_;
Contributor:

I think it might be useful to export these stats and see a distribution (for later; not immediately needed).

private:
AlignedBuffer buffer_;
uint64_t buffer_offset_;
RandomAccessFileReader* file_reader_;
size_t readahead_size_;
size_t max_readahead_size_;
size_t min_offset_read_;
bool enable_;
Contributor:

You might want to add a comment here about what enable_ means.
Reason: until now, just having a FilePrefetchBuffer instance meant that it was valid. But with this change, enable_ should be set to true for certain functionality to be enabled. I presume this was added to narrow the gap between the code paths for direct I/O and buffered I/O.
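To make the distinction concrete, a minimal sketch (not RocksDB's actual class) of what the `enable_` flag buys: the buffer object can always be constructed, but when `enable_` is false it behaves as if nothing were buffered, which lets the direct-I/O and buffered-I/O paths share one call site:

```
#include <cstddef>
#include <cstdint>

// Illustrative sketch only; member names loosely mirror the snippet above.
class PrefetchBufferSketch {
 public:
  explicit PrefetchBufferSketch(bool enable) : enable_(enable) {}

  // Returns true and sets *result only when prefetching is enabled and
  // the requested range is already buffered.
  bool TryReadFromCache(uint64_t offset, size_t n, const char** result) const {
    if (!enable_) {
      return false;  // disabled: caller falls through to a normal read
    }
    if (offset >= buffer_offset_ &&
        offset + n <= buffer_offset_ + buffer_len_) {
      *result = buffer_ + (offset - buffer_offset_);
      return true;
    }
    return false;
  }

 private:
  bool enable_;
  const char* buffer_ = nullptr;
  uint64_t buffer_offset_ = 0;
  size_t buffer_len_ = 0;
};
```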

@facebook-github-bot

@siying has updated the pull request.

@siying force-pushed the dynamic_prefetch_end2 branch from 3fa735d to c73c1e5 on July 20, 2018 00:29 (commits: "fix", "Fix a bug", "Add comments and fix the test")
@facebook-github-bot

@siying has updated the pull request.

@sagar0 left a comment

great, thanks!

@facebook-github-bot

@siying has updated the pull request.

@facebook-github-bot left a comment

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@siying has updated the pull request.

@facebook-github-bot left a comment

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr left a comment

lgtm!

rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
BlockBasedTableReader: automatically adjust tail prefetch size (facebook#4156)

Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of SST files. However, this wastes reads for some use cases while being insufficient for others.
Introduce a way to adjust the prefetch size by tracking the sizes of the 32 most recent tail reads and picking a value for which the wasted read is less than 10%.
Pull Request resolved: facebook#4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
facebook-github-bot pushed a commit that referenced this pull request May 8, 2023
Summary:
**Context:**
We prefetch the tail part of an SST file (i.e., the blocks after the data blocks, up to the end of the file) during each SST file open, in the hope of prefetching everything needed for later reads at once: footer, meta index, filter/index, etc. The existing approach estimates the tail size to prefetch through the `TailPrefetchStats` heuristics introduced in #4156, which has caused small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1, under the same BlockBasedTableFactory object, can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decided to record the exact tail size and use it directly to prefetch the tail of the SST instead of relying on heuristics.

**Summary:**
- Obtain and record in the manifest the tail size in `BlockBasedTableBuilder::Finish()`
   - For backward compatibility, we fall back to TailPrefetchStats, and last to a simple heuristic that the tail size is a linear portion of the file size - see the PR conversation for more, and the sketch after this list.
- Make `tail_start_offset` part of the table properties and deduce the tail size to record in the manifest for external files (e.g., file ingestion, import CF) and db repair (with no access to the manifest).
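A hedged sketch of that fallback order (the function and parameter names are illustrative, as is the linear-portion constant; the real code lives behind RocksDB's table-reader internals):

```
#include <cstdint>

// recorded_tail_size: exact size from the manifest, 0 if absent.
// stats_suggestion:   TailPrefetchStats heuristic, 0 if no data yet.
uint64_t DecideTailPrefetchSize(uint64_t recorded_tail_size,
                                uint64_t stats_suggestion,
                                uint64_t file_size) {
  if (recorded_tail_size > 0) {
    return recorded_tail_size;  // preferred: recorded at build time
  }
  if (stats_suggestion > 0) {
    return stats_suggestion;  // legacy heuristic for pre-PR files
  }
  // Last resort: assume the tail is a small linear portion of the file.
  return file_size / 100;  // illustrative fraction
}
```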

Pull Request resolved: #11406

Test Plan:
1. New UT
2. db bench
Note: db_bench on /tmp/, where direct read is supported, is too slow to finish, and the default pinning setting in db_bench is not helpful for profiling the number of SST reads per Get. Therefore I hacked the following to obtain the comparison below.
```
diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);

   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+      PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (use TailPrefetchStats for tail size + use a separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, this PR increases the prefetch tail hit count and decreases the SST read count.

3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.

Reviewed By: pdillinger

Differential Revision: D45413346

Pulled By: hx235

fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064