Devirtualization And Key Prefix Cache Benchmark
- Devirtualization: eliminate virtual function calls; in our optimization, mainly the Comparator virtual function calls
- Prefix cache: extract the fixed-length (8-byte) prefix of each Key into a separate array and binary-search that array as plain integers (see the sketch below)
For more information, please refer to Devirtualization And Key Prefix Cache Principle
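A minimal sketch of the key prefix cache idea, for illustration only (this is not the actual ToplingDB code; the names BigEndianPrefix and FindFileByPrefix are made up here): the first 8 bytes of each file's boundary key are decoded as a big-endian uint64_t into a flat array, so the binary search mostly compares plain integers and makes no virtual Compare() calls.

```cpp
// Illustration of the key prefix cache idea (not the actual ToplingDB code).
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

inline uint64_t BigEndianPrefix(const std::string& key) {
  unsigned char buf[8] = {0};
  std::memcpy(buf, key.data(), std::min<size_t>(8, key.size()));
  uint64_t v = 0;
  for (int i = 0; i < 8; i++) v = (v << 8) | buf[i];  // big endian -> host order
  return v;
}

// prefixes[i] holds the 8-byte prefix of the i-th file's largest key,
// stored contiguously so the search is cache friendly.
size_t FindFileByPrefix(const std::vector<uint64_t>& prefixes, const std::string& key) {
  const uint64_t p = BigEndianPrefix(key);
  size_t lo = 0, hi = prefixes.size();
  while (lo < hi) {                          // plain integer lower_bound
    size_t mid = lo + (hi - lo) / 2;
    if (prefixes[mid] < p) lo = mid + 1; else hi = mid;
  }
  // If prefixes[lo] == p, adjacent files can share the prefix and the full
  // keys must still be compared; that fallback is omitted in this sketch.
  return lo;
}
```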
The test machine has dual E5-2682 v4 CPUs (32 cores / 64 threads in total) and 256 GB of DDR4-2400 memory. Newer and better machines will perform better.
Only the key parameters are listed here, together with the reason each is set this way. For simplicity, Distributed Compaction was not used in this test.
parameter name | parameter value | effect |
---|---|---|
num | 100000000 | Large enough that the DB's metadata cannot all fit into the CPU cache |
key_size | 8 | Makes the keys written by db_bench consecutive big-endian integers (see the sketch after this table) |
value_size | 20 | Kept small to minimize the performance impact of memory allocation and memcpy |
json | db_bench_enterprise.yaml | Uses ToplingZipTable; with consecutive big-endian integer keys it selects the UintIndex_AllOne index, the fastest and smallest index in ToplingDB |
json | db_bench_community.yaml | Uses BlockBasedTable with a sufficiently large BlockCache; much slower than ToplingZipTable |
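For reference, "consecutive big-endian integer keys" with key_size=8 simply means the i-th key is the 8-byte big-endian encoding of i. The sketch below is illustrative only (MakeKey is a made-up name, not db_bench's actual code):

```cpp
// The i-th key is the 8-byte big-endian encoding of i, so keys sort in
// numeric order under plain bytewise comparison. Illustration only.
#include <cstdint>
#include <string>

std::string MakeKey(uint64_t i) {
  std::string key(8, '\0');
  for (int b = 0; b < 8; b++)
    key[b] = static_cast<char>((i >> (56 - 8 * b)) & 0xff);  // MSB first
  return key;
}
// MakeKey(0) < MakeKey(1) < ... holds bytewise; it is this dense, gap-free
// sequence that lets ToplingZipTable pick the UintIndex_AllOne index.
```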
parameter name | parameter value | effect |
---|---|---|
target_file_size_base | 512K | Set very small to generate a large number of files, so the effect of the key prefix cache shows up |
max_bytes_for_level_base | 4M | Ensures each level has a sufficient number of files |
max_bytes_for_level_multiplier | 4 | Ensures each level has a sufficient number of files |
Statistics::stats_level | kDisableAll | Disable performance measurement |
parameter name | parameter value | effect |
---|---|---|
acceptCompressionRatio | 0.8 | When testing a larger value_size, this can be set to a small value to make ToplingZipTable skip Value compression |
minDictZipValueSize | 30 | Larger than the value_size=20 used in this test, so ToplingZipTable skips Value compression |
minPreadLen | -1 | Values are always read via mmap |
enableStatistics | false | Disable statistics to avoid unnecessary measurement overhead |
The data in each level is not compressed and there is no BloomFilter: given the key distribution in this test, the metadata of an SST file alone is enough to decide whether a key is in that file, so a BloomFilter adds nothing, as sketched below.
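A small sketch of that reasoning, with made-up field and function names: because the keys are dense consecutive integers, a key that falls inside an SST's [smallest, largest] range is guaranteed to be present.

```cpp
// With dense consecutive integer keys there are no holes, so the
// [smallest, largest] range stored in the SST metadata decides membership
// exactly. Field and function names here are illustrative only.
#include <cstdint>

struct SstMeta { uint64_t smallest_key, largest_key; };  // hypothetical

bool KeyIsInFile(const SstMeta& f, uint64_t key) {
  // Inside the range implies present, because every integer in between exists.
  return f.smallest_key <= key && key <= f.largest_key;
}
```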
To measure the effect of our optimization, a conditional compilation macro, TOPLINGDB_NO_OPT_FindFileInRange, was added to switch the optimization off for comparison.
int FindFileInRange(const InternalKeyComparator& icmp,
const LevelFilesBrief& file_level, const Slice& key,
uint32_t left, uint32_t right) {
+#ifdef TOPLINGDB_NO_OPT_FindFileInRange
+ #pragma message "TOPLINGDB_NO_OPT_FindFileInRange is defined, intended for benchmark comparing"
+ // here is upstream rocksdb code
auto cmp = [&](const FdWithKeyRange& f, const Slice& k) -> bool {
return icmp.InternalKeyComparator::Compare(f.largest_key, k) < 0;
};
const auto& b = file_level.files;
return static_cast<int>(std::lower_bound(b + left, b + right, key, cmp) - b);
+#else // ToplingDB Devirtualization and Key Prefix Cache optimization
+ if (icmp.IsForwardBytewise()) {
+ BytewiseCompareInternalKey cmp;
+ return (int)FindFileInRangeTmpl(cmp, file_level, key, left, right);
+ }
+ else if (icmp.IsReverseBytewise()) {
+ RevBytewiseCompareInternalKey cmp;
+ return (int)FindFileInRangeTmpl(cmp, file_level, key, left, right);
+ }
+ else {
+ FallbackVirtCmp cmp{&icmp};
+ return (int)FindFileInRangeTmpl(cmp, file_level, key, left, right);
+ }
+#endif
}
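FindFileInRangeTmpl and the concrete comparator types are not shown above. Roughly, the pattern is the one sketched below; this is a simplified, self-contained illustration rather than ToplingDB's actual implementation (which searches FdWithKeyRange entries and also consults the prefix cache). The point is that the comparator is a concrete functor passed as a template parameter, so its call is inlined and no virtual Comparator::Compare is involved.

```cpp
// Self-contained sketch of the devirtualization pattern (not ToplingDB's
// actual FindFileInRangeTmpl): the comparator is a concrete functor type, so
// the compiler inlines the comparison; a virtual Comparator* would pay an
// indirect call on every probe instead.
#include <string_view>
#include <vector>

struct BytewiseLess {  // stands in for BytewiseCompareInternalKey
  bool operator()(std::string_view a, std::string_view b) const { return a < b; }
};

template <class Cmp>
size_t LowerBoundFileTmpl(const std::vector<std::string_view>& largest_keys,
                          std::string_view key, size_t lo, size_t hi, Cmp cmp) {
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (cmp(largest_keys[mid], key)) lo = mid + 1;  // inlined, no virtual call
    else hi = mid;
  }
  return lo;  // index of the first file whose largest key is >= key
}
```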
Before each compilation, db/version_set.cc (where FindFileInRange lives) is touched so that it is definitely recompiled and the intended variant of the code is built.
touch db/version_set.cc
make db_bench -j`nproc` OPTIMIZE_LEVEL='-O3' DEBUG_LEVEL=0 \
EXTRA_CXXFLAGS=-DTOPLINGDB_NO_OPT_FindFileInRange
touch db/version_set.cc
make db_bench -j`nproc` OPTIMIZE_LEVEL='-O3' DEBUG_LEVEL=0
First, only fill the data and execute flush, so that every level of the LSM contains data. Whether the FindFileInRange optimization is enabled has no effect on data filling.
$ ./db_bench -json=sideplugin/rockside/sample-conf/db_bench_enterprise.yaml \
-key_size=8 -value_size=20 -batch_size=10 -num=100000000 \
-disable_wal=true -benchmarks=fillseq,flush
Set seed to 1672971774512063 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB: version 7.10.0
Date: Fri Jan 6 10:22:54 2023
CPU: 64 * Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
CPUCache: 40960 KB
Keys: 8 bytes each (+ 0 bytes user-defined timestamp)
Values: 20 bytes each (10 bytes after compression)
Entries: 100000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2670.3 MB (estimated)
FileSize: 1716.6 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: CSPPMemTabFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/db_bench_enterprise]
fillseq: 0.452 micros/op 2211711 ops/sec 45.214 seconds 100000000 operations; 59.1 MB/s
flush memtable
** Compaction Stats [default] **
Level Files Size Score omitted...
----------------------------------- omitted...
L0 1/0 1.64 MB 0.4 omitted...
L1 17/0 3.14 MB 0.8 omitted...
L2 76/0 15.26 MB 1.0 omitted...
L3 328/0 63.70 MB 1.0 omitted...
L4 1336/0 255.22 MB 1.0 omitted...
L5 5392/0 1023.65 MB 1.0 omitted...
L6 2952/0 558.28 MB 0.0 omitted...
Sum 10102/0 1.88 GB 0.0 omitted...
You can view the information of a single SST through WebView, and you can see that ToplingZipTable does use the UintIndex_AllOne index.
In MyTopling (MySQL), the most common primary key, an auto_increment id, uses the UintIndex family of indexes; if the ids contain no holes, the index is UintIndex_AllOne. Setting key_size=8 in db_bench deliberately selects this index.
$ ./db_bench -json sideplugin/rockside/sample-conf/db_bench_enterprise.yaml \
-num=100000000 -benchmarks=readrandom -use_existing_db=true
The db_bench command performs a separate readrandom test, and use_existing_db=true means to use the previously populated data.
@@ in the following flame graphs represents an anonymous namespace.
# Optimization disabled (some spaces were deleted and lines wrapped in the results below)
------------------------------------------------
DB path: [/dev/shm/db_bench_enterprise]
readrandom: 1.544 micros/op 647552 ops/sec 154.428 seconds 100000000 operations;
17.3 MB/s (100000000 of 100000000 found)
flame graph (View individually, down to each function)
# Optimization enabled (some spaces were deleted in the results below)
------------------------------------------------
DB path: [/dev/shm/db_bench_enterprise]
readrandom: 1.009 micros/op 991406 ops/sec 100.867 seconds 100000000 operations;
26.5 MB/s (100000000 of 100000000 found)
flame graph (View individually, down to each function)
Overall, the optimized build's time is only about 65% of the original's (100.9 s vs. 154.4 s), and from the flame graphs we can work out the performance improvement of FindFileInRange itself.
Since the time spent outside FindFileInRange is an invariant across the two runs, we divide the time spent in FindFileInRange by the time spent in everything else to get its relative cost; these relative costs can be compared directly. The result is that FindFileInRange itself is 4.25 times faster. The calculation is roughly a fifth-grade arithmetic word problem.
Optimization | time share | relative cost | multiple |
---|---|---|---|
disable | 42.68% | 42.68/(100-42.68) = 0.7446 | 4.25 = 0.7446/0.1751 |
enable | 14.90% | 14.90/(100-14.90) = 0.1751 | 1.00 |
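For clarity, the arithmetic behind the table can be written out as a tiny program; the two percentages are the FindFileInRange shares read off the flame graphs above, everything else is just the formula.

```cpp
// Spells out the "relative cost" arithmetic used in the table above.
#include <cstdio>

int main() {
  auto rel = [](double pct) { return pct / (100.0 - pct); };  // FindFileInRange vs. everything else
  double disabled = rel(42.68);  // = 0.7446
  double enabled  = rel(14.90);  // = 0.1751
  std::printf("disabled=%.4f enabled=%.4f improvement=%.2fx\n",
              disabled, enabled, disabled / enabled);  // ~4.25x
  return 0;
}
```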
The parameters are exactly the same as for db_bench_enterprise, except that db_bench_community.yaml is used.
$ ./db_bench -json=sideplugin/rockside/sample-conf/db_bench_community.yaml \
-key_size=8 -value_size=20 -batch_size=10 -num=100000000 \
-disable_wal=true -benchmarks=fillseq,flush
Set seed to 1672996553785931 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB: version 7.10.0
Date: Fri Jan 6 17:15:53 2023
CPU: 64 * Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
CPUCache: 40960 KB
Keys: 8 bytes each (+ 0 bytes user-defined timestamp)
Values: 20 bytes each (10 bytes after compression)
Entries: 100000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2670.3 MB (estimated)
FileSize: 1716.6 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: CSPPMemTabFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/db_bench_community]
fillseq : 0.449 micros/op 2225012 ops/sec 44.944 seconds 100000000 operations; 59.4 MB/s
flush memtable
Because the same MemTable (CSPPMemTab) is used, write performance is essentially the same.
** Compaction Stats [default] **
Level Files Size Score ...
----------------------------------- ...
L0 1/0 1.16 MB 0.3 ...
L1 1/0 3.22 MB 0.8 ...
L2 4/0 12.89 MB 0.8 ...
L3 19/0 61.16 MB 1.0 ...
L4 79/0 254.35 MB 1.0 ...
L5 318/0 1023.89 MB 1.0 ...
L6 555/0 1.74 GB 0.0 ...
Sum 977/0 3.07 GB 0.0 ...
The number of files is smaller than with ToplingZipTable, probably because TableBuilder::EstimateFileSize() is implemented differently: ToplingZipTable returns the sum of the key/value sizes passed to Add, while BlockBasedTable has its own calculation.
With fewer files, FindFileInRange itself has less work to do, so BlockBasedTable gains a small advantage here.
Although neither is compressed, BlockBasedTable's files are larger. This is because ToplingZipTable's index here is UintIndex_AllOne, whose space usage is O(1): the index size is independent of the data, so it effectively occupies no space.
For BlockBasedTableBuilder::EstimateFileSize(), ToplingDB has an option, use_raw_size_as_estimated_file_size, defined as a BlockBasedTableFactory option in json/yaml. When it is true, the estimate is computed the same way as in ToplingZipTable; in this test we leave it at the default (false) so that BlockBasedTable behaves the same as upstream RocksDB.
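A hedged sketch of the raw-size estimate described above (illustration only; RawSizeEstimator is a made-up name, not the actual ToplingZipTable or ToplingDB source): the builder simply accumulates the raw bytes it has been given in Add and reports that total, whereas BlockBasedTable derives its estimate from the blocks it has written.

```cpp
// Sketch of the "sum of key/value sizes from Add" file size estimate.
#include <cstdint>
#include <string_view>

class RawSizeEstimator {  // hypothetical name
  uint64_t raw_bytes_ = 0;
 public:
  void Add(std::string_view key, std::string_view value) {
    raw_bytes_ += key.size() + value.size();  // raw bytes, no block overhead
  }
  uint64_t EstimateFileSize() const { return raw_bytes_; }
};
```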
ToplingDB also implements the function of viewing SST meta information for BlockBasedTable in WebView.
$ ./db_bench -json sideplugin/rockside/sample-conf/db_bench_community.yaml \
-num=100000000 -benchmarks=readrandom -use_existing_db=true
The db_bench command performs a separate readrandom test, and use_existing_db=true means to use the previously populated data.
@@ in the following flame graphs represents an anonymous namespace.
# Optimization disabled (some spaces were deleted and lines wrapped in the results below)
------------------------------------------------
DB path: [/dev/shm/db_bench_community]
readrandom: 4.182 micros/op 239138 ops/sec 418.168 seconds 100000000 operations;
6.4 MB/s (100000000 of 100000000 found)
flame graph (View individually, down to each function)
# Optimization enabled (some spaces were deleted in the results below)
------------------------------------------------
DB path: [/dev/shm/db_bench_community]
readrandom: 3.644 micros/op 274432 ops/sec 364.389 seconds 100000000 operations;
7.3 MB/s (100000000 of 100000000 found)
flame graph (View individually, down to each function)
- Neither BlockBasedTable nor ToplingZipTable actually uses compression here. BlockBasedTable needs a BlockCache, and the file system PageCache is also involved when reading files (unless direct I/O is used)
- ToplingZipTable needs no BlockCache, only the file system PageCache
Even though BlockBasedTable is configured with an 8 GB BlockCache, almost three times the total data volume, it is still much slower than ToplingZipTable.
Below we use the same method to calculate the performance difference between the optimized version of FindFileInRange and the original version:
Optimization | time share | relative cost | multiple |
---|---|---|---|
Disable | 11.65% | 11.65/(100-11.65) = 0.1319 | 4.19 = 0.1319/0.0315 |
Enable | 3.05% | 3.05/(100-3.05) = 0.0315 | 1.00 |
Although db_bench_community uses BlockBasedTable and is much slower than ToplingZipTable, the same method gives a 4.19x performance difference between disabling and enabling the optimization; the gap from the 4.25x calculated for db_bench_enterprise can be regarded as measurement error.
Using the same calculation method, besides the improvement of FindFileInRange we can also work out the performance difference between ToplingZipTable and BlockBasedTable.
The time spent outside TableReader::Get is taken as the invariant. Because BlockBasedTable has fewer files, that part is smaller than in the ToplingZipTable run, so ToplingZipTable gains a slight advantage in this comparison.
BlockBasedTableReader::Get appears as 4 separate parts in the flame graph (probably a quirk of the flame graph tool; they should have been counted together); their total time share is 37.53 + 37.42 + 2.09 + 2.42 = 79.46%.
SST | time share | relative cost | multiple |
---|---|---|---|
BlockBased | 79.46% | 79.46/(100-79.46) = 3.8685 | 7.19 = 3.8685/0.5378 |
ToplingZip | 34.97% | 34.97/(100-34.97) = 0.5378 | 1.00 |
Calculating from the flame graphs of the build without the FindFileInRange optimization (34.08 + 36.86 + 2.09 = 73.03%):
SST | time share | relative cost | multiple |
---|---|---|---|
BlockBased | 73.03% | 73.03/(100-73.03) = 2.7078 | 8.75 = 2.7078/0.3094 |
ToplingZip | 23.63% | 23.63/(100-23.63) = 0.3094 | 1.00 |
By this calculation, the performance of ToplingZipTable (in TableReader::Get) is 7.19x and 8.75x that of BlockBasedTable in the two runs. The discrepancy between these two figures is fairly large and hard to explain.
Note the following preconditions:
Options | ToplingZipTable | BlockBasedTable |
---|---|---|
Compression | uncompressed | uncompressed |
Cache | all reads hit the PageCache | all reads hit the BlockCache |
BloomFilter | never needed | the SSTs produced by this test do not need one |
Index | the fastest index, UintIndex_AllOne, is selected | the general-purpose index |
In other words, the test conditions are optimal for both sides. When ToplingZipTable has to use its general-purpose index, NestLoudsTrie, key search is slower, and when compression is enabled, reading Values is slower.