Commit 1a2aa4f

Merge remote-tracking branch 'origin/main' into preserve_preclude_mutable

pdillinger committed Nov 15, 2024
2 parents 79fc668 + ef119c9 commit 1a2aa4f
Showing 58 changed files with 2,148 additions and 280 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -779,6 +779,7 @@ set(SOURCES
memtable/hash_skiplist_rep.cc
memtable/skiplistrep.cc
memtable/vectorrep.cc
+ memtable/wbwi_memtable.cc
memtable/write_buffer_manager.cc
monitoring/histogram.cc
monitoring/histogram_windowing.cc
22 changes: 11 additions & 11 deletions HISTORY.md
@@ -45,7 +45,7 @@ by them. Prior to this change they would be orphaned until the DB is re-opened.

## 9.6.0 (08/19/2024)
### New Features
- * *Best efforts recovery supports recovering to incomplete Version with a clean seqno cut that presents a valid point in time view from the user's perspective, if versioning history doesn't include atomic flush.
+ * Best efforts recovery supports recovering to incomplete Version with a clean seqno cut that presents a valid point in time view from the user's perspective, if versioning history doesn't include atomic flush.
* New option `BlockBasedTableOptions::decouple_partitioned_filters` should improve efficiency in serving read queries because filter and index partitions can consistently target the configured `metadata_block_size`. This option is currently opt-in.
* Introduce a new mutable CF option `paranoid_memory_checks`. It enables additional validation on data integrity during reads/scanning. Currently, skip list based memtable will validate key ordering during look up and scans.

@@ -57,11 +57,11 @@ by them. Prior to this change they would be orphaned until the DB is re-opened.
* There may be less intra-L0 compaction triggered by total L0 size being too small. We now use compensated file size (tombstones are assigned some value size) when calculating L0 size and reduce the threshold for L0 size limit. This is to avoid accumulating too much data/tombstones in L0.

### Bug Fixes
- * *Make DestroyDB supports slow deletion when it's configured in `SstFileManager`. The slow deletion is subject to the configured `rate_bytes_per_sec`, but not subject to the `max_trash_db_ratio`.
+ * Make DestroyDB supports slow deletion when it's configured in `SstFileManager`. The slow deletion is subject to the configured `rate_bytes_per_sec`, but not subject to the `max_trash_db_ratio`.
* Fixed a bug where we set unprep_seqs_ even when WriteImpl() fails. This was caught by stress test write fault injection in WriteImpl(). This may have incorrectly caused iteration creation failure for unvalidated writes or returned wrong result for WriteUnpreparedTxn::GetUnpreparedSequenceNumbers().
* Fixed a bug where successful write right after error recovery for last failed write finishes causes duplicate WAL entries
* Fixed a data race involving the background error status in `unordered_write` mode.
- * *Fix a bug where file snapshot functions like backup, checkpoint may attempt to copy a non-existing manifest file. #12882
+ * Fix a bug where file snapshot functions like backup, checkpoint may attempt to copy a non-existing manifest file. #12882
* Fix a bug where per kv checksum corruption may be ignored in MultiGet().
* Fix a race condition in pessimistic transactions that could allow multiple transactions with the same name to be registered simultaneously, resulting in a crash or other unpredictable behavior.

@@ -140,7 +140,7 @@ by them. Prior to this change they would be orphaned until the DB is re-opened.
* Added a new API `GetEntityFromBatchAndDB` to `WriteBatchWithIndex` that can be used for wide-column point lookups with read-your-own-writes consistency. Similarly to `GetFromBatchAndDB`, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
* [Experimental] Introduce two new cross-column-family iterators - CoalescingIterator and AttributeGroupIterator. The CoalescingIterator enables users to iterate over multiple column families and access their values and columns. During this iteration, if the same key exists in more than one column family, the keys in the later column family will overshadow the previous ones. The AttributeGroupIterator allows users to gather wide columns per Column Family and create attribute groups while iterating over keys across all CFs.
* Added a new API `MultiGetEntityFromBatchAndDB` to `WriteBatchWithIndex` that can be used for batched wide-column point lookups with read-your-own-writes consistency. Similarly to `MultiGetFromBatchAndDB`, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
- * *Adds a `SstFileReader::NewTableIterator` API to support programmatically read a SST file as a raw table file.
+ * Adds a `SstFileReader::NewTableIterator` API to support programmatically read a SST file as a raw table file.
* Add an option to `WaitForCompactOptions` - `wait_for_purge` to make `WaitForCompact()` API wait for background purge to complete

### Public API Changes
@@ -169,24 +169,24 @@ the whole DB to be dropped right after migration if the migrated data is larger
* Fixed a bug where wrong padded bytes are used to generate file checksum and `DataVerificationInfo::checksum` upon file creation
* Correctly implemented the move semantics of `PinnableWideColumns`.
* Fixed a bug when the recycle_log_file_num in DBOptions is changed from 0 to non-zero when a DB is reopened. On a subsequent reopen, if a log file created when recycle_log_file_num==0 was reused previously, is alive and is empty, we could end up inserting stale WAL records into the memtable.
- * *Fix a bug where obsolete files' deletion during DB::Open are not rate limited with `SstFilemManager`'s slow deletion feature even if it's configured.
+ * Fix a bug where obsolete files' deletion during DB::Open are not rate limited with `SstFilemManager`'s slow deletion feature even if it's configured.

## 9.1.0 (03/22/2024)
### New Features
* Added an option, `GetMergeOperandsOptions::continue_cb`, to give users the ability to end `GetMergeOperands()`'s lookup process before all merge operands were found.
- * *Add sanity checks for ingesting external files that currently checks if the user key comparator used to create the file is compatible with the column family's user key comparator.
+ * Add sanity checks for ingesting external files that currently checks if the user key comparator used to create the file is compatible with the column family's user key comparator.
*Support ingesting external files for column family that has user-defined timestamps in memtable only enabled.
* On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read.
* Some enhancements and fixes to experimental Temperature handling features, including new `default_write_temperature` CF option and opening an `SstFileWriter` with a temperature.
* `WriteBatchWithIndex` now supports wide-column point lookups via the `GetEntityFromBatch` API. See the API comments for more details.
- * *Implement experimental features: API `Iterator::GetProperty("rocksdb.iterator.write-time")` to allow users to get data's approximate write unix time and write data with a specific write time via `WriteBatch::TimedPut` API.
+ * Implement experimental features: API `Iterator::GetProperty("rocksdb.iterator.write-time")` to allow users to get data's approximate write unix time and write data with a specific write time via `WriteBatch::TimedPut` API.

### Public API Changes
* Best-effort recovery (`best_efforts_recovery == true`) may now be used together with atomic flush (`atomic_flush == true`). The all-or-nothing recovery guarantee for atomically flushed data will be upheld.
* Remove deprecated option `bottommost_temperature`, already replaced by `last_level_temperature`
* Added new PerfContext counters for block cache bytes read - block_cache_index_read_byte, block_cache_filter_read_byte, block_cache_compression_dict_read_byte, and block_cache_read_byte.
* Deprecate experimental Remote Compaction APIs - StartV2() and WaitForCompleteV2() and introduce Schedule() and Wait(). The new APIs essentially does the same thing as the old APIs. They allow taking externally generated unique id to wait for remote compaction to complete.
- * *For API `WriteCommittedTransaction::GetForUpdate`, if the column family enables user-defined timestamp, it was mandated that argument `do_validate` cannot be false, and UDT based validation has to be done with a user set read timestamp. It's updated to make the UDT based validation optional if user sets `do_validate` to false and does not set a read timestamp. With this, `GetForUpdate` skips UDT based validation and it's users' responsibility to enforce the UDT invariant. SO DO NOT skip this UDT-based validation if users do not have ways to enforce the UDT invariant. Ways to enforce the invariant on the users side include manage a monotonically increasing timestamp, commit transactions in a single thread etc.
+ * For API `WriteCommittedTransaction::GetForUpdate`, if the column family enables user-defined timestamp, it was mandated that argument `do_validate` cannot be false, and UDT based validation has to be done with a user set read timestamp. It's updated to make the UDT based validation optional if user sets `do_validate` to false and does not set a read timestamp. With this, `GetForUpdate` skips UDT based validation and it's users' responsibility to enforce the UDT invariant. SO DO NOT skip this UDT-based validation if users do not have ways to enforce the UDT invariant. Ways to enforce the invariant on the users side include manage a monotonically increasing timestamp, commit transactions in a single thread etc.
* Defined a new PerfLevel `kEnableWait` to measure time spent by user threads blocked in RocksDB other than mutex, such as a write thread waiting to be added to a write group, a write thread delayed or stalled etc.
* `RateLimiter`'s API no longer requires the burst size to be the refill size. Users of `NewGenericRateLimiter()` can now provide burst size in `single_burst_bytes`. Implementors of `RateLimiter::SetSingleBurstBytes()` need to adapt their implementations to match the changed API doc.
* Add `write_memtable_time` to the newly introduced PerfLevel `kEnableWait`.
@@ -219,7 +219,7 @@ MultiGetBenchmarks.multiGetList10 no_column_family 10000 16 100 1024 thrpt 25 76
## 9.0.0 (02/16/2024)
### New Features
* Provide support for FSBuffer for point lookups. Also added support for scans and compactions that don't go through prefetching.
- * *Make `SstFileWriter` create SST files without persisting user defined timestamps when the `Option.persist_user_defined_timestamps` flag is set to false.
+ * Make `SstFileWriter` create SST files without persisting user defined timestamps when the `Option.persist_user_defined_timestamps` flag is set to false.
* Add support for user-defined timestamps in APIs `DeleteFilesInRanges` and `GetPropertiesOfTablesInRange`.
* Mark wal\_compression feature as production-ready. Currently only compatible with ZSTD compression.

@@ -229,9 +229,9 @@ MultiGetBenchmarks.multiGetList10 no_column_family 10000 16 100 1024 thrpt 25 76
* Exposed mode option to Rate Limiter via c api.
* Removed deprecated option `access_hint_on_compaction_start`
* Removed deprecated option `ColumnFamilyOptions::check_flush_compaction_key_order`
- * *Remove the default `WritableFile::GetFileSize` and `FSWritableFile::GetFileSize` implementation that returns 0 and make it pure virtual, so that subclasses are enforced to explicitly provide an implementation.
+ * Remove the default `WritableFile::GetFileSize` and `FSWritableFile::GetFileSize` implementation that returns 0 and make it pure virtual, so that subclasses are enforced to explicitly provide an implementation.
* Removed deprecated option `ColumnFamilyOptions::level_compaction_dynamic_file_size`
- * *Removed tickers with typos "rocksdb.error.handler.bg.errro.count", "rocksdb.error.handler.bg.io.errro.count", "rocksdb.error.handler.bg.retryable.io.errro.count".
+ * Removed tickers with typos "rocksdb.error.handler.bg.errro.count", "rocksdb.error.handler.bg.io.errro.count", "rocksdb.error.handler.bg.retryable.io.errro.count".
* Remove the force mode for `EnableFileDeletions` API because it is unsafe with no known legitimate use.
* Removed deprecated option `ColumnFamilyOptions::ignore_max_compaction_bytes_for_input`
* `sst_dump --command=check` now compares the number of records in a table with `num_entries` in table property, and reports corruption if there is a mismatch. API `SstFileDumper::ReadSequential()` is updated to optionally do this verification. (#12322)
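To make a couple of the changelog entries above concrete, here is a hedged sketch exercising the 9.6.0 `BlockBasedTableOptions::decouple_partitioned_filters` option and the `WaitForCompactOptions::wait_for_purge` flag. Option and field names come from the entries themselves; the database path and key data are invented for illustration.

```cpp
// Sketch only — illustrative public-API usage, not an authoritative recipe.
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::BlockBasedTableOptions table_options;
  // Partitioned filters require the two-level index and a filter policy.
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10.0));
  table_options.partition_filters = true;
  // New opt-in option: filter partitions no longer need to be cut in
  // lockstep with index partitions, so both can target metadata_block_size.
  table_options.decouple_partitioned_filters = true;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  assert(s.ok());

  // New option: also wait for background purge of obsolete files.
  rocksdb::WaitForCompactOptions wait_opts;
  wait_opts.wait_for_purge = true;
  s = db->WaitForCompact(wait_opts);
  assert(s.ok());

  delete db;
  return 0;
}
```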
1 change: 1 addition & 0 deletions TARGETS
@@ -149,6 +149,7 @@ cpp_library_wrapper(name="rocksdb_lib", srcs=[
"memtable/hash_skiplist_rep.cc",
"memtable/skiplistrep.cc",
"memtable/vectorrep.cc",
"memtable/wbwi_memtable.cc",
"memtable/write_buffer_manager.cc",
"monitoring/histogram.cc",
"monitoring/histogram_windowing.cc",
9 changes: 9 additions & 0 deletions db/builder.cc
@@ -460,9 +460,18 @@ Status BuildTable(
Status prepare =
WritableFileWriter::PrepareIOOptions(tboptions.write_options, opts);
if (prepare.ok()) {
+ // FIXME: track file for "slow" deletion, e.g. into the
+ // VersionSet::obsolete_files_ pipeline
Status ignored = fs->DeleteFile(fname, opts, dbg);
ignored.PermitUncheckedError();
}
+ // Ensure we don't leak table cache entries when throwing away output
+ // files. (The usual logic in PurgeObsoleteFiles is not applicable because
+ // this function deletes the obsolete file itself, while they should
+ // probably go into the VersionSet::obsolete_files_ pipeline.)
+ TableCache::ReleaseObsolete(table_cache->get_cache().get(),
+                             meta->fd.GetNumber(), nullptr /*handle*/,
+                             mutable_cf_options.uncache_aggressiveness);
}

assert(blob_file_additions || blob_file_paths.empty());
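The hunk above closes a leak: when `BuildTable` throws away its output file, the table cache may still hold a reader for that file number. A minimal sketch of the underlying idea follows — this is not RocksDB's actual `TableCache` internals, and the key encoding is invented for illustration.

```cpp
// Sketch only: cached table readers are keyed (roughly) by file number, so
// discarding an output file must also erase its cache entry; otherwise the
// reader stays resident until ordinary eviction.
#include <cstdint>
#include <cstring>
#include <string>

#include "rocksdb/cache.h"

// Hypothetical key scheme; RocksDB's real encoding differs.
static std::string MakeTableKey(uint64_t file_number) {
  char buf[sizeof(file_number)];
  std::memcpy(buf, &file_number, sizeof(file_number));
  return std::string(buf, sizeof(buf));
}

void EvictDiscardedOutput(rocksdb::Cache* cache, uint64_t file_number) {
  // Erase drops the cache's own reference; memory is reclaimed once no
  // outstanding handle pins the entry.
  cache->Erase(MakeTableKey(file_number));
}
```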
4 changes: 3 additions & 1 deletion db/compaction/compaction.cc
@@ -368,9 +368,10 @@ Compaction::Compaction(
}
#endif

- // setup input_levels_
+ // setup input_levels_ and filtered_input_levels_
{
input_levels_.resize(num_input_levels());
+ filtered_input_levels_.resize(num_input_levels());
if (earliest_snapshot_.has_value()) {
FilterInputsForCompactionIterator();
} else {
@@ -1085,6 +1086,7 @@ void Compaction::FilterInputsForCompactionIterator() {
ucmp->CompareWithoutTimestamp(rangedel_end_ukey,
file->largest.user_key()) > 0) {
non_start_level_input_files_filtered_.back().back() = true;
+ filtered_input_levels_[level].push_back(file);
} else {
non_start_level_input_files.back().push_back(file);
}
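For readers following the filtering logic, here is a self-contained sketch of the bookkeeping this file adds. The types are simplified stand-ins for `FileMetaData` and the user comparator, the begin-key check is folded in rather than handled by the surrounding loop as in the real code, and the snapshot-visibility precondition (`earliest_snapshot_`) that gates the real path is omitted.

```cpp
// Sketch: partition one non-start input level into files the compaction
// iterator must still read ("kept") and files wholly covered by the
// standalone range tombstone ("filtered"), mirroring the effect of
// FilterInputsForCompactionIterator on filtered_input_levels_.
#include <string>
#include <vector>

struct FileMeta {
  std::string smallest_ukey;
  std::string largest_ukey;
};

void PartitionLevel(const std::vector<FileMeta*>& level_files,
                    const std::string& rangedel_begin,  // inclusive
                    const std::string& rangedel_end,    // exclusive
                    std::vector<FileMeta*>* kept,
                    std::vector<FileMeta*>* filtered) {
  for (FileMeta* f : level_files) {
    // A file may be dropped from the iterator's inputs only if the
    // tombstone covers its entire user-key range.
    const bool covered = rangedel_begin <= f->smallest_ukey &&
                         f->largest_ukey < rangedel_end;
    (covered ? filtered : kept)->push_back(f);
  }
}
```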
17 changes: 15 additions & 2 deletions db/compaction/compaction.h
@@ -184,6 +184,16 @@ class Compaction {
return &input_levels_[compaction_input_level];
}

+ // Returns the filtered input files of the specified compaction input level.
+ // For now, only non start level is filtered.
+ const std::vector<FileMetaData*>& filtered_input_levels(
+     size_t compaction_input_level) const {
+   const std::vector<FileMetaData*>& filtered_input_level =
+       filtered_input_levels_[compaction_input_level];
+   assert(compaction_input_level != 0 || filtered_input_level.size() == 0);
+   return filtered_input_level;
+ }

// Maximum size of files to build during this compaction.
uint64_t max_output_file_size() const { return max_output_file_size_; }

@@ -545,10 +555,13 @@ class Compaction {

// Markers for which non start level input files are filtered out if
// applicable. Only applicable if earliest_snapshot_ is provided and input
- // start level has a standalone range deletion file.
+ // start level has a standalone range deletion file. Filtered files are
+ // tracked in `filtered_input_levels_`.
std::vector<std::vector<bool>> non_start_level_input_files_filtered_;

- // bool standalone_range_tombstones_used_for_filtering_inputs_;
+ // All files from inputs_ that are filtered.
+ std::vector<std::vector<FileMetaData*>> filtered_input_levels_;

const double score_; // score that was used to pick this compaction.

// Is this compaction creating a file in the bottom most level?
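As a usage note, the new accessor pairs naturally with the stats code later in this commit. An illustrative helper (internal-only, since `Compaction` is not public API, and not part of this commit) might look like:

```cpp
// Illustrative: total on-disk bytes of inputs the compaction skipped.
// The start level (index 0) is never filtered, which the accessor asserts,
// so iterating from level 0 is safe and simply sees an empty vector there.
uint64_t TotalFilteredBytes(const Compaction& c) {
  uint64_t total = 0;
  for (size_t lvl = 0; lvl < c.num_input_levels(); ++lvl) {
    for (const FileMetaData* f : c.filtered_input_levels(lvl)) {
      total += f->fd.GetFileSize();
    }
  }
  return total;
}
```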
39 changes: 33 additions & 6 deletions db/compaction/compaction_job.cc
@@ -911,19 +911,23 @@ Status CompactionJob::Install(const MutableCFOptions& mutable_cf_options,
ROCKS_LOG_BUFFER(
log_buffer_,
"[%s] compacted to: %s, MB/sec: %.1f rd, %.1f wr, level %d, "
"files in(%d, %d) out(%d +%d blob) "
"MB in(%.1f, %.1f +%.1f blob) out(%.1f +%.1f blob), "
"files in(%d, %d) filtered(%d, %d) out(%d +%d blob) "
"MB in(%.1f, %.1f +%.1f blob) filtered(%.1f, %.1f) out(%.1f +%.1f blob), "
"read-write-amplify(%.1f) write-amplify(%.1f) %s, records in: %" PRIu64
", records dropped: %" PRIu64 " output_compression: %s\n",
column_family_name.c_str(), vstorage->LevelSummary(&tmp),
bytes_read_per_sec, bytes_written_per_sec,
compact_->compaction->output_level(),
stats.num_input_files_in_non_output_levels,
- stats.num_input_files_in_output_level, stats.num_output_files,
+ stats.num_input_files_in_output_level,
+ stats.num_filtered_input_files_in_non_output_levels,
+ stats.num_filtered_input_files_in_output_level, stats.num_output_files,
stats.num_output_files_blob, stats.bytes_read_non_output_levels / kMB,
stats.bytes_read_output_level / kMB, stats.bytes_read_blob / kMB,
- stats.bytes_written / kMB, stats.bytes_written_blob / kMB, read_write_amp,
- write_amp, status.ToString().c_str(), stats.num_input_records,
+ stats.bytes_skipped_non_output_levels / kMB,
+ stats.bytes_skipped_output_level / kMB, stats.bytes_written / kMB,
+ stats.bytes_written_blob / kMB, read_write_amp, write_amp,
+ status.ToString().c_str(), stats.num_input_records,
stats.num_dropped_records,
CompressionTypeToString(compact_->compaction->output_compression())
.c_str());
@@ -2007,7 +2011,6 @@ bool CompactionJob::UpdateCompactionStats(uint64_t* num_input_range_del) {
bool has_error = false;
const ReadOptions read_options(Env::IOActivity::kCompaction);
const auto& input_table_properties = compaction->GetInputTableProperties();
- // TODO(yuzhangyu): add dedicated stats for filtered files.
for (int input_level = 0;
input_level < static_cast<int>(compaction->num_input_levels());
++input_level) {
@@ -2047,6 +2050,23 @@ bool CompactionJob::UpdateCompactionStats(uint64_t* num_input_range_del) {
*num_input_range_del += file_num_range_del;
}
}

+ const std::vector<FileMetaData*>& filtered_flevel =
+     compaction->filtered_input_levels(input_level);
+ size_t num_filtered_input_files = filtered_flevel.size();
+ uint64_t* bytes_skipped;
+ if (compaction->level(input_level) != compaction->output_level()) {
+   compaction_stats_.stats.num_filtered_input_files_in_non_output_levels +=
+       static_cast<int>(num_filtered_input_files);
+   bytes_skipped = &compaction_stats_.stats.bytes_skipped_non_output_levels;
+ } else {
+   compaction_stats_.stats.num_filtered_input_files_in_output_level +=
+       static_cast<int>(num_filtered_input_files);
+   bytes_skipped = &compaction_stats_.stats.bytes_skipped_output_level;
+ }
+ for (const FileMetaData* filtered_file_meta : filtered_flevel) {
+   *bytes_skipped += filtered_file_meta->fd.GetFileSize();
+ }
+ }

assert(compaction_job_stats_);
@@ -2071,6 +2091,13 @@ void CompactionJob::UpdateCompactionJobStats(
stats.num_input_files_in_output_level;
compaction_job_stats_->num_input_files_at_output_level =
stats.num_input_files_in_output_level;
+ compaction_job_stats_->num_filtered_input_files =
+     stats.num_filtered_input_files_in_non_output_levels +
+     stats.num_filtered_input_files_in_output_level;
+ compaction_job_stats_->num_filtered_input_files_at_output_level =
+     stats.num_filtered_input_files_in_output_level;
+ compaction_job_stats_->total_skipped_input_bytes =
+     stats.bytes_skipped_non_output_levels + stats.bytes_skipped_output_level;

// output information
compaction_job_stats_->total_output_bytes = stats.bytes_written;
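With the widened format string, the compaction summary line now reports filtered file counts and skipped megabytes alongside the usual inputs. An invented example of the updated log output (all values made up, shown only to illustrate where the new `filtered(...)` fields land):

```
[default] compacted to: files[0 5 0 0 0 0 0] max score 0.43, MB/sec: 120.3 rd, 118.7 wr, level 1, files in(4, 2) filtered(0, 1) out(5 +0 blob) MB in(250.1, 120.4 +0.0 blob) filtered(0.0, 60.2) out(310.0 +0.0 blob), read-write-amplify(2.7) write-amplify(1.2) OK, records in: 1000000, records dropped: 15000 output_compression: Snappy
```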
(Diff for the remaining changed files not shown.)
