Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new Iterator API Refresh(const snapshot*) #10594

Closed
wants to merge 12 commits into from

Conversation

rockeet
Copy link
Contributor

@rockeet rockeet commented Aug 28, 2022

This PR resolves #10487 & #10536, user code needs to call Refresh() periodically.

The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is db_iter_->set_sequence(read_seq) -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno().

This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.

Test plan:

  • New unit tests.
  • Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.

TODO in a following PR:

  • Stress test Iterator::Refresh() to different snapshots or no snapshot.

Copy link
Member

@cbi42 cbi42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Added some comment and question.

We should decide and document the API for whether Refresh() supports different snapshots. If so, changes are needed to set correct sequence number upperbound for range tombstones.

virtual Status Refresh() {
virtual Status Refresh() { return Refresh(nullptr); }

virtual Status Refresh(const class Snapshot*) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth clarifying if the parameter needs to be the same snapshot used in read options to create the original iterator. If so, maybe there is no need to pass in snapshot since we can access it from ArenaWrappedDBIter.read_options_?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO - does not need to be the same, but we should anyways require the user provide snapshot, regardless of whether there was one in ReadOptions. That way we can guarantee the target snapshot still exists so we don't accidentally refresh to a SuperVersion on which that snapshot is unavailable.

I am glad that the no-argument Refresh() refreshes to latest state rather than the ReadOptions::snapshot. We need to make it clear from the API doc though.

@@ -66,7 +77,7 @@ Status ArenaWrappedDBIter::Refresh() {
new (&arena_) Arena();

SuperVersion* sv = cfd_->GetReferencedSuperVersion(db_impl_);
SequenceNumber latest_seq = db_impl_->GetLatestSequenceNumber();
SequenceNumber latest_seq = GetSeqNum(db_impl_, snap);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If snap is different from read_options_.snapshot, we may have incorrect sequence number upperbound for range tombstones. For example, see here

rocksdb/db/memtable_list.cc

Lines 231 to 233 in f3b359a

SequenceNumber read_seq = options.snapshot != nullptr
? options.snapshot->GetSequenceNumber()
: kMaxSequenceNumber;
and here
if (read_options.snapshot != nullptr) {
snapshot = read_options.snapshot->GetSequenceNumber();
}

where we use the sequence number from read_options_.snapshot as the upper bound for sequence number for range tombstones.

This applies to the case when SV number did not change too.

@@ -92,7 +92,9 @@ class Iterator : public Cleanable {
// If supported, renew the iterator to represent the latest state. The
// iterator will be invalidated after the call. Not supported if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here regarding snapshot can be updated.

@rockeet
Copy link
Contributor Author

rockeet commented Sep 28, 2022

Thanks for the contribution! Added some comment and question.

We should decide and document the API for whether Refresh() supports different snapshots. If so, changes are needed to set correct sequence number upperbound for range tombstones.

I agree, it is no need to pass a same snapshot object, snapshot in ArenaWrappedDBIter.read_options_ should be updated as the passing in snapshot, this is complicated with del_range tombstone, I have missed this case(I just use this in our myrocks branch with del_range disabled).

@cbi42 cbi42 force-pushed the Add.Iterator.Refresh.snapshot branch 2 times, most recently from 6027754 to d3415cf Compare August 17, 2023 23:10
@cbi42 cbi42 changed the title Add.iterator.refresh.snapshot Add new Iterator API Refresh(const snapshot*) Aug 18, 2023
@facebook-github-bot
Copy link
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42 cbi42 force-pushed the Add.Iterator.Refresh.snapshot branch from 2aee1c1 to abc12c5 Compare August 18, 2023 19:31
@facebook-github-bot
Copy link
Contributor

@rockeet has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42 cbi42 requested a review from ajkr August 28, 2023 19:30
Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

db/db_range_del_test.cc Show resolved Hide resolved
@cbi42 cbi42 force-pushed the Add.Iterator.Refresh.snapshot branch from abc12c5 to f882f6d Compare September 14, 2023 18:55
@facebook-github-bot
Copy link
Contributor

@rockeet has updated the pull request. You must reimport the pull request before landing.

@cbi42 cbi42 force-pushed the Add.Iterator.Refresh.snapshot branch from f882f6d to 8d5e5b5 Compare September 15, 2023 04:34
@facebook-github-bot
Copy link
Contributor

@rockeet has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@rockeet has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42
Copy link
Member

cbi42 commented Sep 15, 2023

Added support for setting new read sequence number in LevelIterator, so it will use it when creating new range tombstone iterator.

@facebook-github-bot
Copy link
Contributor

@cbi42 merged this pull request in 68ce5d8.

facebook-github-bot pushed a commit that referenced this pull request Sep 18, 2023
…#11848)

Summary:
CI has been hitting assertion error like
```
#8  0x00007fafd9294fd6 in __GI___assert_fail (assertion=assertion@entry=0x7fafda270300 "!*memtable_range_tombstone_iter_ || sv_number_ != cfd_->GetSuperVersionNumber()", file=file@entry=0x7fafda270350 "db/arena_wrapped_db_iter.cc", line=line@entry=124, function=function@entry=0x7fafda270288 "virtual rocksdb::Status rocksdb::ArenaWrappedDBIter::Refresh(const rocksdb::Snapshot*)") at assert.c:101
```

This is due to
* Iterator::Refresh() passing in `cur_sv_number` instead of `sv->version_number` here: https://github.com/facebook/rocksdb/blob/1c6faf35871a236222bcbf0b69718ee43376a951/db/arena_wrapped_db_iter.cc#L94-L96

* `super_version_number_` can be incremented before thread local SV is installed: https://github.com/facebook/rocksdb/blob/main/db/column_family.cc#L1287-L1306

* The optimization in #11452 removed the check for SV number, such that `cur_sv_number > sv.version_number` is possible in the following code.
```
uint64_t cur_sv_number = cfd_->GetSuperVersionNumber();
SuperVersion* sv = cfd_->GetReferencedSuperVersion(db_impl_);
```

Not sure why assertion only started failing after #10594, maybe it's because Refresh() is called more often in stress test.

Pull Request resolved: #11848

Test Plan:
* This repros hits the assertion pretty consistently before this change:
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_one_in=0 --block_size=16384 --bloom_bits=0.7161318870366848 --cache_index_and_filter_blocks=0 --cache_size=8388608 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=30 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=500000 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=30 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35 --use_io_uring=0 --db=/tmp/rocksdb_crashtest_blackboxnf3pyv_0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_6opy9nqg
```

Reviewed By: ajkr

Differential Revision: D49344066

Pulled By: cbi42

fbshipit-source-id: d5373ddb48d933acb42a5dd8fae3f3019b0241e5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Iterator::Refresh() with a snapshot
4 participants