storage: utilities to implement sliding window compaction #14368
Conversation
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39616#018b5e0a-1531-4479-926a-fe185f4c9460
// Compacted index writer for the newly written segment. May not be
// supplied if the compacted index isn't expected to change, e.g. when
// rewriting a single segment filtering with its own compacted index.
compacted_index_writer* _compacted_idx;
index_state _idx;
I'm a bit confused since we are writing to two indices. What will be the end state here?
I see. We need to self compact first, so we still need to build the old compacted segment index. Still, I don't get the purpose of the new one.
In general, having the new compacted index seems useful in cases where there are subsequent self compactions, to avoid having to rebuild it by reading the segment. That will happen after restarting, since the segment bitflags are in memory only.
I'm still confused. What does _compacted_idx point to?
I've re-read deduplicate_segment and it makes more sense now. Is the idea to update both the compaction index (via compacted_index_writer) and the segment index (via the return value of deduplicate_segment)?
Right. FWIW I have a follow-up change to pass the index_state in as an input too, so the compacted index writer, index, and appender would all be decoupled from a single segment reducer. The idea there is that rather than rewriting one segment at a time, we could merge two (or eventually more) segments at a time by having the segment reducers share the writers.
}
// We should only keep the record if its offset is equal to or
// higher than the indexed one.
co_return map_offset.value() <= o;
Can the offset ever be equal to the indexed one?
Yeah, the offset being equal means that this record was the highest indexed offset for the key.
The offset being higher means that the record has a higher offset than what we've indexed, e.g. because we only partially indexed the segment and the key reappeared later in it at an offset past what we indexed.
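To make the comparison above concrete, here is a minimal, self-contained sketch of the keep/discard predicate, assuming a plain hash map stands in for the compaction key-to-offset map (names and types are illustrative, not Redpanda's):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using offset_t = int64_t;

// A record survives deduplication only if its offset is at least the
// highest offset indexed for its key. Equal means this record *is* the
// latest indexed occurrence; higher means the record wasn't indexed
// (e.g. the segment was only partially indexed).
bool should_keep(const std::unordered_map<int, offset_t>& latest_by_key,
                 int key, offset_t record_offset) {
    auto it = latest_by_key.find(key);
    if (it == latest_by_key.end()) {
        return true; // key never indexed: keep conservatively
    }
    // Mirrors `co_return map_offset.value() <= o;` from the snippet.
    return it->second <= record_offset;
}
```

A record at an offset below the indexed one is a stale duplicate and is dropped.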
Right. We iterate through the segments in reverse order, but read from the compaction index in the natural order.
Right, exactly
@@ -291,6 +298,7 @@ class disk_log_impl final : public log {
    mutex _segments_rolling_lock;

    std::optional<model::offset> _cloud_gc_offset;
    std::optional<model::offset> _last_compaction_window_start_offset;
Should this be persisted on disk?
I think we should, but as a short term follow-up task. Currently compaction is already susceptible to duplicating work after restarts since we don't persist bitflags.
Force-pushed from a8a9daa to 8086174.
LGTM
@@ -287,6 +287,19 @@ ss::future<ss::stop_iteration> copy_data_segment_reducer::do_compaction(
    if (to_copy == std::nullopt) {
        co_return stop_t::no;
    }
    if (_compacted_idx && is_compactible(to_copy.value())) {
> rather than reusing a segment's existing compacted index, we will need to rewrite an index as the segment is rewritten with deduplication context from other segments
I'm a bit unsure why anything related to the compacted index needs to be changed. It's a reflection of what is in a segment, i.e. (key, offset) pairs, and I wouldn't think that changes based on how the segment is created (single-segment compaction vs. compaction with a larger sliding window).
This method builds a new segment, and the keys in the compacted index should reflect that, no? It's possible that the new segment has virtually no keys because all of the new values were written in newer segments, in which case keeping the existing compacted index would be wasteful.
To your point, we could just delete the compacted index and instead rely on a subsequent round of self compaction to create the indexes. OTOH I'm thinking that windowed compaction would be a sort of superset of self compaction: if you've done windowed compaction, there's no need to self compact because the indexes are already there.
Force-pushed from 73453cb to 6d269ae.
With coroutines, it can be more convenient to pass a procedural method into the iteration function rather than implementing a reducer. I intend to use this to build a map using multiple readers. Puts into this map will be async, given the possibility of stalls during probing.
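The map described above might look roughly like the following sketch: a fixed-capacity, linearly probed table that keeps only the highest offset seen per key. This is a hypothetical simplification; the real implementation would expose future-returning put/get so long probe runs can yield, whereas this stand-in is synchronous.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative key -> highest-offset map built by linear probing over a
// fixed-size table (names are hypothetical, not Redpanda's).
struct key_offset_map {
    struct slot { bool used = false; uint64_t key = 0; int64_t offset = 0; };
    explicit key_offset_map(size_t capacity) : slots_(capacity) {}

    // Record `offset` for `key`, keeping only the highest offset seen.
    // Returns false when the table is full and the key is absent.
    bool put(uint64_t key, int64_t offset) {
        for (size_t i = 0; i < slots_.size(); ++i) {
            auto& s = slots_[(hash(key) + i) % slots_.size()];
            if (!s.used) { s = {true, key, offset}; return true; }
            if (s.key == key) { s.offset = std::max(s.offset, offset); return true; }
        }
        return false;
    }

    std::optional<int64_t> get(uint64_t key) const {
        for (size_t i = 0; i < slots_.size(); ++i) {
            const auto& s = slots_[(hash(key) + i) % slots_.size()];
            if (!s.used) { return std::nullopt; }
            if (s.key == key) { return s.offset; }
        }
        return std::nullopt;
    }

private:
    static size_t hash(uint64_t k) { return static_cast<size_t>(k * 2654435761u); }
    std::vector<slot> slots_;
};
```

The probing loop is exactly the kind of potentially long-running work that motivates making put/get async in the real version.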
This will be useful in testing the sliding window approach.
This replaces the inlined should_keep() method in the copy_data_segment_reducer with a function, so that a subsequent commit can base deduplication on a key_offset_map.
Upcoming changes to use a key_offset_map will require async calls to put/get. Our implementation itself won't do async work, but there will be potential for long-running tasks with some form of linear probing of a map, which should be broken up by yielding. So, this commit changes the filter interface to an async version.
Subsequent commits will introduce sliding window compaction, by which segments are deduplicated with keys from multiple segments. To distinguish such compacted segments (e.g. to tell if there are new uncompacted segments that need deduplicating), this adds a new segment bitflag.
In sliding window compaction, it will be possible for all of a segment's data to be removed. To ensure each segment still contains data, this allows passing the last segment offset to the segment reducer, so that the reducer can tell whether it would end up empty by the end of its reduce and force-keep a record accordingly.
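The "never emit an empty segment" rule described above can be sketched as follows. This is a simplified, hypothetical stand-in for the reducer's copy loop (record layout and names are illustrative): records failing the dedup filter are dropped, except that the record at the segment's last offset is force-kept when nothing else survived.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

struct record { int64_t offset; int key; };

// Copy records that pass `keep`, but force-keep the record at
// `last_offset` if the output would otherwise be empty.
std::vector<record> copy_segment(const std::vector<record>& segment,
                                 int64_t last_offset,
                                 const std::function<bool(const record&)>& keep) {
    std::vector<record> out;
    for (const auto& r : segment) {
        if (keep(r)) {
            out.push_back(r);
        } else if (r.offset == last_offset && out.empty()) {
            out.push_back(r); // segment would otherwise end up empty
        }
    }
    return out;
}
```

Passing the last offset in lets the reducer make this decision locally, without a second pass over the segment.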
In sliding window compaction, rather than reusing a segment's existing compacted index, we will need to rewrite an index as the segment is rewritten with deduplication context from other segments. A compacted index writer is now passed to the segment reducer.
There's some functional overlap with compaction_reducers, but the utilities include methods to build a key_offset_map and to rewrite a segment and compacted index removing duplicates.
Adds a method to collect the set of segments over which to slide during a sliding window compaction.
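A hedged sketch of what such segment selection might look like, assuming (hypothetically) that the window covers closed segments below the previous window's start offset (everything on the first pass), and that the active segment is always excluded. Field and function names here are illustrative only:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct seg { int64_t base_offset; bool closed; };

// Collect the segments to slide over: closed segments whose base offset
// is below the previous window's start (all closed segments when no
// previous window exists).
std::vector<seg> find_sliding_range(
    const std::vector<seg>& segments,
    std::optional<int64_t> last_window_start) {
    std::vector<seg> out;
    for (const auto& s : segments) {
        if (!s.closed) {
            continue; // never compact the active segment
        }
        if (last_window_start && s.base_offset >= *last_window_start) {
            continue; // already deduplicated in a previous window
        }
        out.push_back(s);
    }
    return out;
}
```

This pairs with the `_last_compaction_window_start_offset` field discussed earlier, which records where the previous window stopped.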
Force-pushed from 6d269ae to e1d9ff3.
CI failure is #14451
Adds utility methods that will be used in sliding window compaction. There are a few pieces to this:
This PR introduces the mechanisms to perform the above steps, but doesn't orchestrate them into a method to be used by compaction. This will come in a follow-up PR.
Fixes #14363
Backports Required
Release Notes