storage: utilities to implement sliding window compaction #14368
Conversation
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39616#018b5e0a-1531-4479-926a-fe185f4c9460
// Compacted index writer for the newly written segment. May not be
// supplied if the compacted index isn't expected to change, e.g. when
// rewriting a single segment filtering with its own compacted index.
compacted_index_writer* _compacted_idx;
index_state _idx;
I'm a bit confused since we are writing to two indices. What will be the end state here?
I see. We need to self compact first, so we still need to build the old compacted segment index. Still, I don't get the purpose of the new one.
In general, having the new compacted index seems useful in cases where there are subsequent self compactions, to avoid having to rebuild it by reading the segment. That will happen after restarting, since the segment bitflags are in memory only.
I'm still confused. What does _compacted_idx point to?
I've re-read deduplicate_segment and it makes more sense now. Is the idea to update both the compaction index (via compacted_index_writer) and the segment index (via the return value of deduplicate_segment)?
Right. FWIW I have a follow-up change to pass the index_state in as an input too, so the compacted index writer, index, and appender would all be decoupled from a single segment reducer. The idea there is that rather than rewriting one segment at a time, we could merge two (or eventually more) segments at a time by having the segment reducers share the writers.
}
// We should only keep the record if its offset is equal to or
// higher than the indexed one.
co_return map_offset.value() <= o;
Can the offset ever be equal to the indexed one?
Yeah, the offset being equal means that this record was the highest indexed offset for the key.
The offset being higher means that the record has a higher offset than what we've indexed, e.g. because we only partially indexed the segment and the key reappeared later in it at an offset past what we indexed.
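To make the comparison above concrete, here is a minimal, self-contained sketch of the keep/discard predicate, assuming a plain hash map stands in for the compaction key-to-offset map (names and types are illustrative, not Redpanda's):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using offset_t = int64_t;

// A record survives deduplication only if its offset is at least the
// highest offset indexed for its key. Equal means this record *is* the
// latest indexed occurrence; higher means the record wasn't indexed
// (e.g. the segment was only partially indexed).
bool should_keep(const std::unordered_map<int, offset_t>& latest_by_key,
                 int key, offset_t record_offset) {
    auto it = latest_by_key.find(key);
    if (it == latest_by_key.end()) {
        return true; // key never indexed: keep conservatively
    }
    // Mirrors `co_return map_offset.value() <= o;` from the snippet.
    return it->second <= record_offset;
}
```

A record at an offset below the indexed one is a stale duplicate and is dropped.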
Right. We iterate through the segments in reverse order, but read from the compaction index in the natural order.
Right, exactly
@@ -291,6 +298,7 @@ class disk_log_impl final : public log {
    mutex _segments_rolling_lock;

    std::optional<model::offset> _cloud_gc_offset;
    std::optional<model::offset> _last_compaction_window_start_offset;
Should this be persisted on disk?
I think we should, but as a short term follow-up task. Currently compaction is already susceptible to duplicating work after restarts since we don't persist bitflags.
Force-pushed from a8a9daa to 8086174.
LGTM
@@ -287,6 +287,19 @@ ss::future<ss::stop_iteration> copy_data_segment_reducer::do_compaction(
    if (to_copy == std::nullopt) {
        co_return stop_t::no;
    }
    if (_compacted_idx && is_compactible(to_copy.value())) {
> rather than reusing a segment's existing compacted index, we will need to rewrite an index as the segment is rewritten with deduplication context from other segments
I'm a bit unsure why anything related to the compacted index needs to be changed. It's a reflection of what is in a segment, i.e. (key, offset) pairs, and I wouldn't think that changes based on how the segment is created (single-segment compaction vs. compaction with a larger sliding window).
This method builds a new segment, and the keys in the compacted index should reflect that, no? It's possible that the new segment has virtually no keys because all of the new values were written in newer segments, in which case keeping the existing compacted index would be wasteful.
To your point, we could just delete the compacted index and instead rely on a subsequent round of self compaction to create the indexes. OTOH I'm thinking that windowed compaction would be a sort of superset of self compaction: if you've done windowed compaction, there's no need to self compact because the indexes are already there.
Force-pushed from 73453cb to 6d269ae.
With coroutines, it can be more convenient to pass a procedural method into the iteration function rather than implementing a reducer. I intend to use this to build a map using multiple readers. Puts into this map will be async, given the possibility of stalls during probing.
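The map described above might look roughly like the following sketch: a fixed-capacity, linearly probed table that keeps only the highest offset seen per key. This is a hypothetical simplification; the real implementation would expose future-returning put/get so long probe runs can yield, whereas this stand-in is synchronous.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative key -> highest-offset map built by linear probing over a
// fixed-size table (names are hypothetical, not Redpanda's).
struct key_offset_map {
    struct slot { bool used = false; uint64_t key = 0; int64_t offset = 0; };
    explicit key_offset_map(size_t capacity) : slots_(capacity) {}

    // Record `offset` for `key`, keeping only the highest offset seen.
    // Returns false when the table is full and the key is absent.
    bool put(uint64_t key, int64_t offset) {
        for (size_t i = 0; i < slots_.size(); ++i) {
            auto& s = slots_[(hash(key) + i) % slots_.size()];
            if (!s.used) { s = {true, key, offset}; return true; }
            if (s.key == key) { s.offset = std::max(s.offset, offset); return true; }
        }
        return false;
    }

    std::optional<int64_t> get(uint64_t key) const {
        for (size_t i = 0; i < slots_.size(); ++i) {
            const auto& s = slots_[(hash(key) + i) % slots_.size()];
            if (!s.used) { return std::nullopt; }
            if (s.key == key) { return s.offset; }
        }
        return std::nullopt;
    }

private:
    static size_t hash(uint64_t k) { return static_cast<size_t>(k * 2654435761u); }
    std::vector<slot> slots_;
};
```

The probing loop is exactly the kind of potentially long-running work that motivates making put/get async in the real version.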
This will be useful in testing the sliding window approach.
This replaces the inlined should_keep() method in the copy_data_segment_reducer with a function, so that a subsequent commit can base deduplication on a key_offset_map.
Upcoming changes to use a key_offset_map will require async calls to put/get. Our implementation itself won't do async work, but there will be potential for long-running tasks with some form of linear probing of a map, which should be broken up by yielding. So, this commit changes the filter interface to an async version.
Subsequent commits will introduce sliding window compaction, by which segments are deduplicated with keys from multiple segments. To distinguish such compacted segments (e.g. to tell if there are new uncompacted segments that need deduplicating), this adds a new segment bitflag.
In sliding window compaction, it will be possible for all of a segment's data to be removed. To ensure each segment still contains data, this allows passing the last segment offset to the segment reducer, so that the reducer can tell whether it would end up empty by the end of its reduce and force-keep a record accordingly.
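The "never emit an empty segment" rule described above can be sketched as follows. This is a simplified, hypothetical stand-in for the reducer's copy loop (record layout and names are illustrative): records failing the dedup filter are dropped, except that the record at the segment's last offset is force-kept when nothing else survived.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

struct record { int64_t offset; int key; };

// Copy records that pass `keep`, but force-keep the record at
// `last_offset` if the output would otherwise be empty.
std::vector<record> copy_segment(const std::vector<record>& segment,
                                 int64_t last_offset,
                                 const std::function<bool(const record&)>& keep) {
    std::vector<record> out;
    for (const auto& r : segment) {
        if (keep(r)) {
            out.push_back(r);
        } else if (r.offset == last_offset && out.empty()) {
            out.push_back(r); // segment would otherwise end up empty
        }
    }
    return out;
}
```

Passing the last offset in lets the reducer make this decision locally, without a second pass over the segment.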
In sliding window compaction, rather than reusing a segment's existing compacted index, we will need to rewrite an index as the segment is rewritten with deduplication context from other segments. A compacted index writer is now passed to the segment reducer.
There's some functional overlap with compaction_reducers, but the utilities include methods to build a key_offset_map and to rewrite a segment and compacted index removing duplicates.
Adds a method to collect the set of segments over which to slide during a sliding window compaction.
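A hedged sketch of what such segment selection might look like, assuming (hypothetically) that the window covers closed segments below the previous window's start offset (everything on the first pass), and that the active segment is always excluded. Field and function names here are illustrative only:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct seg { int64_t base_offset; bool closed; };

// Collect the segments to slide over: closed segments whose base offset
// is below the previous window's start (all closed segments when no
// previous window exists).
std::vector<seg> find_sliding_range(
    const std::vector<seg>& segments,
    std::optional<int64_t> last_window_start) {
    std::vector<seg> out;
    for (const auto& s : segments) {
        if (!s.closed) {
            continue; // never compact the active segment
        }
        if (last_window_start && s.base_offset >= *last_window_start) {
            continue; // already deduplicated in a previous window
        }
        out.push_back(s);
    }
    return out;
}
```

This pairs with the `_last_compaction_window_start_offset` field discussed earlier, which records where the previous window stopped.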
Force-pushed from 6d269ae to e1d9ff3.
CI failure is #14451
Adds utility methods that will be used in sliding window compaction. There are a few pieces to this:
This PR introduces the mechanisms to perform the above steps, but doesn't orchestrate them into a method to be used by compaction. This will come in a follow-up PR.
Fixes #14363
Backports Required
Release Notes