Compaction lossy hash #14283

dotnwat · 2023-10-19T22:44:01Z

Introduces a new offset key map implementation that maps compaction key space into sha256(key) space.

Backports Required

Release Notes

none

This is useful for the open-source build when vtools is not available. Signed-off-by: Noah Watkins <[email protected]>

vbotbuildovich · 2023-10-24T00:06:22Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39629#018b5ebb-73a0-46d9-82bd-b910bbd31174

vbotbuildovich · 2023-10-24T00:16:42Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39629#018b5ebb-73a4-4727-af38-9ae04aeb92d8

emaxerrno · 2023-10-24T03:26:55Z

i was skeptical at first w/ the regular byte comparison hash. this makes sense.

VladLazar

Very nice. A few questions/suggestions

src/v/utils/fragmented_vector.h

VladLazar · 2023-10-24T09:32:48Z

src/v/storage/key_offset_map.h

+    /**
+     * Uses successive chunks of sizeof(index_type) bytes taken from hash(key)
+     * as probes into the hash table. When `next()` returns null then the caller
+     * should switch to linear probing.
+     */
+    struct probe {
+        using index_type = uint32_t;
+        static_assert(sizeof(index_type) <= hash_type::digest_size);
+
+        explicit probe(const hash_type::digest_type&);
+
+        std::optional<index_type> next();
+
+        hash_type::digest_type::const_pointer iter;
+        hash_type::digest_type::const_pointer end;
+    };


So the idea here is to probe at random places in the backing vector. This is neat. Does this smartness have a name?

I looked extensively to see if this had a name, but I don't think it does.

IMO this is probably just a flavor or generalization of double hashing, where a collision at the first hash position causes a different hash function to be used for the next probe. Here the hash output is large enough that we wouldn't use all of those bits for the first position anyway, so the remaining bits can serve as the subsequent probes.
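For readers following the thread, here is a minimal standalone sketch of the probing scheme described in the header comment: successive 4-byte chunks of the digest are used as probe positions, and once the digest is exhausted the search falls back to linear probing. It is not the code from this PR. `digest_type` is a plain 32-byte array standing in for a SHA-256 digest, and `find_free_slot` with its `std::vector<bool>` table is a hypothetical helper for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical stand-in for a 32-byte SHA-256 digest.
using digest_type = std::array<std::uint8_t, 32>;

// Yields successive sizeof(index_type)-byte chunks of the digest.
// The digest must outlive the probe because only pointers are stored.
struct probe {
    using index_type = std::uint32_t;

    explicit probe(const digest_type& d)
      : iter(d.data())
      , end(d.data() + d.size()) {}

    std::optional<index_type> next() {
        if (static_cast<std::size_t>(end - iter) < sizeof(index_type)) {
            return std::nullopt; // digest exhausted
        }
        index_type v;
        std::memcpy(&v, iter, sizeof(v));
        iter += sizeof(v);
        return v;
    }

    const std::uint8_t* iter;
    const std::uint8_t* end;
};

// Hypothetical helper: returns the first free slot for `digest`, trying the
// digest chunks first and then probing linearly from the last position.
std::optional<std::size_t>
find_free_slot(const std::vector<bool>& occupied, const digest_type& digest) {
    assert(!occupied.empty());
    probe p(digest);
    std::size_t last = 0;
    while (auto idx = p.next()) {
        last = *idx % occupied.size();
        if (!occupied[last]) {
            return last;
        }
    }
    // Fall back to linear probing once every chunk has collided.
    for (std::size_t i = 1; i < occupied.size(); ++i) {
        const std::size_t pos = (last + i) % occupied.size();
        if (!occupied[pos]) {
            return pos;
        }
    }
    return std::nullopt; // table is full
}
```

With a 32-byte digest and a 4-byte index this gives eight independent probe positions before the linear fallback is needed.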

src/v/storage/key_offset_map.cc

andrwng

LGTM! Just some nits

src/v/storage/tests/key_offset_map_test.cc

tools/format-cc

andrwng · 2023-10-23T21:58:02Z

src/v/utils/fragmented_vector.h

+     * The expected use case for this is to allocate a large vector in a fiber
+     * using a series of smaller resize() invocations allowing for cooperative
+     * yield calls to be inserted to avoid reactor stalls. The optimal strategy


nit: I wonder if there's a magic number over which it's not safe to call this?

Alternatively, I wonder if there are async helper methods worth adding that encapsulates this expected use case?

I don't think there are any magic numbers here. Since fragmented vector isn't futurized, it would be the same concern as resizing a std::vector or avoiding calling fragmented_vector::copy on a large fragmented vector.
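The chunked pattern mentioned in the doc comment can be sketched roughly as follows. This is an illustration only: it uses `std::vector` rather than fragmented vector, and a plain `yield` callback stands in for whatever cooperative yield the reactor provides. The name `fill_in_chunks` and its parameters are hypothetical and do not come from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fills `vec` with `value` in chunks of at most `chunk` elements and calls
// `yield` between chunks so that a scheduler could run other work.
template<typename T>
void fill_in_chunks(
  std::vector<T>& vec,
  const T& value,
  std::size_t chunk,
  const std::function<void()>& yield) {
    assert(chunk > 0);
    for (std::size_t pos = 0; pos < vec.size(); pos += chunk) {
        const std::size_t n = std::min(chunk, vec.size() - pos);
        std::fill_n(
          vec.begin() + static_cast<std::ptrdiff_t>(pos), n, value);
        // Yield only if more work remains.
        if (pos + n < vec.size()) {
            yield();
        }
    }
}
```

The chunk size is the tuning knob: smaller chunks yield more often at the cost of more scheduling overhead.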

src/v/storage/key_offset_map.cc

vbotbuildovich · 2023-10-24T20:19:02Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39695#018b6311-32e1-459d-91ec-ba8d198872d7

dotnwat · 2023-10-24T21:34:45Z

force-push

added cap on load factor at 95%
enforced capacity on put()
removed fragmented_vector::resize; it's a bit awkward
added helper fragmented_vector_clear_async
added helper fragmented_vector_fill_async
added hash_key_offset_map::initialize for resetting hash table context that avoids reallocation with resize()
switched to large fragmented vector variant
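As a rough illustration of the first two items in that list, the sketch below caps the usable capacity of a fixed-size open-addressing table at 95% of its slots and makes `put()` refuse new keys once that capacity is reached. The class `capped_map`, its layout, and its linear probing are assumptions made for this example and are not the implementation in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size table whose capacity is capped below its slot count.
class capped_map {
public:
    static constexpr std::size_t max_load_percent = 95;

    explicit capped_map(std::size_t slots)
      : slots_(slots)
      , capacity_(slots * max_load_percent / 100) {
        assert(slots > 0);
    }

    std::size_t capacity() const { return capacity_; }
    std::size_t size() const { return size_; }

    // Inserts or updates `key`. Returns false if inserting a new key would
    // exceed the capped capacity; updates of existing keys always succeed.
    bool put(std::uint64_t key, std::int64_t offset) {
        std::size_t pos = key % slots_.size();
        // capacity_ < slots_.size(), so a free slot always exists and this
        // loop terminates.
        while (slots_[pos].used && slots_[pos].key != key) {
            pos = (pos + 1) % slots_.size();
        }
        if (!slots_[pos].used) {
            if (size_ >= capacity_) {
                return false;
            }
            slots_[pos].key = key;
            slots_[pos].used = true;
            ++size_;
        }
        slots_[pos].offset = offset;
        return true;
    }

private:
    struct slot {
        std::uint64_t key = 0;
        std::int64_t offset = 0;
        bool used = false;
    };

    std::vector<slot> slots_;
    std::size_t capacity_;
    std::size_t size_ = 0;
};
```

Keeping some slots permanently free bounds probe lengths and guarantees that a search for an absent key eventually reaches an empty slot.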

vbotbuildovich · 2023-10-25T00:12:41Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/39719#018b63ea-6c0d-4d9e-a600-2a2d69a8ae33

dotnwat · 2023-10-25T00:55:56Z

Failure is #14218

andrwng · 2023-10-24T23:28:31Z

src/v/storage/key_offset_map.cc

+    };
+
+    // handle a non-normalized probe position
+    // returns true if key is inserted


nit: comment needs an update

andrwng · 2023-10-25T02:41:56Z

src/v/storage/key_offset_map.cc

+    probe_count_ = 0;
+}
+
+seastar::future<> hash_key_offset_map::initialize() {


nit: maybe consider naming this reset() or something? As is, it seems like something we should always call before using the map, but I don't think that's the case (presumably we just need to use reset(size_bytes))?

reset(size) implies initialize(), but in practice you really only want to call reset(size) once at boot-up, and then call initialize() each time you want to use the table in a new context, because it avoids freeing and reallocating all the memory.
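A simplified, synchronous sketch of that split is shown below. The real `hash_key_offset_map::initialize()` returns a `seastar::future<>`; everything else here (the `table` class, `mark`, `data`, and the `std::vector` backing store) is hypothetical and only illustrates why reusing the allocation is cheaper than calling `reset(size)` again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical table: reset() allocates, initialize() clears in place.
class table {
public:
    // Allocates storage for `slots` entries, then clears it.
    void reset(std::size_t slots) {
        entries_ = std::vector<std::int64_t>(slots);
        initialize();
    }

    // Clears all entries without freeing or reallocating the storage.
    void initialize() {
        std::fill(entries_.begin(), entries_.end(), empty);
        size_ = 0;
    }

    // Records an offset in a slot; stands in for a real insert.
    void mark(std::size_t slot, std::int64_t offset) {
        assert(slot < entries_.size());
        if (entries_[slot] == empty) {
            ++size_;
        }
        entries_[slot] = offset;
    }

    std::size_t size() const { return size_; }
    std::size_t slots() const { return entries_.size(); }
    const std::int64_t* data() const { return entries_.data(); }

private:
    static constexpr std::int64_t empty = -1;
    std::vector<std::int64_t> entries_;
    std::size_t size_ = 0;
};
```

In this sketch, calling `initialize()` between uses leaves the backing buffer at the same address, which is the property the reply above is after.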

github-actions bot added the area/redpanda label Oct 19, 2023

dotnwat force-pushed the compaction-lossy-hash branch 2 times, most recently from 95e0561 to e40cf8b Compare October 20, 2023 22:42

dotnwat marked this pull request as ready for review October 20, 2023 22:43

dotnwat requested a review from andrwng October 20, 2023 22:43

dotnwat force-pushed the compaction-lossy-hash branch 2 times, most recently from c03dece to 45a02cc Compare October 21, 2023 00:42

tools: add cc format helper

524cfb9

This is useful for the open-source build when vtools is not available. Signed-off-by: Noah Watkins <[email protected]>

dotnwat force-pushed the compaction-lossy-hash branch 2 times, most recently from 3a971d2 to 0ce2a75 Compare October 23, 2023 21:47

VladLazar reviewed Oct 24, 2023

andrwng reviewed Oct 24, 2023

dotnwat added 4 commits October 24, 2023 12:36

utils: add async fill operation for fragmented vector

845d66a

Signed-off-by: Noah Watkins <[email protected]>

utils: add fragmented vector async clear interface

23af3f6

Signed-off-by: Noah Watkins <[email protected]>

storage: add size interface to key offset map

5e152eb

Signed-off-by: Noah Watkins <[email protected]>

storage: add key offset map capacity interface

51cc85d

Signed-off-by: Noah Watkins <[email protected]>

dotnwat added 3 commits October 24, 2023 14:31

storage: add hash based compaction index

df03641

Signed-off-by: Noah Watkins <[email protected]>

storage: add tests for key offset map

d9d15a0

Signed-off-by: Noah Watkins <[email protected]>

storage: add license header

0b9f5fe

Signed-off-by: Noah Watkins <[email protected]>

dotnwat force-pushed the compaction-lossy-hash branch from 1fb5ca9 to 0b9f5fe Compare October 24, 2023 21:32

dotnwat requested review from VladLazar and andrwng October 25, 2023 00:55

andrwng previously approved these changes Oct 25, 2023

storage: update comment

bd74fb4

Signed-off-by: Noah Watkins <[email protected]>

dotnwat dismissed andrwng’s stale review via bd74fb4 October 25, 2023 17:03

dotnwat merged commit 1be948f into redpanda-data:dev Oct 25, 2023
10 of 15 checks passed

andrwng mentioned this pull request Oct 26, 2023

implement hash-based lossy map #14365

Closed

github-actions bot mentioned this pull request Dec 22, 2023

update redpanda appVersion from v23.2.21 to v23.3.1 redpanda-data/helm-charts#950

Merged
