Builder::with_capacity enrichment #428

Merged (2 commits) on Nov 24, 2023

frankmcsherry

This PR does a few things brought to light by the intended RHH batch, all around providing meaningful capacity estimates to the builder. Previously we provided no meaningful capacity estimates: we used Builder::new(), which provides a capacity of 0.

This is now addressed, at least in the Batcher implementations. Two other locations build batches without a batcher (upsert.rs and reduce.rs), and we'll want to think more carefully there.

The Builder::with_capacity method was extended to take key, value, and update capacities separately, but in some cases certain arguments are ignored.
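To illustrate the shape of this change, here is a minimal sketch of a builder interface that takes the three capacities separately. The `Builder` trait, `LayeredBuilder`, and `FlatBuilder` below are hypothetical stand-ins written for this example, not the types in the crate; they only show how one implementation can use all three estimates while another ignores some of them.

```rust
/// Hypothetical builder interface; names are illustrative only.
trait Builder {
    /// Allocates room for the expected number of distinct keys,
    /// distinct (key, value) pairs, and updates.
    fn with_capacity(keys: usize, vals: usize, upds: usize) -> Self;
}

/// Stores keys, values, and updates in separate columns,
/// so it can make use of all three estimates.
struct LayeredBuilder {
    keys: Vec<u64>,
    vals: Vec<u64>,
    upds: Vec<(u64, i64)>,
}

impl Builder for LayeredBuilder {
    fn with_capacity(keys: usize, vals: usize, upds: usize) -> Self {
        LayeredBuilder {
            keys: Vec::with_capacity(keys),
            vals: Vec::with_capacity(vals),
            upds: Vec::with_capacity(upds),
        }
    }
}

/// Stores one flat record per update, so only the update count matters
/// and the other two arguments are ignored.
struct FlatBuilder {
    records: Vec<(u64, u64, u64, i64)>,
}

impl Builder for FlatBuilder {
    fn with_capacity(_keys: usize, _vals: usize, upds: usize) -> Self {
        FlatBuilder { records: Vec::with_capacity(upds) }
    }
}

fn main() {
    let layered = LayeredBuilder::with_capacity(10, 50, 200);
    let flat = FlatBuilder::with_capacity(10, 50, 200);
    println!(
        "layered capacities: keys>={} vals>={} upds>={}",
        layered.keys.capacity(),
        layered.vals.capacity(),
        layered.upds.capacity()
    );
    println!("flat capacity: records>={}", flat.records.capacity());
}
```

Passing all three numbers keeps the call site uniform across batch layouts; each layout decides which estimates are meaningful for it.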

cc: @antiguru

@@ -23,7 +23,7 @@ use std::hash::Hasher;
 /// can take advantage of the smaller size.
 pub trait Hashable {
     /// The type of the output value.
-    type Output: Into<u64>+Copy;
+    type Output: Into<u64>+Copy+Ord;

Erm, ignore this. Incoming change related to the RHH batch.

@frankmcsherry

@antiguru: the main change here is that the MergeBatchers each do a bit more compute, and in exchange pre-allocate right-sized builders. These builders are perhaps a bit oversized for ord implementations, but they should be spot-on for ord_neu implementations (except for the singleton update trick; sigh). This may have MZ consequences we should investigate, but one of them could be that without mis-sized builders we needn't repeatedly double allocations, and may end up with a lower memory envelope overall.
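As a rough sketch of the trade-off described above (not the MergeBatcher code itself), the extra compute can be thought of as one counting pass over sorted updates that yields the three capacities. The `Update` alias, `count_capacities`, and `growth_steps` are hypothetical names for this example, and the update layout `((key, val), time, diff)` is an assumption. The second helper shows why a right-sized allocation avoids repeated growth.

```rust
/// Assumed update layout for this sketch: ((key, val), time, diff).
type Update = ((u32, u32), u64, i64);

/// Counts distinct keys, distinct (key, val) pairs, and updates in a
/// slice that is already sorted by (key, val, time).
fn count_capacities(sorted: &[Update]) -> (usize, usize, usize) {
    let mut keys = 0;
    let mut vals = 0;
    let mut prev: Option<(u32, u32)> = None;
    for &((key, val), _time, _diff) in sorted {
        match prev {
            // Same key as before: only a new value adds to the count.
            Some((pk, pv)) if pk == key => {
                if pv != val {
                    vals += 1;
                }
            }
            // First update, or a new key: both counts advance.
            _ => {
                keys += 1;
                vals += 1;
            }
        }
        prev = Some((key, val));
    }
    (keys, vals, sorted.len())
}

/// Pushes `n` items and returns how many times the capacity changed.
fn growth_steps(mut v: Vec<u64>, n: u64) -> usize {
    let mut steps = 0;
    let mut cap = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != cap {
            steps += 1;
            cap = v.capacity();
        }
    }
    steps
}

fn main() {
    let updates: Vec<Update> = vec![
        ((1, 10), 0, 1),
        ((1, 10), 1, -1),
        ((1, 11), 0, 1),
        ((2, 10), 0, 1),
    ];
    let (keys, vals, upds) = count_capacities(&updates);
    println!("keys={keys} vals={vals} upds={upds}");

    println!("growth steps from empty: {}", growth_steps(Vec::new(), 1000));
    println!("growth steps pre-sized:  {}", growth_steps(Vec::with_capacity(1000), 1000));
}
```

The pre-sized vector never regrows, while the empty one reallocates several times on the way up; that repeated growth is the overhead the right-sized builders are meant to avoid.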

@frankmcsherry

The implementations/rhh module contains a prototype Robin Hood Hashing based batch, which is a surprisingly small change to the standard batch we use (just some more spacing, and a bit of fiddly logic all over the place). For now the module is public but there are no exported traces, and no one should use it yet, other than for testing which I am doing now!
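For readers unfamiliar with the idea, here is a small sketch of a Robin Hood style layout over sorted hashes. It is not the code in implementations/rhh: the `Table` type, `desired`, `layout`, and `find` are hypothetical, the table stores bare `u64` hashes rather than keys, and the factor-of-two spacing is an assumption. It only illustrates the "sorted order plus some spacing" idea: each entry lands in the first free slot at or after the slot its hash asks for.

```rust
/// Hypothetical table: `primary` addressable slots plus an overflow tail.
struct Table {
    primary: usize,
    slots: Vec<Option<u64>>,
}

/// Desired slot for `hash` among `primary` slots. Monotone in `hash`,
/// so hash order and slot order agree.
fn desired(hash: u64, primary: usize) -> usize {
    (((hash as u128) * (primary as u128)) >> 64) as usize
}

/// Lays out sorted, deduplicated hashes with gaps: each one goes in the
/// first free slot at or after its desired slot.
fn layout(sorted_hashes: &[u64]) -> Table {
    let n = sorted_hashes.len();
    let primary = 2 * n.max(1);
    // The tail of `n` extra slots means displaced entries never run off the end.
    let mut slots = vec![None; primary + n];
    let mut next_free = 0;
    for &hash in sorted_hashes {
        let slot = desired(hash, primary).max(next_free);
        slots[slot] = Some(hash);
        next_free = slot + 1;
    }
    Table { primary, slots }
}

/// Looks up `hash`, scanning forward from its desired slot until it is
/// found, or an empty slot or a larger hash proves it is absent.
fn find(table: &Table, hash: u64) -> Option<usize> {
    let mut slot = desired(hash, table.primary);
    while let Some(Some(found)) = table.slots.get(slot) {
        if *found == hash {
            return Some(slot);
        }
        if *found > hash {
            return None;
        }
        slot += 1;
    }
    None
}

fn main() {
    let table = layout(&[1, 2, 3, u64::MAX - 1, u64::MAX]);
    println!("slots: {:?}", table.slots);
    println!("find(2) = {:?}", find(&table, 2));
    println!("find(4) = {:?}", find(&table, 4));
}
```

Because entries stay in hash order, the layout can still be built and merged sequentially like a sorted batch, while lookups start near the right position instead of searching.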

@frankmcsherry frankmcsherry merged commit 615c688 into TimelyDataflow:master Nov 24, 2023
1 check passed