Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: store min, max, null count, and true count in column metadata #1164

Merged
merged 16 commits into from
Nov 1, 2024

Conversation

danking
Copy link
Member

@danking danking commented Oct 29, 2024

No description provided.

@danking
Copy link
Member Author

danking commented Oct 29, 2024

fyi @lwwmanning I rebased your PR on develop.

@danking danking changed the title Dk/write stats feat: store min, max, null count, and true count in column metadata Oct 30, 2024
@danking danking marked this pull request as ready for review October 30, 2024 21:46
@danking
Copy link
Member Author

danking commented Oct 30, 2024

Alright, this does not break the current read/write tests (which might only be in Python doc tests?). We need to actually read and use the data to really test it though.

@robert3005
Copy link
Member

@danking i will do a proper review in ~1h

Copy link
Member

@robert3005 robert3005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly passing the conversion function seems like a hack. I think we would also like to specialize BoolMetadataAccumulator to hold true counts since they don't appear otherwise

@@ -258,6 +226,5 @@ mod tests {
buffer[buffer_begin..buffer_end].len(),
FOOTER_POSTSCRIPT_SIZE
);
assert_eq!(buffer[buffer_begin..buffer_end].len(), 32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This asserts that we don't accidentally change FOOTER_POSTSCRIPT_SIZE

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -83,74 +82,58 @@ impl<W: VortexWrite> LayoutWriter<W> {

async fn write_column_chunks<S>(&mut self, mut stream: S, column_idx: usize) -> VortexResult<()>
where
S: Stream<Item = VortexResult<Array>> + Unpin,
S: Stream<Item = VortexResult<Array>> + Unpin + ArrayStream,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess ArrayStream implies Stream? Can get rid of the earlier bound

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

impl<T> ExtremaAccumulator<T> {
fn new(size_hint: usize, to_array: fn(Vec<Option<T>>) -> Array) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird to pass this function here directly. All you want is impl Into<Array> for ExtremaAccumulator<...>


fn push_batch_byte_offsets(&mut self, batch_byte_offsets: Vec<u64>);

fn into_layouts_and_metadata(self: Box<Self>) -> VortexResult<(VecDeque<Layout>, StructArray)>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you split this method into separate trait you can avoid passing the conversion function and instead specialize as necessary.

pub trait IntoArrayMetadata {
    fn into_array_metadata(self: Box<Self>) -> VortexResult<(VecDeque<Layout>, StructArray)>;
}

@danking
Copy link
Member Author

danking commented Oct 31, 2024

Erm. Ok. I might have gone rogue here so push back, if so.

I implemented From<Vec<Option<T>>> for the standard types that I need. I define three accumulators: BoolAccumulator, StandardAccumulator (has extrema), and BasicAccumulator (just row offsets and null counts).

I also switched the writer.rs to use a new ColumnWriter which encapsulates all the per-column work.


pub fn new_metadata_accumulator(dtype: &DType) -> Box<dyn MetadataAccumulator> {
match dtype {
DType::Null => Box::new(BasicAccumulator::new()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that we could probably also specialize all these to have/not-have null counts based on the nullability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can defer this one for later. I’m not sure the null ability matters for performance here

@lwwmanning lwwmanning enabled auto-merge (squash) November 1, 2024 14:14
@lwwmanning lwwmanning merged commit f83a093 into develop Nov 1, 2024
5 checks passed
@lwwmanning lwwmanning deleted the dk/write-stats branch November 1, 2024 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants