
Fix several bugs in parquet writer statistics generation, add EnabledStatistics to control level of statistics generated #2022

Merged (5 commits) on Jul 8, 2022

Conversation

@tustvold (Contributor) commented on Jul 7, 2022

Which issue does this PR close?

Closes #2015
Closes #2016

Rationale for this change

Fixes a number of bugs

What changes are included in this PR?

Removes max_statistics_size, as it isn't used, and changes statistics to be enabled via an enumeration. This makes it possible to consistently produce only chunk-level statistics (potentially precomputed), or no statistics at all.
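For illustration, configuring the writer with the new property might look roughly like this (a minimal sketch; the builder method name and the exact set of variants are assumed from the diff and review discussion below):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn main() {
    // Request only chunk (column chunk / row group) level statistics;
    // per-page statistics are skipped entirely.
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .build();

    // `props` is then handed to the Arrow or low-level parquet writer as usual.
    let _ = props;
}
```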

Are there any user-facing changes?

No

@tustvold added the api-change (Changes to the arrow API) label on Jul 7, 2022
@github-actions bot added the parquet (Changes to the parquet crate) label on Jul 7, 2022
self
}
}

/// Controls the level of statistics to be computed by the writer
#[derive(Debug, Clone, Copy, Eq, PartialEq)]
pub enum EnabledStatistics {
@tustvold (author) commented:

I'm not a massive fan of this, but it was the only way I could think of to preserve the existing behaviour whilst also fixing the related bugs

A contributor commented:

I actually like this API -- as for some use cases page level statistics are likely to have a higher overhead in storage and writing than benefit (e.g. large string columns for small row groups)

A contributor commented:

It would be good to clarify in here if Page also computes Chunk level statistics.

If Page does not also include Chunk, that seems strange to me (I would expect the lower-level stats to also come with the coarser-grained ones).

@tustvold (author) commented on Jul 8, 2022:

Glad one of us likes it 😆

FWIW, at least in the case of the column index, parquet allows using truncated values in statistics, e.g. instead of a min of "lorem ipsum ..." you could just store "lorem".
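To make the truncation idea concrete, here is an illustrative sketch (not the crate's implementation): a byte-wise prefix of the true minimum is still a valid lower bound, whereas a truncated max would additionally need its last byte incremented to stay a valid upper bound.

```rust
/// Illustrative only: shorten a min statistic to at most `len` bytes.
/// A prefix of the true minimum still sorts <= every value in the column,
/// so it remains a usable lower bound.
fn truncate_min(min: &[u8], len: usize) -> &[u8] {
    &min[..min.len().min(len)]
}

fn main() {
    assert_eq!(
        truncate_min(b"lorem ipsum dolor sit amet", 5),
        &b"lorem"[..]
    );
}
```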

@tustvold force-pushed the fix-parquet-writer-statistics branch from 034be68 to 06cfa35 on July 7, 2022 13:07
if let Some(distinct) = distinct_count {
self.column_distinct_count =
Some(self.column_distinct_count.unwrap_or(0) + distinct);
// We can only set the distinct count if there are no other writes
@tustvold (author) commented:
This is the fix for #2016
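In other words (a standalone sketch of the idea, with struct and field names invented for illustration): distinct counts from separate batches cannot simply be summed, so a caller-supplied count is only kept when it covers everything written to the chunk.

```rust
/// Hypothetical stand-in for the writer's per-chunk bookkeeping; the real
/// struct and field names in the crate differ.
struct ChunkStats {
    values_written: u64,
    distinct_count: Option<u64>,
}

impl ChunkStats {
    fn record_batch(&mut self, batch_len: u64, provided_distinct: Option<u64>) {
        if self.values_written == 0 {
            // First write: a caller-supplied distinct count describes the whole chunk.
            self.distinct_count = provided_distinct;
        } else {
            // Later batches may overlap in values, so summing distinct counts
            // would over-count; drop the statistic instead.
            self.distinct_count = None;
        }
        self.values_written += batch_len;
    }
}
```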

Some(self.make_page_statistics())
} else {
None
let has_min_max = self.min_page_value.is_some() && self.max_page_value.is_some();
@tustvold (author) commented:
This logic is moved from flush_data_pages

@codecov-commenter commented:

Codecov Report

Merging #2022 (428f0d5) into master (38370cb) will decrease coverage by 0.16%.
The diff coverage is 66.40%.

❗ Current head 428f0d5 differs from pull request most recent head 3049111. Consider uploading reports for the commit 3049111 to get more accurate results

@@            Coverage Diff             @@
##           master    #2022      +/-   ##
==========================================
- Coverage   83.59%   83.43%   -0.17%     
==========================================
  Files         222      222              
  Lines       57529    57911     +382     
==========================================
+ Hits        48091    48317     +226     
- Misses       9438     9594     +156     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader/byte_array.rs 84.47% <0.00%> (-1.24%) ⬇️
...et/src/arrow/array_reader/byte_array_dictionary.rs 82.26% <0.00%> (-1.66%) ⬇️
...uet/src/arrow/array_reader/complex_object_array.rs 93.20% <0.00%> (-1.07%) ⬇️
parquet/src/arrow/array_reader/empty_array.rs 45.45% <0.00%> (-10.11%) ⬇️
parquet/src/arrow/array_reader/list_array.rs 92.69% <0.00%> (-0.72%) ⬇️
parquet/src/arrow/array_reader/map_array.rs 58.82% <0.00%> (-8.98%) ⬇️
parquet/src/arrow/array_reader/mod.rs 88.23% <ø> (ø)
parquet/src/arrow/array_reader/null_array.rs 81.48% <0.00%> (-6.52%) ⬇️
parquet/src/arrow/array_reader/primitive_array.rs 88.63% <0.00%> (-1.02%) ⬇️
parquet/src/arrow/array_reader/struct_array.rs 78.99% <0.00%> (-9.69%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38370cb...3049111.

@alamb changed the title from "Fix parquet writer statistics" to "Fix several bugs in parquet writer statistics generation" on Jul 7, 2022
@alamb changed the title from "Fix several bugs in parquet writer statistics generation" to "Fix several bugs in parquet writer statistics generation, add EnabledStatistics to control level of statistics generated" on Jul 7, 2022
@alamb (Contributor) left a comment:

I had some comments on the API design and deprecation, but the actual code looks good and like a significant improvement to me. 👍

@@ -152,7 +152,10 @@ pub(crate) fn add_encoded_arrow_schema_to_metadata(
value: Some(encoded),
};

let mut meta = props.key_value_metadata.clone().unwrap_or_default();
let meta = props
A contributor commented:
Does this change have any effect (other than to make the code more readable)?

@tustvold (author) commented on Jul 8, 2022:

There is no user-facing change, but it does avoid an unnecessary clone
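The general pattern (a generic sketch, not the crate's exact code) is to borrow the `Option`'s contents mutably instead of cloning them and writing the clone back:

```rust
// Generic sketch: append to an optional Vec without cloning it first.
fn append_entry(metadata: &mut Option<Vec<String>>, entry: String) {
    // Before: `metadata.clone().unwrap_or_default()` copied every existing
    // entry and the result then had to be stored back into the Option.
    // After: borrow in place, allocating an empty Vec only when absent.
    metadata.get_or_insert_with(Vec::new).push(entry);
}

fn main() {
    let mut meta = None;
    append_entry(&mut meta, "ARROW:schema".to_string());
    assert_eq!(meta.as_ref().map(Vec::len), Some(1));
}
```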


.map_or(true, |v| self.compare_greater(v, min))
{
self.min_column_value = Some(min.clone());
if self.statistics_enabled == EnabledStatistics::Chunk {
A contributor commented:

I would personally find match self.statistics_enabled ... easier for validating that no other variant of EnabledStatistics applies.

Or perhaps a function EnabledStatistics::compute_chunk_statistics().

Basically I am thinking about "what if someone adds a new variant" -- it would be nice if the compiler told them where to update
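A sketch of that suggestion (variant names assumed to be None / Chunk / Page, following the discussion elsewhere in this review):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EnabledStatistics {
    None,
    Chunk,
    Page,
}

/// With an exhaustive match, adding a fourth variant later becomes a compile
/// error here instead of silently falling through an `==` comparison.
fn track_chunk_min_max(enabled: EnabledStatistics) -> bool {
    match enabled {
        EnabledStatistics::Chunk => true,
        // Page-level stats are rolled up at a different level; None disables tracking.
        EnabledStatistics::Page | EnabledStatistics::None => false,
    }
}
```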

A contributor commented:
In fact, now that I re-read this, shouldn't this also be set if we are computing page-level statistics?

@tustvold (author) commented:
I'll add a comment explaining why we only care if it is exactly this variant, any other variants would need to be computed at a different level

(None, None) => {}
(None, None) => {
for val in values {
Self::update_min(&self.descr, val, &mut self.min_column_value);
A contributor commented:
Is this the fix for #2015?

@tustvold (author) commented:
Yes

/// Writer may optionally provide pre-calculated statistics for this batch
///
/// Note: Unless disabled using by using [`WriterProperties`] to set
/// enabled statistics to [`EnabledStatistics::Chunk`], this will still compute
A contributor commented:
This comment almost reads as the opposite of what I think the code is doing (using the passed-in pre-calculated statistics, or computing them if none are passed in).


self.statistics_enabled = Some(enabled);
}

/// Sets max size for statistics for this column.
A contributor commented:
I don't see any mention of max_statistics_size in this PR description or the linked tickets.

I think it is a reasonable feature (to limit bloating parquet files). Is the issue that it was ignored? If so perhaps we can leave the API in with a warn or deprecate that it is ignored and file a ticket to properly support it?
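For illustration, the suggested soft-deprecation could look roughly like this (the builder type and field are stand-ins, not the crate's actual definitions):

```rust
/// Hypothetical stand-in for the column properties builder.
struct ColumnProps {
    max_statistics_size: usize,
}

impl ColumnProps {
    /// Sets the maximum size for statistics for this column.
    #[deprecated(note = "currently ignored by the writer; see the follow-up ticket")]
    fn set_max_statistics_size(mut self, value: usize) -> Self {
        self.max_statistics_size = value;
        self
    }
}
```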

@tustvold (author) commented:
I will back out this change

/// Perform a conditional update of `cur`, skipping any NaN values
///
/// If `cur` is `None`, sets `cur` to `Some(val)`, otherwise calls `should_update` with
/// the value of `cur`, and updates `cur` to `Some(val)` if it returns `true`
A contributor commented:
cc @crepererum as I think you added the initial NaN handling a while ago

A contributor commented:
If I read this correctly, the NaN handling here is to avoid "poisoning", which is OK.

IMHO the stats should include a flag if there were any NaN values, but this is far outside the scope of this PR since this would affect the parquet format as a whole.
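A minimal standalone sketch of the NaN-skipping conditional update described above (the name and signature are illustrative, not the crate's exact helper):

```rust
/// Update `cur` with `val`, skipping NaN so a single NaN value cannot
/// "poison" the running min/max. `should_update` is given the current value
/// and decides whether `val` should replace it.
fn update_stat(cur: &mut Option<f64>, val: f64, should_update: impl Fn(&f64) -> bool) {
    if val.is_nan() {
        return;
    }
    let take = match cur {
        None => true,
        Some(existing) => should_update(existing),
    };
    if take {
        *cur = Some(val);
    }
}

fn main() {
    let mut min = None;
    for v in [3.0, f64::NAN, 1.0] {
        // For a running minimum, replace when the new value is smaller.
        update_stat(&mut min, v, |cur| v < *cur);
    }
    assert_eq!(min, Some(1.0));
}
```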

assert_eq!(pages[1].page_type(), PageType::DATA_PAGE);

let page_statistics = pages[1].statistics().unwrap();
assert_eq!(page_statistics.min_bytes(), 1_i32.to_le_bytes());
A contributor commented:

👍
