-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix several bugs in parquet writer statistics generation, add EnabledStatistics
to control level of statistics generated
#2022
Fix several bugs in parquet writer statistics generation, add EnabledStatistics
to control level of statistics generated
#2022
Conversation
self | ||
} | ||
} | ||
|
||
/// Controls the level of statistics to be computed by the writer | ||
#[derive(Debug, Clone, Copy, Eq, PartialEq)] | ||
pub enum EnabledStatistics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a massive fan of this, but it was the only way I could think of to preserve the existing behaviour whilst also fixing the related bugs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like this API -- as for some use cases page level statistics are likely to have a higher overhead in storage and writing than benefit (e.g. large string columns for small row groups)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to clarify in here if Page
also computes Chunk
level statistics.
If Page
does not also include Chunk
that seems strange to me (as I would expect lower level stats to also have coarser grained statistics).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad one of us likes it 😆
FWIW at least in the case of the column index, parquet allows using truncated values in statistics. e.g. instead of a min of "lorem ipsum ..."
you could just store "lorem"
034be68
to
06cfa35
Compare
if let Some(distinct) = distinct_count { | ||
self.column_distinct_count = | ||
Some(self.column_distinct_count.unwrap_or(0) + distinct); | ||
// We can only set the distinct count if there are no other writes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix for #2016
Some(self.make_page_statistics()) | ||
} else { | ||
None | ||
let has_min_max = self.min_page_value.is_some() && self.max_page_value.is_some(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is moved from flush_data_pages
Codecov Report
@@ Coverage Diff @@
## master #2022 +/- ##
==========================================
- Coverage 83.59% 83.43% -0.17%
==========================================
Files 222 222
Lines 57529 57911 +382
==========================================
+ Hits 48091 48317 +226
- Misses 9438 9594 +156
Continue to review full report at Codecov.
|
EnabledStatistics
to control level of statistics generated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some comments on the API design and deprecation, but the actual code looks good and like a significant improvement to me. 👍
@@ -152,7 +152,10 @@ pub(crate) fn add_encoded_arrow_schema_to_metadata( | |||
value: Some(encoded), | |||
}; | |||
|
|||
let mut meta = props.key_value_metadata.clone().unwrap_or_default(); | |||
let meta = props |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this change have any effect (other than to make the code more readable)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no user-facing change, but it does avoid an unnecessary clone
self | ||
} | ||
} | ||
|
||
/// Controls the level of statistics to be computed by the writer | ||
#[derive(Debug, Clone, Copy, Eq, PartialEq)] | ||
pub enum EnabledStatistics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like this API -- as for some use cases page level statistics are likely to have a higher overhead in storage and writing than benefit (e.g. large string columns for small row groups)
.map_or(true, |v| self.compare_greater(v, min)) | ||
{ | ||
self.min_column_value = Some(min.clone()); | ||
if self.statistics_enabled == EnabledStatistics::Chunk { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally find match self.statistics_enabled
... to be easier to validate that other variants of EnabledStatistics::Chunk
did not apply.
Or perhaps a function EnabledStatistics::compute_chunk_statistics()
.
Basically I am thinking about "what if someone adds a new variant" -- it would be nice if the compiler told them where to update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, no that I re-read this, shouldn't this also be set if we are computing page level statistics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment explaining why we only care if it is exactly this variant, any other variants would need to be computed at a different level
(None, None) => {} | ||
(None, None) => { | ||
for val in values { | ||
Self::update_min(&self.descr, val, &mut self.min_column_value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the fix for #2015?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
parquet/src/column/writer.rs
Outdated
/// Writer may optionally provide pre-calculated statistics for this batch | ||
/// | ||
/// Note: Unless disabled using by using [`WriterProperties`] to set | ||
/// enabled statistics to [`EnabledStatistics::Chunk`], this will still compute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment almost reads the opposite of what I think the code is doing (using the passed in pre-calculated statistics or computing them if passed in)
self | ||
} | ||
} | ||
|
||
/// Controls the level of statistics to be computed by the writer | ||
#[derive(Debug, Clone, Copy, Eq, PartialEq)] | ||
pub enum EnabledStatistics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to clarify in here if Page
also computes Chunk
level statistics.
If Page
does not also include Chunk
that seems strange to me (as I would expect lower level stats to also have coarser grained statistics).
parquet/src/file/properties.rs
Outdated
self.statistics_enabled = Some(enabled); | ||
} | ||
|
||
/// Sets max size for statistics for this column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any mention of max_statistics_size
in this PR description or the linked tickets.
I think it is a reasonable feature (to limit bloating parquet files). Is the issue that it was ignored? If so perhaps we can leave the API in with a warn
or deprecate
that it is ignored and file a ticket to properly support it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will back out this change
/// Perform a conditional update of `cur`, skipping any NaN values | ||
/// | ||
/// If `cur` is `None`, sets `cur` to `Some(val)`, otherwise calls `should_update` with | ||
/// the value of `cur`, and updates `cur` to `Some(val)` if it returns `true` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @crepererum as I think you added the initial NaN
handling a while ago
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I read this correctly the NaN handling here is to avoid "poisoning" which is OK.
IMHO the stats should include a flag if there were any NaN values, but this is far outside the scope of this PR since this would affect the parquet format as a whole.
assert_eq!(pages[1].page_type(), PageType::DATA_PAGE); | ||
|
||
let page_statistics = pages[1].statistics().unwrap(); | ||
assert_eq!(page_statistics.min_bytes(), 1_i32.to_le_bytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #2015
Closes #2016
Rationale for this change
Fixes a number of bugs
What changes are included in this PR?
Removes max_statistics_size as it isn't used, and changes statistics to be enabled based on an enumeration. This allows consistently only producing chunk level statistics, potentially precomputed, or not.
Are there any user-facing changes?
No