-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197
Conversation
src/main/thrift/parquet.thrift
Outdated
* When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_leve) | ||
* where each element represens the count of the number of times that level occurs in the page/column chunk. | ||
*/ | ||
8: optional list<i64> repetition_level_histogram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's a per-rowgroup statistics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand, I thought statistics can be on either page or column chunk. the histograms would vary based on column chunk I think, but I might be misunderstanding your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry for missunderstand this. How can we make full use of this histogram?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These levels are rather complicated for high-level applications, we may need to provide some utility to translate to norms (e.g. struct/map/array/null) that are much easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree there is complexity here but I think this is the simplest and most complete set of information that we can provide readers. I think the utility methods would likely be per system but we could likely provide some in core packages that give leaf value estimates that follow the rules above.
@mapleFU in terms of usage these would be used similar to how row level reconstruction is done with them. For instance total number of nulls at the leaf can be computed assuming no repeated fields as the cumulative sum of the first n-1 entries in definition level. Similarly number of nested lists assuming no null lists should be computable with a similar cumulative sum. When lists are nullable repetition and definition level need to be looked at together to determine null vs empty lists. Similarly the number of empty lists at level can be inferred by looking at definition and repetition levels together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could provide utility methods that account for common permutations of memory models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems It can help pushdown some filter on List/Map, and helping constructing the list. It's great, but I think maybe we need some samples? Because it's a bit hard to understand how to make full use of it. Like some rules in ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So some simple examples in from Arrow (my rep/def level calculations are rusty so please double check to make sure they make sense):
- If all you have are nested structures (say 3 nullable levels), then the number of null values at the leaf is the sum of the first 3 elements of the definition level histogram. Individual level nullability for Arrow isn't super important because it always leaves the space for nulls so it is pretty much square. You could however determine that there are no nulls at the second level of nesting by checking the appropriate histogram box (and potentially save the allocation of the bit vector in that case)
- If you have two nested lists and assuming lists and elements are not nullable. The number of inner nested lists is
rep_hist[0] - def_hist[0] + {rep_hist[1]
(outer list starts - empty outer lists + inner list starts).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds feasible. Then we are good to store level histograms separately.
@gszadovszky @shangxinli Have time take a look? |
Seems this patch doesn't need to consider backward-compatibility rules? |
src/main/thrift/parquet.thrift
Outdated
* Special calculations: | ||
* - Enum: plain-encoded BYTE_ARRAY size | ||
* - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2 | ||
* - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that small Decimal can be encoded as FLBA or BYTE_ARRAY, but big decimal cannot be stored as i32. Should we force use the physical type or related with physical type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had this. I think given the two different opinions expressed, I'm going to change this field to only record variable width bytes, and say all other calcutions can be performed by readers based on type and number of values
Since we are storing rep/def levels directly I don't think so since those rules are applied on top of this data to make the correct inferences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy on this if some common util for histogram can be described and added. Thanks!
* | ||
* This value is optional when max_definition_level is 0. | ||
*/ | ||
3: optional list<i64> definition_level_histogram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, do we need to add an extra histogram for pair<def_level, rep_level>
if both exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. But I might not be following what you are suggesting (I gave two examples from Arrow below on usage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we store def_levels and ref_levels separately, how can we derive number of nulls in each level precisely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking of supporting pushing down filters like IS_NULL
or IS_NOT_NULL
to nested fields. So I want to make sure if this can satisfy the use case. Maybe we don't need precise null_count of each level but it would be great to answer yes or no to the filters above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might pay to illustrate exact queries, but if this is just answering a question is there any null element at a particular nesting level I think definition level histogram by itself gives that information.
Take a nested lists where both lists and elements can be nullable at each level. IIRC, the definition levels would represent as follows:
0 - Null top level list.
1 - empty top level list
2 - null nested list
3 - empty nested list
4 - null leaf element
5 - present leaf element
So if the query is for top level list is null
, one could prune when def_level[0] == 0
. For is not null
one could prune if def_level[0] == num_values from page (i.e. all values are null)
.
I believe similar logic holds for def_level[2]
but could get more complicated depending on the semantics of whether a top level null element should imply a the nested list is also null or if an empty list implies the nested list should be considered null (but should still be derivable by using histogram indices 0,1 and 2).
One thing the joint histogram (pairs of rep/def level counts) could give you is the number first list elements that are null, but I'm not sure how useful that is. I would need to think about other queries the joint histogram would enable (or if you have more examples of supported queries we can figure out if one is needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics
into page headers in parquet-mr.
If we only want these for the column chunk level then I would suggest having it under ColumnMetaData
directly.
@gszadovsky |
@emkornfield, I am not against re-adding such statistics to page headers if it have benefits. Meanwhile, I would only expect filling values that are really make sense and not redundant at the related level (page or column chunk). If a value make sense for filtering then it should be added to the column index instead of to the page header. |
This sounds reasonable, let me refactor this a little bit so we can add it to column index as well, with a little bit more description on each of the use-cases |
@gszadovszky updated the PR based on your feedback, let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I can accept having the old statistics saved in the page headers as well as column indexes (for backward compatibility), you also highlights the superiority of the column indexes. As so I would not extend the statistics saved in the page headers.
What do you think about adding SizeEstimationStatistics
directly to ColumnMetaData
and the histograms to the ColumnIndex
?
this sounds fine to me. We can always add separately to page headers if people want them. @wgtmac concerns with this? |
I am good with this approach. |
@@ -764,6 +810,14 @@ struct ColumnMetaData { | |||
* in a single I/O. | |||
*/ | |||
15: optional i32 bloom_filter_length; | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not related to this line, but to ColumnMetaData
in general)
For completeness-reasons, we might also want to add unencoded_byte_array_data_bytes
and num_entries
for the dictionary page (if existent) into the ColumnMetadata, i.e., dictionary_unencoded_byte_array_data_bytes
and num_dictionary_entries
.
This way, readers could plan how much memory the dictionary of a column chunk will take. This can help in decisions whether, e.g., to load the dictionary up-front to perform pre-filtering on the dictionary. It also helps to right-size the buffer that will hold the dictionary.
I'm not suggesting that this is a must-have for this commit or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this could be useful, my preference is to leave this out for now as it hasn't come up in discussion before we can always add this as follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields are already presented in the page header but it requires an in-efficient hop to read it.
Co-authored-by: JFinis <[email protected]>
Java PoC implementation: apache/parquet-java#1177 |
C++ PoC as well rapidsai/cudf#14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on apache/arrow#37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful. |
IMHO, two PoC impls do not imply that they have to be parquet-mr and arrow-cpp. So should we proceed to a formal vote on this PR? @emkornfield @gszadovszky Once this PR has been merged, I can proceed the release process of parquet format v2.10. After that, all pending PoCs can be formally reviewed and merged. |
I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy thought. @gszadovszky @pitrou |
I have @wgtmac's branch compiled locally, and can test the interoperability between the two existing PoC implementations. FWIW I've been using a modified parquet-cli in parquet-mr to print out the statistics I generate, so I don't foresee any big problems there. Update: I've verified that spark with Gang's changes can read my files, and my implementation can read the spark generated files. Also verified the histograms and unencoded sizes match at the chunk level (pages are problematic because we choose different cutoff strategies, so even if the limits are set to the same values we get different page sizes). Update 2: I set rows per page to 100 and then got congruent files. Can now confirm both PoCs produce the same histograms and unencoded sizes at the page level as well. This is for a file with a wide variety of nesting with and without null values. Let me know if there are any specific files from parquet-testing you'd like to see run through both PoCs. |
Thanks @etseidl for verifying two implementations! As they are still in the PoC state, I think the manual verification is sufficient and prefer delaying the work on interoperability by adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged. WDYT? @emkornfield |
I think this is sufficient for now. @wgtmac if we think the Java PR is close enough I can start a vote for approval on this PR unless you want to do it? |
I think it is good time to start a vote. @emkornfield I will take care of the release process once the proposal passes the vote :) |
Started vote on mailing list. |
Could you help the vote on mailing list: https://lists.apache.org/thread/wgobz41mfldbhqpg9q4mdwypghg2cxg2? Help is needed from the PMC members. @ggershinsky @gszadovszky @julienledem @rdblue @wesm @xhochy |
The vote passed. Is the standard way of merging her using a squash commit? |
@emkornfield I don't think there is a formal agreement on it on the Parquet project. parquet-format/dev/merge_parquet_pr.py Line 139 in 6f3f909
|
Adds Parquet size statistics introduced in apache/parquet-format#197. Authors: - Ed Seidl (https://github.com/etseidl) - Nghia Truong (https://github.com/ttnghia) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #14000
Adds Parquet size statistics introduced in apache/parquet-format#197. Authors: - Ed Seidl (https://github.com/etseidl) - Nghia Truong (https://github.com/ttnghia) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: rapidsai#14000
No description provided.