-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support parquet bloom filter pruning for decimal128 #8930
Conversation
benchmarks/src/parquet_filter.rs
Outdated
@@ -89,16 +89,19 @@ impl RunOpt { | |||
pushdown_filters: false, | |||
reorder_filters: false, | |||
enable_page_index: false, | |||
enable_bloom_filter: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will support soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Maps column name to the parquet bloom filter | ||
column_sbbf: HashMap<String, Sbbf>, | ||
/// Maps column name to the parquet bloom filter and parquet physical type | ||
column_sbbf: HashMap<String, (Sbbf, Type)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the sbbf check
input is parquet type, so we need know the read parquet physical type
when construct check value 🤔 , is there any another way
@@ -80,15 +81,15 @@ async fn single_file() { | |||
// request_method = 'GET' | |||
.with_filter(col("request_method").eq(lit("GET"))) | |||
.with_pushdown_expected(PushdownExpected::Some) | |||
.with_expected_rows(688); | |||
.with_expected_rows(745); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb https://github.com/apache/arrow-datafusion/blob/72fb87e27bfaf62580f30fb6e85aed2a22fbe744/test-utils/src/data_gen.rs#L172 I think here every running test should get diff result 🤔 , but from the fact only change the generate method this would change. Is there something i miss?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure what you are asking here @Ted-Jiang . It seems like most of the tests in this file have been updated as well
Thanks @Ted-Jiang -- I hope to review this more carefully tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang -- I think this PR is really nicely done.
I didn't quite make it through all the changes in datafusion/core/tests/parquet/filter_pushdown.rs
-- would it be possible to avoid changes in the existing tests?
Either by avoiding changes to the data created or maybe by using a different data generator?
ScalarValue::Int16(Some(v)) => sbbf.check(v), | ||
ScalarValue::Int8(Some(v)) => sbbf.check(v), | ||
ScalarValue::Decimal128(Some(v), p, s) => match parquet_type { | ||
Type::INT32 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the value of the i128 the same as the value of this i32 (aka are we sure the top 12 bytes are always 0) as this code is ignores those bytes?
The same basic question applies to the Int64 case too (are we sure its top 8 bytes are always zero)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think add precision check avoid this situation
//https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L35-L42 | ||
// All physical type are little-endian | ||
let b = (*v as i32).to_le_bytes(); | ||
let decimal = Decimal::Int32 { | ||
value: b, | ||
precision: *p as i32, | ||
scale: *s as i32, | ||
}; | ||
sbbf.check(&decimal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to use the constructors instead: https://docs.rs/parquet/latest/parquet/data_type/enum.Decimal.html#method.from_i32
//https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L35-L42 | |
// All physical type are little-endian | |
let b = (*v as i32).to_le_bytes(); | |
let decimal = Decimal::Int32 { | |
value: b, | |
precision: *p as i32, | |
scale: *s as i32, | |
}; | |
sbbf.check(&decimal) | |
let decimal = Decimal::from_i32(* v as i32, *p as i32, *s as i32)' | |
sbbf.check(&decimal) |
(the same applies to the other types here as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file apache/arrow-rs#5325 to look at the bytes_order issue
datafusion/core/tests/parquet/mod.rs
Outdated
@@ -66,7 +66,9 @@ enum Scenario { | |||
Int32Range, | |||
Float64, | |||
Decimal, | |||
DecimalBloomFilter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are only two scenarios here but three encodings (Int32, Int64 and Bytes) handled by the code. Is there some way to over all three encodings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test
@@ -80,15 +81,15 @@ async fn single_file() { | |||
// request_method = 'GET' | |||
.with_filter(col("request_method").eq(lit("GET"))) | |||
.with_pushdown_expected(PushdownExpected::Some) | |||
.with_expected_rows(688); | |||
.with_expected_rows(745); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure what you are asking here @Ted-Jiang . It seems like most of the tests in this file have been updated as well
41a61f0
to
f52f5e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @Ted-Jiang . I had some comment / test suggestions but I think they could be done as a follow on PR too
Co-authored-by: Andrew Lamb <[email protected]>
🎉 |
Which issue does this PR close?
related #8880.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?