Bloom filters for i8 and i16 always return false negatives #5550
Comments
I suspect this is probably an issue with sign extension, as Parquet doesn't natively support these types and they must be widened to `int32`.
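A minimal sketch (mine, not from the issue) of why sign extension matters here: the little-endian bytes of an `i8` differ from those of the same value widened to `i32`, so any byte-oriented hash of the two representations cannot agree, and a filter built from one will not match checks against the other.

```rust
fn main() {
    // An i8 value and its sign-extended i32 counterpart, as the writer stores it.
    let narrow: i8 = -1;
    let widened: i32 = narrow as i32;

    // The byte strings differ in both length and content, so a byte-oriented
    // hash (like the one used by the split-block bloom filter) differs too.
    println!("i8 bytes:  {:?}", narrow.to_le_bytes()); // [255]
    println!("i32 bytes: {:?}", widened.to_le_bytes()); // [255, 255, 255, 255]

    assert_ne!(narrow.to_le_bytes().to_vec(), widened.to_le_bytes().to_vec());
}
```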
I started looking into this. Hoping to have a PR open in the next few days. :)
As @tustvold mentioned, this appears to be caused by widening. I think a general solution would be to apply the same Arrow-to-Parquet type coercion that occurs on write prior to checking the bloom filter.
Overview

The general issue is that Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Thus, bloom filter checks might produce false negatives if passed the original Arrow representation. Correctness is only guaranteed when checking with the coerced Parquet value. For the issue mentioned above, Parquet only supports 32-bit and 64-bit physical integers, so `i8` and `i16` values are widened to `i32` on write.

However, this issue is not limited to integer types. For example, `Date64` (milliseconds since the epoch) is written as `Date32` (days since the epoch):

```rust
#[test]
fn date64_column_bloom_filter() {
    use chrono::NaiveDate;

    // construct input data
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let dates = vec![
        NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
        NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
        NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    ];
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_milliseconds())
        .collect();

    // write (asserts written values match input)
    let array = Arc::new(Date64Array::from(values.clone()));
    let mut options = RoundTripOptions::new(array, false);
    options.bloom_filter = true;
    let files = one_column_roundtrip_with_options(options);

    // checks bloom filter with original input
    // fails with message: Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );

    // convert to "days since epoch"
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_days() as i32)
        .collect();

    // passes
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );
}
```

I think the same would apply to decimal and interval types.

Solution

This only affects scenarios where Parquet was written via the Arrow writer. Since the SBBF is checked by the crate user, part of the solution involves documenting how the Arrow-Parquet type coercion affects the physical representation and, therefore, the bloom filter. However, I think we can make some changes to improve the safety/convenience of working with bloom filters. Some options:
The first two options are simple but still push some complexity to Arrow users. I think the last option is ideal, but it is definitely a much larger effort. That said, these options aren't mutually exclusive. Would love to hear other thoughts on this. :)
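For illustration, a hypothetical sketch of what the coercion-helper direction might look like: free functions (names invented here; the real API shape is up for discussion) that map an Arrow-side value to the physical value the writer actually stored, to be applied before checking the SBBF.

```rust
// Hypothetical helper names, for illustration only.

/// i8/i16 are stored as Parquet INT32, so widen before hashing/checking.
fn widen_i16(v: i16) -> i32 {
    v as i32
}

/// Date64 (ms since epoch) is stored as Date32 (days since epoch).
/// Note: for pre-epoch dates this division truncates toward zero,
/// matching chrono's `num_days`.
fn date64_to_date32(ms: i64) -> i32 {
    (ms / 86_400_000) as i32
}

fn main() {
    assert_eq!(widen_i16(-5), -5i32);
    // 2024-01-01 is 19,723 days after the epoch.
    assert_eq!(date64_to_date32(19_723 * 86_400_000), 19_723);
}
```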
FWIW, it appears that the same issue applies to unsigned integer types as well.
Thank you for the analysis @mr-brobot. I think your suggestions 1 or 2 in #5550 (comment) sound reasonable to me (as that function would be needed even if we were to implement 3). cc @Ted-Jiang
Ok, I'll move forward with options 1 and 2. Will let folks decide if we keep both or just one in the PR discussion. 🙂 I'll also open a separate issue for option 3. I think that topic deserves a dedicated conversation thread. I'm traveling this week but should have time to make progress this weekend. 🙂
Describe the bug

It is unclear to me if this is an issue when building or checking the bloom filter; but either way, a bloom filter built with `i8` or `i16` values (as opposed to `i32` or `i64`) always returns `false` when checked.

To Reproduce
returns:
Expected behavior
These tests should pass.
Additional context
I found this from Datafusion: apache/datafusion#9779