-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MapArrayReader
(#2484) (#1699) (#1561)
#2500
Conversation
use arrow::record_batch::RecordBatch; | ||
use bytes::Bytes; | ||
|
||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Credit to @d-willis for this test
ArrowType::Map(element, _) => match element.data_type() { | ||
ArrowType::Struct(fields) if fields.len() == 2 => { | ||
// The inner map field must always non-nullable (#1697) | ||
assert!(!element.is_nullable(), "map struct cannot be nullable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As described in #1697 the encoding of DataType::Map permits greater nullability than the type can actually contain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1697 is a question -- so is the solution "we will assume that the inner struct can not be nullable until we have an existence proof to the contrary?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it is more, if this level is nullable the schema is inconsistent as there is no way to represent that in parquet 😅 The schema inference logic will never generate this - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/complex.rs#L350
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth a comment there
Hi! It's quite a coincidence, but we've been hit by this very problem and spent some time on it. We created a small project to show how to reproduce our issue and just found #2484 as we were about to open our issue. Thanks a lot! Cheers, |
MapArrayReader
(#2484) (#1699) (#1561)
Giving this a test against our code now. Hopefully will have run by tomorrow... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tustvold - I think this looks really nice (it is awesome to fix bugs by deleting code).
I think the test should be updated to assert that nullness
on the read array was correct, but otherwise this PR looks good to go to me.
The feedback from @tvincent2 (thanks for that btw 👋 ) also makes a compelling case for the benefits of this PR. 👍
ArrowType::Map(element, _) => match element.data_type() { | ||
ArrowType::Struct(fields) if fields.len() == 2 => { | ||
// The inner map field must always non-nullable (#1697) | ||
assert!(!element.is_nullable(), "map struct cannot be nullable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1697 is a question -- so is the solution "we will assume that the inner struct can not be nullable until we have an existence proof to the contrary?"
map_def_level: rep_level, | ||
map_rep_level: def_level, | ||
} | ||
let struct_def_level = match nullable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could document what nullable
means in this context as part of a docstring
assert!(!element.is_nullable(), "map struct cannot be nullable"); | ||
element | ||
} | ||
_ => unreachable!("expected struct with two fields"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eventually it might be a nicer UX to return an error rather than panic'ing here (e.g. MapArrayReader::try_new()
) but that would also change the API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a public API so I could change it, but that also makes me less inclined to do so. It is a bug in the builder if you run into this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense 👍
let data = array.data().clone(); | ||
let builder = data.into_builder().data_type(self.data_type.clone()); | ||
Ok(Arc::new(MapArray::from(unsafe { | ||
builder.build_unchecked() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unsafe
is OK because the inner StructArrayReader
already validated the contents of the structs, and MapArrayReader::new()
already validated the DataType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much, will add a doc comment
map_def_level: i16, | ||
#[allow(unused)] | ||
map_rep_level: i16, | ||
reader: ListArrayReader<i32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for maybe_record_batch in record_batch_reader { | ||
let record_batch = maybe_record_batch.expect("Getting current batch"); | ||
let col = record_batch.column(0); | ||
let map_entry = array::as_map_array(col).value(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this test also validate that the first two entries are null? Something like
assert!(!array::as_map_array(col).is_valid(0));
assert!(!array::as_map_array(col).is_valid(1));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was originally written to test my very specific failure condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly a worthwhile addition though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sorry I probably should have phrased that as "I suggest this test also validate the first two entries are null"
Benchmark runs are scheduled for baseline = f3afdd2 and contender = c34f07f. c34f07f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
@@ -193,6 +201,8 @@ mod tests { | |||
for maybe_record_batch in record_batch_reader { | |||
let record_batch = maybe_record_batch.expect("Getting current batch"); | |||
let col = record_batch.column(0); | |||
assert!(col.is_null(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Thanks for the quick turn around on this! |
Which issue does this PR close?
Closes #2484
Closes #1699
Closes #1561
Rationale for this change
The ListArrayReader logic is better tested, more complete and can be used to read MapArrays
What changes are included in this PR?
Replaces the logic in MapArrayReader with ListArrayReader
Are there any user-facing changes?
No