-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reading/writing nested null arrays (#1480) (#1036) (#1399) #1481
Conversation
@@ -887,7 +886,7 @@ fn remove_indices( | |||
Ok(Arc::new(StructArray::from((new_columns, valid.finish())))) | |||
} | |||
} | |||
ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))), | |||
ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() - indices.len()))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I remember by deducting the null indices, the resulting record batch will be empty, but actually there are a row with empty list in previous issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was caused by #1448 (comment) if I'm not mistaken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1513,6 +1513,43 @@ mod tests { | |||
required_and_optional::<LargeStringArray, _>(raw_strs); | |||
} | |||
|
|||
#[test] | |||
fn null_list_single_column() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an adapted version of the test added by @novemberkilo in https://github.com/apache/arrow-rs/pull/1477/files#diff-92fdd8b1e0a40e7072901fe74855da759d2d97f02c4f7224f4217468d545ecacR1276
@@ -1564,6 +1601,25 @@ mod tests { | |||
one_column_roundtrip(values, true, Some(SMALL_SIZE / 2)); | |||
} | |||
|
|||
#[test] | |||
fn list_nested_nulls() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was just something I wrote whilst exploring the problem space, and I think is generally useful
// TODO: The behaviour of a <list<null>> is untested | ||
DataType::Null => vec![list_level], | ||
DataType::Boolean | ||
DataType::Null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix for #1036
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, i think this is correct.
DataType::Null | ||
| DataType::Boolean | ||
// A NullArray is entirely nulls, despite not containing a null buffer | ||
DataType::Null => ((0..=(len as i64)).collect(), vec![false; len]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix for #1480
|
||
// convert to arrays | ||
let array = arrow::array::NullArray::new(records_read); | ||
let array = arrow::array::NullArray::new(self.record_reader.num_values()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bug, the NullArray should have the value count number of elements, i.e. the same number of elements as the definition/repetition levels if any.
In the non-nested case the two quantities will be the same, but in the event of a nested array a single record might correspond to 1 or more values.
This would then cause this line to fire - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L930
Codecov Report
@@ Coverage Diff @@
## master #1481 +/- ##
==========================================
- Coverage 82.72% 82.71% -0.02%
==========================================
Files 187 187
Lines 54186 54239 +53
==========================================
+ Hits 44824 44862 +38
- Misses 9362 9377 +15
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the PR, the changes that you made make sense to me, and I can see how they address the issues that you've listed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good.
Thanks all! |
Which issue does this PR close?
Closes #1480
Closes #1036
Closes #1399
Rationale for this change
See tickets
What changes are included in this PR?
Fixes various bugs in handling of null arrays
Are there any user-facing changes?
Previous incorrect behaviour is now fixed