-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized writing of byte array to parquet (#1764) (2x faster) #2221
Conversation
/// Returns the min and max values in this collection, skipping any NaN values | ||
/// | ||
/// Returns `None` if no values found | ||
fn min_max(&self, descr: &ColumnDescriptor) -> Option<(&Self::T, &Self::T)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved onto Encoder so that ColumnValues
can be a type-erased type, e.g. ArrayRef. This will be critical to support dictionaries without needing GATs, as the TypedDictionary (#2136) contains a lifetime.
_ => self.encoder.put(slice), | ||
} | ||
fn write_gather(&mut self, values: &Self::Values, indices: &[usize]) -> Result<()> { | ||
let slice: Vec<_> = indices.iter().map(|idx| values[*idx].clone()).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pushed down from get_numeric_array_slice in arrow writer
mut valid: impl Iterator<Item = usize>, | ||
) -> Option<(ByteArray, ByteArray)> | ||
where | ||
T: ArrayAccessor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the new ArrayAccessor 😄
} else { | ||
num_required_bits(num_entries as u64 - 1) | ||
} | ||
num_required_bits(self.num_entries().saturating_sub(1) as u64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic was actually previously incorrect as it would return a bit_width of 1 for num_entries == 1
when it only needed to be 0. This is largely harmless, but is worth fixing.
|
||
impl ColumnValueEncoder for ByteArrayEncoder { | ||
type T = ByteArray; | ||
type Values = ArrayRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I had the concrete type here, i.e. StringArray
. This works, however, would present difficulties in adapting this to preserve dictionaries, as TypedDictionary (#2136) will contain a lifetime, which would then require GATs here
Codecov Report
@@ Coverage Diff @@
## master #2221 +/- ##
==========================================
- Coverage 82.29% 82.27% -0.02%
==========================================
Files 244 245 +1
Lines 62443 62654 +211
==========================================
+ Hits 51386 51549 +163
- Misses 11057 11105 +48
Help us with your feedback. Take ten seconds to tell us how you rate us. |
af94d6c
to
f90e5ae
Compare
} | ||
|
||
// TODO: These methods don't handle non null indices correctly (#1753) | ||
def_get_binary_array_fn!(get_binary_array, arrow_array::BinaryArray); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing these fixes #1753
T::Item: AsRef<[u8]>, | ||
{ | ||
self.num_values += indices.len(); | ||
match &mut self.encoder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/apache/parquet-format/blob/master/Encodings.md for what the various encodings are. They are all relatively self-explantory
f90e5ae
to
7d6a5b9
Compare
Benchmark runs are scheduled for baseline = 42b15a8 and contender = 2c09ba4. 2c09ba4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Part of #1764
Closes #1753
Rationale for this change
And there is still low-hanging fruit for optimisation here
What changes are included in this PR?
Switches encoding arrow arrays to a specialized write path
Are there any user-facing changes?
No