-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert rows to arrays (#2677) #2826
Conversation
type Encoded = [u8; 16]; | ||
/// The raw bytes of a decimal | ||
#[derive(Copy, Clone)] | ||
pub struct RawDecimal<const N: usize>(pub [u8; N]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is necessary because Decimal128 encodes the precision, etc...
Longer-term this will all get removed by #2637
let mut converter = RowConverter::new(vec![SortField::new(DataType::UInt64)]); | ||
black_box(converter.convert_columns(&cols)) | ||
let mut converter = RowConverter::new(fields.clone()); | ||
black_box(converter.convert_columns(&cols).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, the dictionary benchmarks were actually benchmarking the time to error on the schema, as the DataType was incorrect 😱
This unwrap should prevent something similar happening in future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this PR pretty carefully, it looks good to me and I think it is well tested. I had some stylistic suggestions but nothing that I can see that would prevent merge
Reading this PR makes me wonder if making an arrow-row
crate would be a good idea 🤔
arrow/src/row/dictionary.rs
Outdated
let null_sentinel = match options.nulls_first { | ||
true => 0_u8, | ||
false => 0xFF, | ||
}; | ||
|
||
let null_terminator = match options.descending { | ||
true => 0xFF, | ||
false => 0_u8, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have to match what happens in encode_column
-- I wonder if there is some way to make it clearer in the code they must match / that assumption
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the encoding logic from encode_column into this module
.skip(1) | ||
.position(|x| *x == null_terminator) | ||
.unwrap(); | ||
let key = &row[1..key_offset + 2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the 2 to include the null terminator? I would have thought we could skip that terminator when copying values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't copying values, it is just copying a fat pointer (address + slice length). The interner expects the same key it spat out, which includes the null terminator
/// | ||
/// Values must be valid UTF-8 | ||
fn decode_binary<O: OffsetSizeTrait>(values: &[&[u8]]) -> ArrayData { | ||
let capacity = values.iter().map(|x| x.len()).sum(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a future optimization, it feels like these arrays could be built directly (rather than copying all the values over and then copying them again into an arrow buffer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't copying the values, but references to the values, but yes this potentially could be optimised
let mut rows: Vec<_> = rows | ||
.into_iter() | ||
.map(|row| { | ||
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
arrow/src/row/mod.rs
Outdated
DataType::Boolean => lengths.iter_mut().for_each(|x| *x += bool::ENCODED_LEN), | ||
DataType::Decimal128(_, _) => lengths.iter_mut().for_each(|x| *x += Decimal128::ENCODED_LEN), | ||
DataType::Decimal256(_, _) => lengths.iter_mut().for_each(|x| *x += Decimal256::ENCODED_LEN), | ||
DataType::Decimal128(_, _) => lengths.iter_mut().for_each(|x| *x += 17), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make these numbers named constants so it is clearer what is going on?
arrow/src/row/variable.rs
Outdated
options: SortOptions, | ||
) -> GenericBinaryArray<I> { | ||
let len = rows.len(); | ||
let null_sentinel = match options.nulls_first { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is another example of the implict null encoding that might be nice to make explicit
An |
Benchmark runs are scheduled for baseline = a0a263f and contender = f8c4037. f8c4037 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2677
Rationale for this change
The mostly mechanical process of converting the row format back to arrow arrays
What changes are included in this PR?
Adds the ability to convert the row format back to arrow arrays
Are there any user-facing changes?
No