-
Notifications
You must be signed in to change notification settings - Fork 224
Added comparable row-oriented representation of a collection of [Array
].
#1287
Conversation
Codecov ReportBase: 83.04% // Head: 83.11% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1287 +/- ##
==========================================
+ Coverage 83.04% 83.11% +0.07%
==========================================
Files 364 369 +5
Lines 39259 40187 +928
==========================================
+ Hits 32602 33401 +799
- Misses 6657 6786 +129
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really great. Thanks a lot, @RinChanNOWWW !
I left minor comments that I do not think need an action - they are improvements to the API and performance optimizations that I thought about while going through it.
I will wait for the CI to be green and we can merge it 👍
columns.iter().zip(self.fields.iter()).zip(dictionaries) | ||
{ | ||
// We encode a column at a time to minimise dispatch overheads | ||
encode_column(&mut rows, column, field.options, dictionary.as_deref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be embarassibly parallel. Given that this is a transpose of O(N x C)
where N is length and C number of columns, I wonder if we could split this so users can parallelize. Not for this PR; just something to be aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost parallelizable - it is changing rows
. However, there is still an optimization since modifying rows
is O(1)
but encoding is O(C)
. Will continue to think about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm puzzled about how to convert this into a parallel one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
The idea would be that we can build Vec<u8>
per column in parallel and only write them to the rows afterwards. However, I realized later that "encoding" is very simple and low overhead, so I am not sure it is worth.
Edit: List types encoding is hard. Let's ignore these types now. |
@@ -0,0 +1,430 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this is quite similar (possibly copied directly from ) arrow-rs: https://github.com/apache/arrow-rs/blob/65d5576/arrow/src/row/interner.rs
Which is fine, I just wanted to give @tustvold credit where credit is due
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the basic idea is the same as arrow-rs, and I adapted the codes to fit arrow2 data structures. Thanks for arrow-rs and @tustvold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think it is pretty cool that the flow of technology is now from arrow to arrow2, it's a nice validation of the work I put into porting the ideas behind arrow2 across.Perhaps eventually we can combine efforts 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think it is pretty cool that the flow of technology is now from arrow to arrow2
I believe it has always been both ways - which is part of the beauty of open source - people can do (almost) whatever they want without having to be worried about who "owns" what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work on this @tustvold!
|
||
fn main() { | ||
if version_meta().unwrap().channel == Channel::Nightly { | ||
println!("cargo:rustc-cfg=nightly_build"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to introduce a new feature called nightly
instead of adding a new dep in build-dependencies
.
And I think the current implementation is wrong.
- #![cfg_attr(feature = "nightly_build", feature(build_hasher_simple_hash_one))]
+ #![cfg_attr(nightly_build, feature(build_hasher_simple_hash_one))]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can take hashbrown
's implementation for a look: rust-lang/hashbrown#292
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to introduce a new feature called nightly instead of adding a new dep in build-dependencies.
But users need to add this to the features list. build.rs
can help to select automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But users need to add this to the features list.
build.rs
can help to select automatically.
I prefer adding a new feature instead. Leave this for @jorgecarleitao to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also prefer a specific feature, but both are mergable :) - we use this for simd
also where users activate simd if they are in nightly. Maybe the nightly feature can activate both simd and this path? I can work out the details once we merge this :)
use std::ops::Index; | ||
|
||
use hashbrown::hash_map::RawEntryMut; | ||
use hashbrown::HashMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my best knowledge, rust std's HashMap
uses the same implementation as hashbrown
. Is there any reason to hashbrown directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to hashbrown directly?
No, it just follows arrow-rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using std's HashMap
instead? So we don't need to introduce hashbrown
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least historically the raw entry API, which this requires to store the values separately from the HashMap, isn't exposed in std
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least historically the raw entry API, which this requires to store the values separately from the HashMap, isn't exposed in std
Thanks for explanation!
Is the raw entry API we use here something different from https://doc.rust-lang.org/std/collections/hash_map/enum.Entry.html?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, we have https://doc.rust-lang.org/std/collections/hash_map/enum.RawEntryMut.html under nightly feature hash_raw_entry
.
For the latest CI failing in error: the borrowed expression implements the required traits
--> src/io/parquet/write/pages.rs:301:48
|
301 | let boolean = BooleanArray::from_slice(&[false, false, true, true]).boxed();
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: change this to: `[false, false, true, true]`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow Welcome to rust 1.65! We can fix them by |
62 files have clippy problems under rust 1.65 (83 files under rust 1.67). We can fix them in another pr. |
Array
].Array
].
Thank you @tustvold for the contribution (eagerly await the blog post!), @RinChanNOWWW for the port, and @Xuanwo and @ritchie46 for reviewing it! 🙇 |
In case anyone missed it, here is the blog post about the format https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/ |
The demo in https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/ seems be wrong. |
Thanks for the report @RinChanNOWWW -- here is a PR to correct this apache/arrow-site#270 |
Implemtation of a comparable row format in arrow2.
The
row
module is used to convert columns into rows bySortOption
s. After converting columns to rows, we can compare two 'row's directly withoutbuild_compare
. This is very useful fororder by
execution in query engines and can help to improve the performance.A
Rows
is fix-sized, and useBox<[u8]>
to store data. So there won't be too much heap allocation likeVec<u8>
.Struture:
closes #1280