Added comparable row-oriented representation of a collection of [`Array`]. #1287

RinChanNOWWW · 2022-10-31T10:54:51Z

Implemtation of a comparable row format in arrow2.

The row module is used to convert columns into rows by SortOptions. After converting columns to rows, we can compare two 'row's directly without build_compare. This is very useful for order by execution in query engines and can help to improve the performance.

A Rows is fix-sized, and use Box<[u8]> to store data. So there won't be too much heap allocation like Vec<u8>.

Struture:

fixed: encode fix-sized types.
variale: encode variable-sized types.
dictionary: encode dictionary type.
interner: dictionary type sort helper.

closes #1280

src/lib.rs

codecov · 2022-11-02T05:16:13Z

Codecov Report

Base: 83.04% // Head: 83.11% // Increases project coverage by +0.07% 🎉

Coverage data is based on head (8e057bd) compared to base (5bd0c7a).
Patch coverage: 80.10% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1287      +/-   ##
==========================================
+ Coverage   83.04%   83.11%   +0.07%     
==========================================
  Files         364      369       +5     
  Lines       39259    40187     +928     
==========================================
+ Hits        32602    33401     +799     
- Misses       6657     6786     +129

Impacted Files	Coverage Δ
src/compute/sort/mod.rs	`38.07% <0.00%> (ø)`
src/lib.rs	`100.00% <ø> (ø)`
src/types/native.rs	`67.44% <0.00%> (+3.16%)`	⬆️
src/compute/sort/row/fixed.rs	`56.17% <56.17%> (ø)`
src/compute/sort/row/dictionary.rs	`65.21% <65.21%> (ø)`
src/compute/sort/row/mod.rs	`77.41% <77.41%> (ø)`
src/compute/sort/row/interner.rs	`96.88% <96.88%> (ø)`
src/compute/sort/row/variable.rs	`100.00% <100.00%> (ø)`
src/io/orc/read/mod.rs	`83.92% <0.00%> (-0.07%)`	⬇️
src/io/ipc/read/common.rs	`94.90% <0.00%> (-0.02%)`	⬇️
... and 16 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

jorgecarleitao

This looks really great. Thanks a lot, @RinChanNOWWW !

I left minor comments that I do not think need an action - they are improvements to the API and performance optimizations that I thought about while going through it.

I will wait for the CI to be green and we can merge it 👍

src/compute/sort/row/mod.rs

jorgecarleitao · 2022-11-02T05:48:18Z

src/compute/sort/row/mod.rs

+            columns.iter().zip(self.fields.iter()).zip(dictionaries)
+        {
+            // We encode a column at a time to minimise dispatch overheads
+            encode_column(&mut rows, column, field.options, dictionary.as_deref())


this seems to be embarassibly parallel. Given that this is a transpose of O(N x C) where N is length and C number of columns, I wonder if we could split this so users can parallelize. Not for this PR; just something to be aware of.

This is almost parallelizable - it is changing rows. However, there is still an optimization since modifying rows is O(1) but encoding is O(C). Will continue to think about this.

I'm puzzled about how to convert this into a parallel one?

I agree.

The idea would be that we can build Vec<u8> per column in parallel and only write them to the rows afterwards. However, I realized later that "encoding" is very simple and low overhead, so I am not sure it is worth.

src/compute/sort/row/mod.rs

RinChanNOWWW · 2022-11-02T07:17:47Z

~~I'm adding the row encoding for List types. And then I will apply the advices above.~~

Edit: List types encoding is hard. Let's ignore these types now.

Cargo.toml

alamb · 2022-11-03T20:53:00Z

src/compute/sort/row/interner.rs

@@ -0,0 +1,430 @@
+// Licensed to the Apache Software Foundation (ASF) under one


FWIW this is quite similar (possibly copied directly from ) arrow-rs: https://github.com/apache/arrow-rs/blob/65d5576/arrow/src/row/interner.rs

Which is fine, I just wanted to give @tustvold credit where credit is due

Yes, the basic idea is the same as arrow-rs, and I adapted the codes to fit arrow2 data structures. Thanks for arrow-rs and @tustvold.

FWIW I think it is pretty cool that the flow of technology is now from arrow to arrow2, it's a nice validation of the work I put into porting the ideas behind arrow2 across.Perhaps eventually we can combine efforts 😄

FWIW I think it is pretty cool that the flow of technology is now from arrow to arrow2

I believe it has always been both ways - which is part of the beauty of open source - people can do (almost) whatever they want without having to be worried about who "owns" what.

Awesome work on this @tustvold!

Xuanwo · 2022-11-04T04:30:59Z

build.rs

+
+fn main() {
+    if version_meta().unwrap().channel == Channel::Nightly {
+        println!("cargo:rustc-cfg=nightly_build");


I prefer to introduce a new feature called nightly instead of adding a new dep in build-dependencies.

And I think the current implementation is wrong.

- #![cfg_attr(feature = "nightly_build", feature(build_hasher_simple_hash_one))] + #![cfg_attr(nightly_build, feature(build_hasher_simple_hash_one))]

We can take hashbrown's implementation for a look: rust-lang/hashbrown#292

I prefer to introduce a new feature called nightly instead of adding a new dep in build-dependencies.

But users need to add this to the features list. build.rs can help to select automatically.

But users need to add this to the features list. build.rs can help to select automatically.

I prefer adding a new feature instead. Leave this for @jorgecarleitao to decide.

I would also prefer a specific feature, but both are mergable :) - we use this for simd also where users activate simd if they are in nightly. Maybe the nightly feature can activate both simd and this path? I can work out the details once we merge this :)

Xuanwo · 2022-11-04T04:38:01Z

src/compute/sort/row/interner.rs

+use std::ops::Index;
+
+use hashbrown::hash_map::RawEntryMut;
+use hashbrown::HashMap;


To my best knowledge, rust std's HashMap uses the same implementation as hashbrown. Is there any reason to hashbrown directly?

Is there any reason to hashbrown directly?

No, it just follows arrow-rs.

How about using std's HashMap instead? So we don't need to introduce hashbrown.

At least historically the raw entry API, which this requires to store the values separately from the HashMap, isn't exposed in std

At least historically the raw entry API, which this requires to store the values separately from the HashMap, isn't exposed in std

Thanks for explanation!

Is the raw entry API we use here something different from https://doc.rust-lang.org/std/collections/hash_map/enum.Entry.html?

By the way, we have https://doc.rust-lang.org/std/collections/hash_map/enum.RawEntryMut.html under nightly feature hash_raw_entry.

Xuanwo · 2022-11-04T05:48:22Z

For the latest CI failing in Check and test / Clippy (pull_request): https://github.com/jorgecarleitao/arrow2/actions/runs/3390608894/jobs/5636710157

error: the borrowed expression implements the required traits
   --> src/io/parquet/write/pages.rs:301:48
    |
301 |         let boolean = BooleanArray::from_slice(&[false, false, true, true]).boxed();
    |                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: change this to: `[false, false, true, true]`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow

Welcome to rust 1.65! We can fix them by cargo clippy --fix.

RinChanNOWWW · 2022-11-04T06:16:04Z

62 files have clippy problems under rust 1.65 (83 files under rust 1.67). We can fix them in another pr.

jorgecarleitao · 2022-11-04T06:28:27Z

Thank you @tustvold for the contribution (eagerly await the blog post!), @RinChanNOWWW for the port, and @Xuanwo and @ritchie46 for reviewing it! 🙇

…ay`]. (jorgecarleitao#1287)

alamb · 2022-11-09T00:55:22Z

Thank you @tustvold for the contribution (eagerly await the blog post!),

In case anyone missed it, here is the blog post about the format

https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/

RinChanNOWWW · 2022-11-09T13:58:23Z

and is then followed by the unpadded length of this final block as a single byte in place of a continuation token

The demo in https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/ seems be wrong.

alamb · 2022-11-09T15:46:52Z

Thanks for the report @RinChanNOWWW -- here is a PR to correct this apache/arrow-site#270

RinChanNOWWW force-pushed the row branch from fdcbccd to 5b75367 Compare October 31, 2022 10:57

A comparable row-oriented representation of a collection of [Array].

78c61d5

RinChanNOWWW force-pushed the row branch from 9bf3629 to 78c61d5 Compare November 1, 2022 01:43

Remove unstable codes.

b394ea6

ritchie46 reviewed Nov 1, 2022

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

use feature build_hasher_simple_hash_one when build on nightly.

286ad53

RinChanNOWWW marked this pull request as draft November 2, 2022 02:27

jorgecarleitao approved these changes Nov 2, 2022

View reviewed changes

Improve codes.

8bd6417

RinChanNOWWW marked this pull request as ready for review November 2, 2022 10:17

RinChanNOWWW mentioned this pull request Nov 3, 2022

feat(query): improve sort. databendlabs/databend#8452

Merged

jorgecarleitao approved these changes Nov 3, 2022

View reviewed changes

Cargo.toml Outdated Show resolved Hide resolved

Update Cargo.toml.

fb6c085

alamb reviewed Nov 3, 2022

View reviewed changes

Ignore miri test.

8e057bd

Xuanwo reviewed Nov 4, 2022

View reviewed changes

jorgecarleitao changed the title ~~feat: A comparable row-oriented representation of a collection of [Array].~~ Added comparable row-oriented representation of a collection of [Array]. Nov 4, 2022

jorgecarleitao merged commit 562de6a into jorgecarleitao:main Nov 4, 2022

RinChanNOWWW deleted the row branch November 4, 2022 06:33

This was referenced Nov 4, 2022

Fixed clippy warnings from 1.65 #1289

Merged

refactor(storage): replace RwLocks with DashMap databendlabs/databend#8639

Merged

ritchie46 pushed a commit to ritchie46/arrow2 that referenced this pull request Nov 6, 2022

Added comparable row-oriented representation of a collection of [`Arr…

7d2ad16

…ay`]. (jorgecarleitao#1287)

alamb mentioned this pull request Nov 9, 2022

MINOR: [Website] Fix incorrect diagram for 2022-11-07-multi-column-sorts-in-arrow-rust-part-2.md apache/arrow-site#270

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added comparable row-oriented representation of a collection of [`Array`]. #1287

Added comparable row-oriented representation of a collection of [`Array`]. #1287

RinChanNOWWW commented Oct 31, 2022 •

edited

Loading

codecov bot commented Nov 2, 2022 •

edited

Loading

jorgecarleitao left a comment

jorgecarleitao Nov 2, 2022

jorgecarleitao Nov 2, 2022

RinChanNOWWW Nov 3, 2022

jorgecarleitao Nov 3, 2022

RinChanNOWWW commented Nov 2, 2022 •

edited

Loading

alamb Nov 3, 2022

RinChanNOWWW Nov 4, 2022

tustvold Nov 4, 2022

jorgecarleitao Nov 4, 2022

ritchie46 Nov 4, 2022 •

edited

Loading

Xuanwo Nov 4, 2022

Xuanwo Nov 4, 2022

RinChanNOWWW Nov 4, 2022

Xuanwo Nov 4, 2022

jorgecarleitao Nov 4, 2022

Xuanwo Nov 4, 2022

RinChanNOWWW Nov 4, 2022

Xuanwo Nov 4, 2022

tustvold Nov 4, 2022 •

edited

Loading

Xuanwo Nov 4, 2022

Xuanwo Nov 4, 2022

Xuanwo commented Nov 4, 2022

RinChanNOWWW commented Nov 4, 2022

jorgecarleitao commented Nov 4, 2022

alamb commented Nov 9, 2022

RinChanNOWWW commented Nov 9, 2022

alamb commented Nov 9, 2022 •

edited

Loading

		@@ -0,0 +1,430 @@
		// Licensed to the Apache Software Foundation (ASF) under one

Added comparable row-oriented representation of a collection of [Array]. #1287

Added comparable row-oriented representation of a collection of [Array]. #1287

Conversation

RinChanNOWWW commented Oct 31, 2022 • edited Loading

codecov bot commented Nov 2, 2022 • edited Loading

Codecov Report

jorgecarleitao left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RinChanNOWWW commented Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ritchie46 Nov 4, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Nov 4, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Xuanwo commented Nov 4, 2022

RinChanNOWWW commented Nov 4, 2022

jorgecarleitao commented Nov 4, 2022

alamb commented Nov 9, 2022

RinChanNOWWW commented Nov 9, 2022

alamb commented Nov 9, 2022 • edited Loading

Added comparable row-oriented representation of a collection of [`Array`]. #1287

Added comparable row-oriented representation of a collection of [`Array`]. #1287

RinChanNOWWW commented Oct 31, 2022 •

edited

Loading

codecov bot commented Nov 2, 2022 •

edited

Loading

RinChanNOWWW commented Nov 2, 2022 •

edited

Loading

ritchie46 Nov 4, 2022 •

edited

Loading

tustvold Nov 4, 2022 •

edited

Loading

alamb commented Nov 9, 2022 •

edited

Loading