Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added comparable row-oriented representation of a collection of [Array]. #1287

Merged
merged 6 commits into from
Nov 4, 2022

Conversation

RinChanNOWWW
Copy link
Contributor

@RinChanNOWWW RinChanNOWWW commented Oct 31, 2022

Implemtation of a comparable row format in arrow2.

The row module is used to convert columns into rows by SortOptions. After converting columns to rows, we can compare two 'row's directly without build_compare. This is very useful for order by execution in query engines and can help to improve the performance.

A Rows is fix-sized, and use Box<[u8]> to store data. So there won't be too much heap allocation like Vec<u8>.

Struture:

  • fixed: encode fix-sized types.
  • variale: encode variable-sized types.
  • dictionary: encode dictionary type.
  • interner: dictionary type sort helper.

closes #1280

src/lib.rs Outdated Show resolved Hide resolved
@RinChanNOWWW RinChanNOWWW marked this pull request as draft November 2, 2022 02:27
@codecov
Copy link

codecov bot commented Nov 2, 2022

Codecov Report

Base: 83.04% // Head: 83.11% // Increases project coverage by +0.07% 🎉

Coverage data is based on head (8e057bd) compared to base (5bd0c7a).
Patch coverage: 80.10% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1287      +/-   ##
==========================================
+ Coverage   83.04%   83.11%   +0.07%     
==========================================
  Files         364      369       +5     
  Lines       39259    40187     +928     
==========================================
+ Hits        32602    33401     +799     
- Misses       6657     6786     +129     
Impacted Files Coverage Δ
src/compute/sort/mod.rs 38.07% <0.00%> (ø)
src/lib.rs 100.00% <ø> (ø)
src/types/native.rs 67.44% <0.00%> (+3.16%) ⬆️
src/compute/sort/row/fixed.rs 56.17% <56.17%> (ø)
src/compute/sort/row/dictionary.rs 65.21% <65.21%> (ø)
src/compute/sort/row/mod.rs 77.41% <77.41%> (ø)
src/compute/sort/row/interner.rs 96.88% <96.88%> (ø)
src/compute/sort/row/variable.rs 100.00% <100.00%> (ø)
src/io/orc/read/mod.rs 83.92% <0.00%> (-0.07%) ⬇️
src/io/ipc/read/common.rs 94.90% <0.00%> (-0.02%) ⬇️
... and 16 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really great. Thanks a lot, @RinChanNOWWW !

I left minor comments that I do not think need an action - they are improvements to the API and performance optimizations that I thought about while going through it.

I will wait for the CI to be green and we can merge it 👍

src/compute/sort/row/mod.rs Outdated Show resolved Hide resolved
src/compute/sort/row/mod.rs Outdated Show resolved Hide resolved
src/compute/sort/row/mod.rs Outdated Show resolved Hide resolved
columns.iter().zip(self.fields.iter()).zip(dictionaries)
{
// We encode a column at a time to minimise dispatch overheads
encode_column(&mut rows, column, field.options, dictionary.as_deref())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be embarassibly parallel. Given that this is a transpose of O(N x C) where N is length and C number of columns, I wonder if we could split this so users can parallelize. Not for this PR; just something to be aware of.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost parallelizable - it is changing rows. However, there is still an optimization since modifying rows is O(1) but encoding is O(C). Will continue to think about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm puzzled about how to convert this into a parallel one?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

The idea would be that we can build Vec<u8> per column in parallel and only write them to the rows afterwards. However, I realized later that "encoding" is very simple and low overhead, so I am not sure it is worth.

src/compute/sort/row/mod.rs Outdated Show resolved Hide resolved
src/compute/sort/row/mod.rs Outdated Show resolved Hide resolved
src/compute/sort/row/mod.rs Show resolved Hide resolved
src/compute/sort/row/mod.rs Show resolved Hide resolved
src/compute/sort/row/mod.rs Outdated Show resolved Hide resolved
src/compute/sort/row/mod.rs Show resolved Hide resolved
@RinChanNOWWW
Copy link
Contributor Author

RinChanNOWWW commented Nov 2, 2022

I'm adding the row encoding for List types. And then I will apply the advices above.

Edit: List types encoding is hard. Let's ignore these types now.

@RinChanNOWWW RinChanNOWWW marked this pull request as ready for review November 2, 2022 10:17
Cargo.toml Outdated Show resolved Hide resolved
@@ -0,0 +1,430 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is quite similar (possibly copied directly from ) arrow-rs: https://github.com/apache/arrow-rs/blob/65d5576/arrow/src/row/interner.rs

Which is fine, I just wanted to give @tustvold credit where credit is due

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the basic idea is the same as arrow-rs, and I adapted the codes to fit arrow2 data structures. Thanks for arrow-rs and @tustvold.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think it is pretty cool that the flow of technology is now from arrow to arrow2, it's a nice validation of the work I put into porting the ideas behind arrow2 across.Perhaps eventually we can combine efforts 😄

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think it is pretty cool that the flow of technology is now from arrow to arrow2

I believe it has always been both ways - which is part of the beauty of open source - people can do (almost) whatever they want without having to be worried about who "owns" what.

Copy link
Collaborator

@ritchie46 ritchie46 Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work on this @tustvold!


fn main() {
if version_meta().unwrap().channel == Channel::Nightly {
println!("cargo:rustc-cfg=nightly_build");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to introduce a new feature called nightly instead of adding a new dep in build-dependencies.

And I think the current implementation is wrong.

- #![cfg_attr(feature = "nightly_build", feature(build_hasher_simple_hash_one))]
+ #![cfg_attr(nightly_build, feature(build_hasher_simple_hash_one))]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can take hashbrown's implementation for a look: rust-lang/hashbrown#292

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to introduce a new feature called nightly instead of adding a new dep in build-dependencies.

But users need to add this to the features list. build.rs can help to select automatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But users need to add this to the features list. build.rs can help to select automatically.

I prefer adding a new feature instead. Leave this for @jorgecarleitao to decide.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also prefer a specific feature, but both are mergable :) - we use this for simd also where users activate simd if they are in nightly. Maybe the nightly feature can activate both simd and this path? I can work out the details once we merge this :)

use std::ops::Index;

use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my best knowledge, rust std's HashMap uses the same implementation as hashbrown. Is there any reason to hashbrown directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to hashbrown directly?

No, it just follows arrow-rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using std's HashMap instead? So we don't need to introduce hashbrown.

Copy link
Contributor

@tustvold tustvold Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least historically the raw entry API, which this requires to store the values separately from the HashMap, isn't exposed in std

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least historically the raw entry API, which this requires to store the values separately from the HashMap, isn't exposed in std

Thanks for explanation!

Is the raw entry API we use here something different from https://doc.rust-lang.org/std/collections/hash_map/enum.Entry.html?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, we have https://doc.rust-lang.org/std/collections/hash_map/enum.RawEntryMut.html under nightly feature hash_raw_entry.

@Xuanwo
Copy link
Contributor

Xuanwo commented Nov 4, 2022

For the latest CI failing in Check and test / Clippy (pull_request): https://github.com/jorgecarleitao/arrow2/actions/runs/3390608894/jobs/5636710157

error: the borrowed expression implements the required traits
   --> src/io/parquet/write/pages.rs:301:48
    |
301 |         let boolean = BooleanArray::from_slice(&[false, false, true, true]).boxed();
    |                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: change this to: `[false, false, true, true]`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow

Welcome to rust 1.65! We can fix them by cargo clippy --fix.

@RinChanNOWWW
Copy link
Contributor Author

62 files have clippy problems under rust 1.65 (83 files under rust 1.67). We can fix them in another pr.

@jorgecarleitao jorgecarleitao changed the title feat: A comparable row-oriented representation of a collection of [Array]. Added comparable row-oriented representation of a collection of [Array]. Nov 4, 2022
@jorgecarleitao jorgecarleitao merged commit 562de6a into jorgecarleitao:main Nov 4, 2022
@jorgecarleitao
Copy link
Owner

Thank you @tustvold for the contribution (eagerly await the blog post!), @RinChanNOWWW for the port, and @Xuanwo and @ritchie46 for reviewing it! 🙇

@alamb
Copy link
Collaborator

alamb commented Nov 9, 2022

Thank you @tustvold for the contribution (eagerly await the blog post!),

In case anyone missed it, here is the blog post about the format

https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/

@RinChanNOWWW
Copy link
Contributor Author

image

and is then followed by the unpadded length of this final block as a single byte in place of a continuation token

The demo in https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/ seems be wrong.

@alamb
Copy link
Collaborator

alamb commented Nov 9, 2022

Thanks for the report @RinChanNOWWW -- here is a PR to correct this apache/arrow-site#270

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is there a module like arrow-rs's row in arrow2?
6 participants