Primitive Iterator API #689

Merged
merged 49 commits into from
Aug 30, 2024

Contributor

@AdamGS AdamGS commented Aug 27, 2024

This PR includes a new batched iterator API that yields a Vec<T> and validity data for batches of items from arrays. It can dynamically dispatch over the type of the underlying arrays, so recursive compression should hold.
The PR includes implementations for primitive arrays, as well as ALP as a test case for more complex encodings, and constant arrays, since that case is fairly trivial.

I also added a bunch of benchmarks for the various pieces it touches, to get some initial performance numbers and test basic correctness.
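
As a rough illustration of the batched shape described above (a vector of values plus validity per batch), here is a minimal Rust sketch. `Batch`, `BatchIter`, and the `Option<T>` slice used as input are hypothetical stand-ins chosen for the example; they are not the types this PR adds, and the real API dispatches over encodings rather than a plain slice.

```rust
/// One batch of decoded values plus a per-item validity flag.
/// Hypothetical type for illustration, not the PR's definition.
#[derive(Debug, PartialEq)]
pub struct Batch<T> {
    pub values: Vec<T>,
    pub validity: Vec<bool>,
}

/// Walks a slice of nullable values in fixed-size batches.
/// The slice stands in for an encoded array (assumption).
pub struct BatchIter<'a, T> {
    data: &'a [Option<T>],
    batch_size: usize,
    offset: usize,
}

impl<'a, T: Copy + Default> BatchIter<'a, T> {
    pub fn new(data: &'a [Option<T>], batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self { data, batch_size, offset: 0 }
    }
}

impl<'a, T: Copy + Default> Iterator for BatchIter<'a, T> {
    type Item = Batch<T>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.offset >= self.data.len() {
            return None;
        }
        let end = (self.offset + self.batch_size).min(self.data.len());
        let chunk = &self.data[self.offset..end];
        self.offset = end;
        Some(Batch {
            // Null slots get a default placeholder; validity marks them.
            values: chunk.iter().copied().map(Option::unwrap_or_default).collect(),
            validity: chunk.iter().map(Option::is_some).collect(),
        })
    }
}
```

Note that this sketch allocates two vectors per batch, which is the simplest form of the idea rather than the cheapest one.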

@AdamGS AdamGS changed the title [WIP] Iterator API Primitive Iterator API Aug 28, 2024
@AdamGS AdamGS requested a review from robert3005 August 28, 2024 14:02
Member

@robert3005 robert3005 left a comment

We need to get rid of the per-batch vector allocation. Other than that, some of the bounds are redundant.
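
One way to avoid allocating a vector per batch is to keep a single buffer inside the iterator and hand out a borrowed slice on each step. The sketch below assumes a hypothetical `ReusingBatcher` whose "decoding" is just widening `i32` to `i64` and doubling; it is meant to show the buffer reuse only, not how this PR ended up addressing the comment.

```rust
/// Decodes a source slice batch by batch into one reusable buffer.
/// Hypothetical type; the doubling step stands in for real decoding.
pub struct ReusingBatcher<'a> {
    source: &'a [i32],
    buffer: Vec<i64>, // allocated once, reused for every batch
    batch_size: usize,
    offset: usize,
}

impl<'a> ReusingBatcher<'a> {
    pub fn new(source: &'a [i32], batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self {
            source,
            buffer: Vec::with_capacity(batch_size),
            batch_size,
            offset: 0,
        }
    }

    /// Returns the next batch as a slice borrowed from the internal buffer.
    /// The borrow ends before the next call, so the buffer can be overwritten.
    pub fn next_batch(&mut self) -> Option<&[i64]> {
        if self.offset >= self.source.len() {
            return None;
        }
        let end = (self.offset + self.batch_size).min(self.source.len());
        self.buffer.clear(); // keeps the capacity, drops the old contents
        self.buffer
            .extend(self.source[self.offset..end].iter().map(|&v| i64::from(v) * 2));
        self.offset = end;
        Some(self.buffer.as_slice())
    }

    pub fn buffer_capacity(&self) -> usize {
        self.buffer.capacity()
    }
}
```

The trade-off is that this no longer fits `std::iter::Iterator`, whose items cannot borrow from the iterator itself, so callers drive it with `while let Some(batch) = batcher.next_batch()`.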

encodings/alp/src/array.rs
Comment on lines 314 to 320
    match self.dtype() {
        DType::Primitive(PType::I64, _) => {
            let accessor = Arc::new(self.clone());
            Some(accessor)
        }
        _ => None,
    }
Contributor

this repeated unwrapping can probably be shoved into a macro, but doesn't have to be

Contributor Author

introduced the primitive_accessor_ref macro
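
For readers who have not seen the pattern, a macro that folds the repeated match into one line per type could look roughly like the sketch below. Everything here (`ToyArray`, this `PType` enum, and `accessor_ref_sketch`) is hypothetical and only mirrors the shape of the snippet under review; it is not the `primitive_accessor_ref` macro from the PR.

```rust
use std::sync::Arc;

/// Toy primitive type tag (assumption, not vortex's PType).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PType {
    I32,
    I64,
    F64,
}

/// Toy array that only records its element type and length.
#[derive(Clone, Debug)]
pub struct ToyArray {
    pub ptype: PType,
    pub len: usize,
}

/// Expands to a method that returns a shared handle to the array
/// only when its element type matches the given variant.
macro_rules! accessor_ref_sketch {
    ($name:ident, $variant:ident) => {
        pub fn $name(&self) -> Option<Arc<ToyArray>> {
            match self.ptype {
                PType::$variant => Some(Arc::new(self.clone())),
                _ => None,
            }
        }
    };
}

impl ToyArray {
    // One line per type instead of one hand-written match per type.
    accessor_ref_sketch!(i32_accessor, I32);
    accessor_ref_sketch!(i64_accessor, I64);
    accessor_ref_sketch!(f64_accessor, F64);
}
```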

@@ -25,6 +25,7 @@ vortex-scalar = { workspace = true }

[dev-dependencies]
divan = { workspace = true }
arrow = { workspace = true }
Contributor

you can probably disable the default-features for this import since you don't use ipc/json/csv encoding. probably saves a bit of compile time 🤷

Contributor Author

clippy wants me to disable it at the top-level and then have every crate in the workspace pull its own members, which is probably a good idea

@@ -0,0 +1,100 @@
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
Contributor

These are the results on my laptop:

std_iter_no_option      time:   [46.625 µs 47.232 µs 47.982 µs]
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

std_iter                time:   [394.24 µs 398.92 µs 403.92 µs]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

vortex_iter             time:   [746.95 µs 750.73 µs 754.90 µs]

vortex_iter_flat        time:   [2.6330 ms 2.6477 ms 2.6635 ms]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

arrow_iter              time:   [44.055 µs 44.165 µs 44.300 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

I'm surprised the Arrow iter is so much faster (and faster than std::iter). My guess is the assertions and alignment checking in our PrimitiveArray make up the difference?

Member

Arrow doesn't actually iterate over Option<T>, just over T, and since there are no nulls in this case, Arrow's iterator is the same as iterating a Vec. To get to that level you need monomorphisation of every function call.
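
To make the no-nulls point concrete, here is a small sketch of a consumer that checks for validity once and then takes a plain-slice path when there is none. `sum_nullable` and its `Option<&[bool]>` validity argument are assumptions made for this example; neither Arrow nor this PR is claimed to expose this exact function.

```rust
/// Sums the valid entries of `values`.
/// `validity` is `None` when the data is known to contain no nulls.
pub fn sum_nullable(values: &[i64], validity: Option<&[bool]>) -> i64 {
    match validity {
        // Fast path: no nulls, so this is just iterating a slice of T.
        None => values.iter().sum::<i64>(),
        // Slow path: every element pays for a validity check.
        Some(valid) => {
            let mut total = 0;
            for (v, ok) in values.iter().zip(valid) {
                if *ok {
                    total += *v;
                }
            }
            total
        }
    }
}
```

Because the fast path is a monomorphic loop over `&[i64]`, the compiler is free to inline and vectorise it, which is harder once each step goes through `Option<T>` or a dynamically dispatched call.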

Contributor Author

What @robert3005 said, spent a lot of time trying to understand everything they do and while I can't say I have full clarity, Arrow's overall design does make it easier to have fast iterators - both having fully typed arrays and not having to support compression.
I am hopeful that eventually we'll find a more performant way of iterating arrays, but just having a much smaller memory footprint should be useful IMO.

@AdamGS AdamGS requested review from robert3005 and a10y August 30, 2024 10:26
Contributor

@a10y a10y left a comment

Looks good!

@AdamGS AdamGS enabled auto-merge (squash) August 30, 2024 13:35
@AdamGS AdamGS disabled auto-merge August 30, 2024 13:35
@AdamGS AdamGS enabled auto-merge (squash) August 30, 2024 13:35
@AdamGS AdamGS merged commit 7c017cb into develop Aug 30, 2024
4 checks passed
@AdamGS AdamGS deleted the adamg/iter-2 branch August 30, 2024 15:37