Primitive Iterator API #689
Conversation
We need to get rid of the per-batch vector allocation. Otherwise, there are some bounds that are redundant.
match self.dtype() {
    DType::Primitive(PType::I64, _) => {
        let accessor = Arc::new(self.clone());
        Some(accessor)
    }
    _ => None,
}
this repeated unwrapping can probably be shoved into a macro, but doesn't have to be
Introduced the primitive_accessor_ref macro.
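As a rough illustration (not the actual definition from the PR), a macro along these lines could collapse the repeated dtype match into a single invocation per PType:

```rust
// Illustrative sketch only; the real primitive_accessor_ref macro in the PR
// may differ. It wraps the repeated dtype-match/clone pattern shown above,
// using vortex's DType/PType and std::sync::Arc.
macro_rules! primitive_accessor_ref {
    ($self:expr, $ptype:ident) => {
        match $self.dtype() {
            DType::Primitive(PType::$ptype, _) => Some(Arc::new($self.clone())),
            _ => None,
        }
    };
}

// Hypothetical usage, replacing the hand-written match for each primitive type:
// let accessor = primitive_accessor_ref!(self, I64);
```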
@@ -25,6 +25,7 @@ vortex-scalar = { workspace = true }

[dev-dependencies]
divan = { workspace = true }
arrow = { workspace = true }
you can probably disable the default-features for this import since you don't use ipc/json/csv encoding. probably saves a bit of compile time 🤷
clippy wants me to disable it at the top-level and then have every crate in the workspace pull its own members, which is probably a good idea
@@ -0,0 +1,100 @@
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
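For context, this is roughly the shape of a Criterion benchmark like the ones producing the numbers below; the data setup and the actual benchmark bodies in the PR will differ:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Illustrative shape only: iterate the same data with and without Option
// wrapping. The PR's real benchmarks also cover the vortex and arrow iterators.
fn bench_iter(c: &mut Criterion) {
    let data: Vec<i64> = (0..100_000).collect();

    // Plain slice iteration, no Option involved.
    c.bench_function("std_iter_no_option", |b| {
        b.iter(|| black_box(&data).iter().copied().sum::<i64>())
    });

    // The same values wrapped in Option, mimicking a nullable iterator.
    c.bench_function("std_iter", |b| {
        b.iter(|| black_box(&data).iter().copied().map(Some).flatten().sum::<i64>())
    });
}

criterion_group!(benches, bench_iter);
criterion_main!(benches);
```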
These are the results on my laptop:
std_iter_no_option time: [46.625 µs 47.232 µs 47.982 µs]
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
std_iter time: [394.24 µs 398.92 µs 403.92 µs]
Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
6 (6.00%) high severe
vortex_iter time: [746.95 µs 750.73 µs 754.90 µs]
vortex_iter_flat time: [2.6330 ms 2.6477 ms 2.6635 ms]
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
arrow_iter time: [44.055 µs 44.165 µs 44.300 µs]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
I'm surprised the Arrow iter is so much faster (and faster than std::iter). My guess is the assertions and alignment checking in our PrimitiveArray make up the difference?
Arrow doesn't actually iterate over Option, just over T, and in this case there are no nulls, so Arrow is the same as iterating a Vec. To get to that level you need monomorphisation of every function call.
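A small sketch of that distinction, assuming the arrow crate's Int64Array: the nullable iterator yields Option<i64>, while values() exposes the raw buffer, so with no nulls the latter is just slice iteration:

```rust
use arrow::array::Int64Array;

// Illustrative: the two iteration paths being compared. With no nulls, the
// values() path is just iterating &[i64], like a Vec<i64>.
fn sum_both(array: &Int64Array) -> (i64, i64) {
    // Raw buffer: plain slice iteration, no Option involved.
    let raw: i64 = array.values().iter().sum();
    // Nullable iterator: each element is an Option<i64>.
    let nullable: i64 = array.iter().flatten().sum();
    (raw, nullable)
}
```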
What @robert3005 said. I spent a lot of time trying to understand everything they do, and while I can't say I have full clarity, Arrow's overall design does make it easier to have fast iterators: both having fully typed arrays and not having to support compression.
I am hopeful that eventually we'll find a more performant way of iterating arrays, but just having a much smaller memory footprint should be useful IMO.
Looks good!
This PR includes a new batched iterator API that yields a Vec<T> and validity data for batches of items from arrays. It can dynamically dispatch over the type of the underlying arrays, so recursive compression should hold. The PR includes implementations for primitive arrays, as well as ALP as a test case for more complex encodings, and constant arrays, since they are fairly trivial.
I also added a bunch of benchmarks to the various pieces it touches, to get some initial performance numbers and test basic correctness.
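For a concrete picture, here is a hypothetical shape such a batched API could take; the names and signatures are illustrative, not the ones actually introduced by the PR:

```rust
// Hypothetical sketch: a batched iterator that yields owned values plus
// per-element validity for each batch. Names are illustrative only.
pub struct Batch<T> {
    pub values: Vec<T>,
    pub validity: Vec<bool>,
}

pub trait BatchedIterator {
    type Item;

    /// Returns the next batch of values and their validity, or None when
    /// the underlying array is exhausted.
    fn next_batch(&mut self) -> Option<Batch<Self::Item>>;
}
```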