[FEAT] Add Sparse Tensor logical type #2722

michaelvay · 2024-08-25T07:34:14Z

It's still WIP; I wanted to get some feedback on the draft and see if I'm on the right path.

Still having issues with:

Keeping the values dynamically typed when casting from COOSparseTensorArray to TensorArray. What I currently do is assume the type so that I can iterate over the values and insert the non-zero values at the relevant indices.

michaelvay · 2024-08-25T07:51:41Z

src/daft-core/src/array/ops/cast.rs

+                            .unwrap()
+                            .as_arrow()
+                            .value(j);
+                    }


Here I assume the type of the values. I'm not sure how to edit the physical memory while keeping it dynamically typed.

GuyPozner · 2024-08-26

I'm not sure it's the idiomatic way to do it, but you can create a macro that accepts the inner_dtype and the primitive type it maps to, initializes the vector with the primitive type, and downcasts the series to the array type. For example:

macro_rules! implement_cast_for_dense_with_inner_dtype {
    ($dtype:ty, $array_type:ty, $n_values:ident, $non_zero_indices_array:ident, $non_zero_values_array:ident, $offsets:ident) => {{
        let mut values = vec![0 as $dtype; $n_values];
        for (i, indices) in $non_zero_indices_array.into_iter().enumerate() {
            for j in 0..indices.unwrap().len() {
                let index = $non_zero_indices_array
                    .get(i)
                    .unwrap()
                    .u64()
                    .unwrap()
                    .as_arrow()
                    .value(j) as usize;
                let list_start_offset = $offsets.start_end(i).0;
                values[list_start_offset + index] = $non_zero_values_array
                    .get(i)
                    .unwrap()
                    .downcast::<$array_type>()
                    .unwrap()
                    .as_arrow()
                    .value(j);
            }
        }
        Box::new(arrow2::array::PrimitiveArray::from_vec(values))
    }};
}

impl COOSparseTensorArray {
    pub fn cast(&self, dtype: &DataType) -> DaftResult<Series> {
        match dtype {
            DataType::Tensor(inner_dtype) => {
                let non_zero_values_array = self.values_array();
                let non_zero_indices_array = self.indices_array();
                let shape_array = self.shape_array();
                let mut sizes_vec: Vec<usize> = vec![0; shape_array.len()];
                for (i, shape) in shape_array.into_iter().enumerate() {
                    match shape {
                        Some(shape) => {
                            let shape = shape.u64().unwrap().as_arrow();
                            let num_elements =
                                shape.values().clone().into_iter().product::<u64>() as usize;
                            sizes_vec[i] = num_elements;
                        }
                        _ => {}
                    }
                }
                let offsets: Offsets<i64> = Offsets::try_from_iter(sizes_vec.iter().cloned())?;
                let n_values = sizes_vec.iter().sum::<usize>() as usize;
                let item: Box<dyn arrow2::array::Array> = match inner_dtype.as_ref() {
                    DataType::Float32 => implement_cast_for_dense_with_inner_dtype!(
                        f32, Float32Array, n_values, non_zero_indices_array, non_zero_values_array, offsets
                    ),
                    DataType::Int64 => implement_cast_for_dense_with_inner_dtype!(
                        i64, Int64Array, n_values, non_zero_indices_array, non_zero_values_array, offsets
                    ),
                    _ => panic!("Hi"),
                };
                let list_arr = ListArray::new(
                    Field::new(
                        "data",
                        DataType::List(Box::new(inner_dtype.as_ref().clone())),
                    ),
                    Series::try_from(("item", item))?,
                    offsets.into(),
                    None,
                )
                .into_series();
                let physical_type = dtype.to_physical();
                let struct_array = StructArray::new(
                    Field::new(self.name(), physical_type),
                    vec![list_arr, shape_array.clone().into_series()],
                    None,
                );
                Ok(
                    TensorArray::new(Field::new(self.name(), dtype.clone()), struct_array)
                        .into_series(),
                )
            }
            (_) => self.physical.cast(dtype),
        }
    }
}
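For readers without the Daft types at hand, the densification step in the suggestion above can be sketched in plain Rust. This is only an illustration under stated assumptions: `coo_to_dense` is a hypothetical name, plain `Vec`s stand in for the Daft/arrow2 arrays, and a generic parameter `T` stands in for the per-dtype macro expansion. It is not the code that was merged.

```rust
/// Hypothetical helper (not part of Daft): scatter COO rows into one flat,
/// zero-filled dense buffer.
///
/// * `shapes[i]`  - dimensions of tensor `i`
/// * `indices[i]` - flat positions of the non-zero entries of tensor `i`
/// * `values[i]`  - the non-zero entries themselves, same length as `indices[i]`
///
/// Returns the dense buffer and the start offset of every tensor in it
/// (one extra trailing entry holds the total length).
fn coo_to_dense<T: Default + Copy>(
    shapes: &[Vec<u64>],
    indices: &[Vec<u64>],
    values: &[Vec<T>],
) -> (Vec<T>, Vec<usize>) {
    // The dense size of each tensor is the product of its dimensions.
    let mut total = 0usize;
    let mut offsets = vec![0usize];
    for shape in shapes {
        total += shape.iter().product::<u64>() as usize;
        offsets.push(total);
    }

    // Start from all-default ("zero") values, then write the non-zeros in place.
    let mut dense = vec![T::default(); total];
    for (row, (row_indices, row_values)) in indices.iter().zip(values).enumerate() {
        let start = offsets[row];
        for (&idx, &val) in row_indices.iter().zip(row_values) {
            dense[start + idx as usize] = val;
        }
    }
    (dense, offsets)
}
```

Calling it with shapes `[2, 2]` and `[3]` yields a buffer of length 7 whose second tensor starts at offset 4. The generic bound `Default + Copy` plays the role that `0 as $dtype` plays in the macro.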

codspeed-hq · 2024-08-27T13:59:32Z

CodSpeed Performance Report

Merging #2722 will not alter performance

Comparing `michaelvay:michaelva/sparse-tensor-logical-type` (ae722ae) with `main` (b0f31e3)

Summary

✅ 7 untouched benchmarks

🆕 10 new benchmarks
⁉️ 10 dropped benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

|  | Benchmark | `main` | `michaelvay:michaelva/sparse-tensor-logical-type` | Change |
|---|---|---|---|---|
| 🆕 | `test_tpch[1-in-memory-native-10]` | N/A | 197 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-1]` | N/A | 450.1 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-2]` | N/A | 96.1 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-3]` | N/A | 138.6 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-4]` | N/A | 143.1 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-5]` | N/A | 381.8 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-6]` | N/A | 29 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-7]` | N/A | 134.4 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-8]` | N/A | 339.3 ms | N/A |
| 🆕 | `test_tpch[1-in-memory-native-9]` | N/A | 366.5 ms | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:10]` | 192.4 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:1]` | 445.7 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:2]` | 96.3 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:3]` | 140 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:4]` | 142.7 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:5]` | 377.1 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:6]` | 29.8 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:7]` | 129.9 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:8]` | 332.3 ms | N/A | N/A |
| ⁉️ | `test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:9]` | 366 ms | N/A | N/A |

samster25 · 2024-08-29T08:03:06Z

Hi @michaelvay! Great to see the PR up! Is this ready for me to start taking a look?

michaelvay · 2024-09-01T05:28:53Z

> Hi @michaelvay! Great to see the PR up! Is this ready for me to start taking a look?

Hi @samster25, yes, it's ready. Let me know if anything is missing.

universalmind303

Overall this looks pretty good. I think we don't need to expose what kind of sparse tensor this is, so I'd prefer dropping the COO prefix everywhere. Adding a doc comment or a comment somewhere explaining what kind of sparse tensor this is should suffice.

michaelvay · 2024-09-05T15:09:35Z

@universalmind303 Thanks for the quick review! Is there anything else needed for this PR?

universalmind303 · 2024-09-05T23:58:08Z

> @universalmind303 Thanks for the quick review! Is there anything else needed for this PR?

No, this looks good to me. I'd like @samster25 or @jaychia to take a look at it as well, though. They are a bit more knowledgeable than me about this part of the codebase.

samster25 · 2024-09-06T00:30:21Z

src/daft-core/src/array/ops/cast.rs

+                let data_iterator = self.data_array().into_iter();
+                let validity = self.data_array().validity();
+                let shape_and_data_iter = shape_iterator.zip(data_iterator);
+                let zero_series = Int64Array::from((


You should be able to do `Int64Array::from(("item", [0].as_slice()))`.

samster25 · 2024-09-06T00:35:33Z

src/daft-core/src/array/ops/cast.rs

+                    if !is_valid {
+                        // Handle invalid row by populating dummy data.
+                        offsets.push(1);
+                        non_zero_values.push(Series::empty("dummy", inner_dtype.as_ref()));


I believe you don't have to push anything here since it doesn't contribute to the final set of values in the series.

samster25 · 2024-09-06T00:37:20Z

src/daft-core/src/array/ops/cast.rs

+                    let is_valid = validity.map_or(true, |v| v.get_bit(i));
+                    if !is_valid {
+                        // Handle invalid row by populating dummy data.
+                        offsets.push(1);


It doesn't seem like these offsets are used. What we would do here is push the count_so_far into this offset vec, which can then be converted directly into Offsets<i64>.

michaelvay · 2024-09-09

Correct, the offsets and dummy values don't contribute anything to the final values in the series. I kept them in order to preserve the validity of the original dense tensor. If I drop them, the offsets and validity would not match.
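One way to read the two comments together is that a null row can still receive an offset entry without contributing any values, which keeps the offsets and the validity mask aligned row for row. The sketch below shows that idea with plain vectors. `build_offsets` is a hypothetical helper, not a Daft or arrow2 function; it assumes Arrow-style cumulative list offsets (one more entry than there are rows) and is not necessarily what the PR ended up doing.

```rust
/// Hypothetical helper: build cumulative list offsets plus a validity mask
/// from per-row lengths, where `None` marks a null row.
///
/// A null row repeats the running count, so it has length zero in the
/// flattened values while still occupying one slot in both outputs.
fn build_offsets(row_lengths: &[Option<usize>]) -> (Vec<i64>, Vec<bool>) {
    let mut offsets = Vec::with_capacity(row_lengths.len() + 1);
    let mut validity = Vec::with_capacity(row_lengths.len());
    let mut count_so_far: i64 = 0;

    // Offsets always start at zero and end at the total value count.
    offsets.push(count_so_far);
    for len in row_lengths {
        match len {
            Some(n) => {
                count_so_far += *n as i64;
                validity.push(true);
            }
            // Null row: no values are appended, only the mask records it.
            None => validity.push(false),
        }
        offsets.push(count_so_far);
    }
    (offsets, validity)
}
```

With this layout `offsets.len() == validity.len() + 1` always holds, so no dummy value has to be pushed for a null row.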

samster25 · 2024-09-06T00:40:03Z

src/daft-core/src/array/ops/cast.rs

+    non_zero_values_array: &ListArray,
+    offsets: &Offsets<i64>,
+) -> DaftResult<Box<dyn arrow2::array::Array>> {
+    let item: Box<dyn arrow2::array::Array> = match inner_dtype {


For these types we have the `with_match_numeric_daft_types` macro. See Daft/src/daft-core/src/series/ops/between.rs, line 36 in 6fe408c, as an example:

_ => with_match_numeric_daft_types!(comp_type, |$T| {
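The general pattern behind such a macro is to match on a runtime type tag and bind a type alias in each arm, so a single body is compiled once per numeric type. The following is a minimal stand-alone sketch of that pattern. `NumericType`, `with_match_numeric!` and `element_size` are made-up names for illustration; Daft's real macro covers more types and has its own syntax, which this sketch does not reproduce.

```rust
/// Hypothetical runtime type tag standing in for a numeric DataType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NumericType {
    Int64,
    Float32,
    Float64,
}

/// Hypothetical dispatch macro: binds `$T` to the concrete Rust type that
/// corresponds to the runtime tag, then evaluates `$body` with it in scope.
macro_rules! with_match_numeric {
    ($key:expr, | $T:ident | $body:block) => {
        match $key {
            NumericType::Int64 => {
                type $T = i64;
                $body
            }
            NumericType::Float32 => {
                type $T = f32;
                $body
            }
            NumericType::Float64 => {
                type $T = f64;
                $body
            }
        }
    };
}

/// Example use: the body is written once and works for every numeric tag.
fn element_size(dtype: NumericType) -> usize {
    with_match_numeric!(dtype, |T| { std::mem::size_of::<T>() })
}
```

The benefit over a hand-written `match` per call site is that adding a numeric type only requires touching the macro.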

samster25 · 2024-09-06T00:43:32Z

src/daft-core/src/array/ops/cast.rs

+                continue;
+            }
+            for j in 0..indices.unwrap().len() {
+                let index = $non_zero_indices_array


We should move these

let index_array = $non_zero_indices_array
    .get(i)
    .unwrap()
    .u64()
    .unwrap();
let values_array = $non_zero_values_array
    .get(i)
    .unwrap()
    .downcast::<$array_type>()
    .unwrap()
    .as_arrow();

to the outer loop, and then iterate over the values via

for (idx, val) in index_array.into_iter().zip(values_array.into_iter()) {

}

This will let us avoid bounds checks and let this be vectorized.
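The difference between the two loop shapes can be seen on plain slices. The sketch below uses hypothetical function names and `f32` slices in place of the downcast arrow arrays. It only illustrates the refactoring, and makes no claim about how much the compiler actually vectorizes in the real code.

```rust
/// Indexed form: both input slices are re-indexed on every iteration, so each
/// read of `indices[j]` and `values[j]` is a separate indexing operation.
fn scatter_indexed(dense: &mut [f32], indices: &[u64], values: &[f32]) {
    for j in 0..indices.len() {
        dense[indices[j] as usize] = values[j];
    }
}

/// Zipped form: the two inputs are walked in lockstep by iterators, so the
/// reads need no per-element index checks. The write into `dense` is still
/// checked, because its position comes from the data.
fn scatter_zipped(dense: &mut [f32], indices: &[u64], values: &[f32]) {
    for (&idx, &val) in indices.iter().zip(values) {
        dense[idx as usize] = val;
    }
}
```

Both functions produce the same dense row. One behavioral difference is worth noting: `zip` stops at the shorter input, whereas the indexed form panics if `values` is shorter than `indices`.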

samster25 · 2024-09-06T00:50:16Z

src/daft-core/src/array/ops/cast.rs

+                let validity = self.physical.validity();
+                let zero_series = Int64Array::from((
+                    "item",
+                    Box::new(arrow2::array::Int64Array::from_iter([Some(0)].iter())),


Can do `Int64Array::from((name, [0].as_slice()))`.

samster25 · 2024-09-06T00:50:50Z

src/daft-core/src/array/ops/cast.rs

+                    let is_valid = validity.map_or(true, |v| v.get_bit(i));
+                    if !is_valid {
+                        // Handle invalid row by populating dummy data.
+                        offsets.push(1);


Same feedback about offsets and eliding the empty Series as above.

samster25 · 2024-09-06T00:51:36Z

Hi @michaelvay, great work! Just left some minor feedback, but overall it looks great!

samster25

Looks great, Amazing first contribution! (I think the biggest we have seen so far!) @universalmind303 Do you mind helping out to resolve the merge conflicts? For context, we had a clean-up week which might have been the cause for conflicts.

jaychia · 2024-09-23T20:16:54Z

Thanks @michaelvay! Looks like we might be good for merge after conflict resolution?

samster25 · 2024-09-23T20:51:46Z

Just merged! Thanks @michaelvay for all your hard work :)

WIP: sparse - dense tensor roundtip test

1b606e2

github-actions bot added the enhancement New feature or request label Aug 25, 2024

michaelvay commented Aug 25, 2024

michaelvay added 3 commits August 26, 2024 15:07

fix None handlng for COOSparseTensor

4c58ffb

extend casting coo sparse tensor to tensor for all arrow numeric types

8d356c6

update dtype panic msg when casting sparse to dense

abe8251

Add casting for coo sparse tensors

6fab44c

Add sparse tensor parquet roundtrip test, add repr test

fff1c3f

Merge branch 'main' into michaelva/sparse-tensor-logical-type

105793a

universalmind303 reviewed Sep 4, 2024

michaelvay added 2 commits September 5, 2024 14:34

Remove COO prefix from sparse tensors, note coo representation in py docstring

75414ee

fix sparse tensro repr test

37a1a5e

samster25 reviewed Sep 6, 2024

michaelvay added 2 commits September 12, 2024 11:44

PR changes: use with_match_numeric_daft_types, cleaner zeros_series init

4853057

Merge branch 'main' into michaelva/sparse-tensor-logical-type

b8220d8

universalmind303 requested a review from samster25 September 16, 2024 19:21

samster25 approved these changes Sep 19, 2024

michaelvay added 2 commits September 22, 2024 16:36

Fix fixed_shape_sparse_tensor to fixed_shape_tensor cast, add missing sparse tensor api

59c9fd5

Merge branch 'main' into michaelva/sparse-tensor-logical-type

ae722ae

samster25 approved these changes Sep 23, 2024

samster25 merged commit d5b9a95 into Eventual-Inc:main Sep 23, 2024
35 of 36 checks passed
