[FEAT] Add Sparse Tensor logical type #2722

Merged

Conversation

@michaelvay (Contributor) commented Aug 25, 2024

Closes #2494

It's still WIP; I wanted to get some feedback on the draft and see if I am on the right path.

Still having issues with:

  • When casting from COOSparseTensorArray to TensorArray, keeping the values dynamically typed. What I currently do is assume the type so that I can iterate over the values and insert the non-zero values at the relevant indices.

@github-actions bot added the enhancement (New feature or request) label on Aug 25, 2024
.unwrap()
.as_arrow()
.value(j);
}
michaelvay (Contributor Author):

Here I assume the type of the values; I'm not sure how to both edit the physical memory and keep it dynamically typed.

Contributor:

Not sure whether this is the idiomatic way to do it, but you can create a macro that accepts the inner_dtype and the primitive type it maps to, initializes the vector with that primitive, and downcasts the series to the array type, for example:

macro_rules! implement_cast_for_dense_with_inner_dtype {
    ($dtype:ty, $array_type:ty, $n_values:ident, $non_zero_indices_array:ident, $non_zero_values_array:ident, $offsets:ident) => {{
        // Fill a dense, zero-initialized buffer and scatter the non-zero values into it.
        let mut values = vec![0 as $dtype; $n_values];
        for (i, indices) in $non_zero_indices_array.into_iter().enumerate() {
            for j in 0..indices.unwrap().len() {
                let index = $non_zero_indices_array
                    .get(i)
                    .unwrap()
                    .u64()
                    .unwrap()
                    .as_arrow()
                    .value(j) as usize;
                let list_start_offset = $offsets.start_end(i).0;
                values[list_start_offset + index] = $non_zero_values_array
                    .get(i)
                    .unwrap()
                    .downcast::<$array_type>()
                    .unwrap()
                    .as_arrow()
                    .value(j);
            }
        }
        Box::new(arrow2::array::PrimitiveArray::from_vec(values))
    }};
}

impl COOSparseTensorArray {
    pub fn cast(&self, dtype: &DataType) -> DaftResult<Series> {
        match dtype {
            DataType::Tensor(inner_dtype) => {
                let non_zero_values_array = self.values_array();
                let non_zero_indices_array = self.indices_array();
                let shape_array = self.shape_array();
                // Compute the flattened element count of each row's tensor from its shape.
                let mut sizes_vec: Vec<usize> = vec![0; shape_array.len()];
                for (i, shape) in shape_array.into_iter().enumerate() {
                    match shape {
                        Some(shape) => {
                            let shape = shape.u64().unwrap().as_arrow();
                            let num_elements =
                                shape.values().clone().into_iter().product::<u64>() as usize;
                            sizes_vec[i] = num_elements;
                        }
                        _ => {}
                    }
                }
                let offsets: Offsets<i64> = Offsets::try_from_iter(sizes_vec.iter().cloned())?;
                let n_values = sizes_vec.iter().sum::<usize>();
                let item: Box<dyn arrow2::array::Array> = match inner_dtype.as_ref() {
                    DataType::Float32 => implement_cast_for_dense_with_inner_dtype!(
                        f32, Float32Array, n_values, non_zero_indices_array, non_zero_values_array, offsets
                    ),
                    DataType::Int64 => implement_cast_for_dense_with_inner_dtype!(
                        i64, Int64Array, n_values, non_zero_indices_array, non_zero_values_array, offsets
                    ),
                    _ => panic!("unsupported inner dtype for sparse tensor cast: {:?}", inner_dtype),
                };
                let list_arr = ListArray::new(
                    Field::new(
                        "data",
                        DataType::List(Box::new(inner_dtype.as_ref().clone())),
                    ),
                    Series::try_from((
                        "item",
                        item,
                    ))?,
                    offsets.into(),
                    None,
                ).into_series();
                let physical_type = dtype.to_physical();
                let struct_array = StructArray::new(
                    Field::new(self.name(), physical_type),
                    vec![list_arr, shape_array.clone().into_series()],
                    None
                );
                Ok(
                    TensorArray::new(Field::new(self.name(), dtype.clone()), struct_array)
                        .into_series(),
                )
            }
            _ => self.physical.cast(dtype),
        }
    }
}

codspeed-hq bot commented Aug 27, 2024

CodSpeed Performance Report

Merging #2722 will not alter performance

Comparing michaelvay:michaelva/sparse-tensor-logical-type (ae722ae) with main (b0f31e3)

Summary

✅ 7 untouched benchmarks

🆕 10 new benchmarks
⁉️ 10 dropped benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | michaelvay:michaelva/sparse-tensor-logical-type | Change |
| --- | --- | --- | --- |
| 🆕 test_tpch[1-in-memory-native-10] | N/A | 197 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-1] | N/A | 450.1 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-2] | N/A | 96.1 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-3] | N/A | 138.6 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-4] | N/A | 143.1 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-5] | N/A | 381.8 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-6] | N/A | 29 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-7] | N/A | 134.4 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-8] | N/A | 339.3 ms | N/A |
| 🆕 test_tpch[1-in-memory-native-9] | N/A | 366.5 ms | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:10] | 192.4 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:1] | 445.7 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:2] | 96.3 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:3] | 140 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:4] | 142.7 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:5] | 377.1 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:6] | 29.8 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:7] | 129.9 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:8] | 332.3 ms | N/A | N/A |
| ⁉️ test_tpch[gen_tpch:1-get_df:in-memory-engine:native-q:9] | 366 ms | N/A | N/A |

@samster25 (Member) commented:
Hi @michaelvay! Great to see the PR up! Is this ready for me to start taking a look?

@michaelvay (Contributor Author) commented:

> Hi @michaelvay! Great to see the PR up! Is this ready for me to start taking a look?

Hi @samster25, yes, it's ready. Let me know if anything is missing.

@universalmind303 (Contributor) left a comment:

Overall this looks pretty good. I don't think we need to expose what kind of sparse tensor this is, so I'd prefer dropping the COO prefix everywhere; adding a doc comment or comment somewhere explaining which kind of sparse tensor it is should suffice.
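For instance, a short doc comment along these lines could carry that information (the exact struct name and wording here are illustrative, not taken from the PR):

/// A logical type for sparse tensors.
///
/// The tensor is stored in COO (coordinate) format: for each row we keep the
/// flattened indices of the non-zero elements, their values, and the tensor's
/// shape, rather than materializing every element densely.
pub struct SparseTensorArray { /* physical struct array of indices, values, shape */ }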

@michaelvay (Contributor Author) commented:

@universalmind303 Thanks for the quick review! Is there anything else needed for this PR?

@universalmind303 (Contributor) commented:

> @universalmind303 Thanks for the quick review! Is there anything else needed for this PR?

No, this looks good to me. I'd like @samster25 or @jaychia to take a look at it as well, though; they are a bit more knowledgeable than me on this part of the codebase.

let data_iterator = self.data_array().into_iter();
let validity = self.data_array().validity();
let shape_and_data_iter = shape_iterator.zip(data_iterator);
let zero_series = Int64Array::from((
Member:

you should be able to do Int64Array::from(("item", [0].as_slice()))

if !is_valid {
// Handle invalid row by populating dummy data.
offsets.push(1);
non_zero_values.push(Series::empty("dummy", inner_dtype.as_ref()));
Member:

I believe you don't have to push anything here since it doesn't contribute to the final set of values in the series.

let is_valid = validity.map_or(true, |v| v.get_bit(i));
if !is_valid {
// Handle invalid row by populating dummy data.
offsets.push(1);
Member:

It doesn't seem like these offsets are used. What we would do here is push the count_so_far into this offset vec, which can then be converted directly into Offsets<i64>.
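A minimal, standalone sketch of that pattern (the row counts are made up, and converting the finished vector into arrow2's Offsets<i64> is assumed to go through its constructor API rather than shown):

// Build the offsets vector by pushing the running element count after each
// row, instead of pushing per-row lengths that never get used. Invalid rows
// contribute zero elements but still get an entry, so the offsets stay aligned
// with the validity bitmap.
fn cumulative_offsets(row_counts: &[i64]) -> Vec<i64> {
    let mut offsets = Vec::with_capacity(row_counts.len() + 1);
    let mut count_so_far: i64 = 0;
    offsets.push(count_so_far);
    for n in row_counts {
        count_so_far += n;
        offsets.push(count_so_far);
    }
    offsets // e.g. cumulative_offsets(&[3, 0, 2]) == [0, 3, 3, 5]
}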

@michaelvay (Contributor Author) commented Sep 9, 2024:

Correct, the offsets and dummy values don't contribute anything to the final values in the series. I kept them in order to preserve the validity of the original dense tensor; if I drop them, the offsets and validity would not match.

non_zero_values_array: &ListArray,
offsets: &Offsets<i64>,
) -> DaftResult<Box<dyn arrow2::array::Array>> {
let item: Box<dyn arrow2::array::Array> = match inner_dtype {
Member:

For these types we have the with_match_numeric_daft_types macro. See

_ => with_match_numeric_daft_types!(comp_type, |$T| {

as an example.
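As a rough sketch of how that could collapse the per-dtype match arms in this cast (the helper fill_dense_values and the exact way $T is consumed inside the closure are assumptions for illustration, not the macro's verified API):

let item: Box<dyn arrow2::array::Array> = with_match_numeric_daft_types!(inner_dtype, |$T| {
    // $T is bound to the matching Daft numeric type, so a single arm covers
    // every numeric inner dtype instead of listing Float32, Int64, ... by hand.
    fill_dense_values::<$T>(n_values, non_zero_indices_array, non_zero_values_array, offsets)?
});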

continue;
}
for j in 0..indices.unwrap().len() {
let index = $non_zero_indices_array
Member:

We should move these

let index_array = $non_zero_indices_array
    .get(i)
    .unwrap()
    .u64()
    .unwrap();
let values_array = $non_zero_values_array
    .get(i)
    .unwrap()
    .downcast::<$array_type>()
    .unwrap()
    .as_arrow();

to the outer loop, and then iterate over the values via

for (idx, val) in index_array.into_iter().zip(values_array.into_iter()) {

}

Member:

This will let us avoid bounds checks and let this be vectorized.
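Putting the two suggested fragments together, a standalone sketch of the restructured inner loop might look roughly like this (plain arrow2 types stand in for the macro's $array_type, and the helper name scatter_row is made up):

use arrow2::array::PrimitiveArray;

// The per-row index and value arrays are fetched once in the outer loop and
// passed in, then scattered into the dense buffer by zipping them, so there is
// no per-element .get(i)/.value(j) and no bounds-checked indexing by j.
fn scatter_row(
    dense: &mut [f32],
    list_start_offset: usize,
    index_array: &PrimitiveArray<u64>,
    values_array: &PrimitiveArray<f32>,
) {
    for (idx, val) in index_array.iter().zip(values_array.iter()) {
        if let (Some(idx), Some(val)) = (idx, val) {
            dense[list_start_offset + *idx as usize] = *val;
        }
    }
}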

let validity = self.physical.validity();
let zero_series = Int64Array::from((
"item",
Box::new(arrow2::array::Int64Array::from_iter([Some(0)].iter())),
Member:

can do Int64Array::from((name, [0].as_slice()))

let is_valid = validity.map_or(true, |v| v.get_bit(i));
if !is_valid {
// Handle invalid row by populating dummy data.
offsets.push(1);
Member:

Same feedback as above about the offsets and eliding the empty Series.

@samster25 (Member) commented:

Hi @michaelvay, great work! Just left some minor feedback, but overall it looks great!

@samster25 (Member) left a review:

Looks great, amazing first contribution! (I think the biggest one we have seen so far!) @universalmind303, do you mind helping out to resolve the merge conflicts? For context, we had a clean-up week, which might have been the cause of the conflicts.

@jaychia (Contributor) commented Sep 23, 2024:

Thanks @michaelvay! Looks like we might be good for merge after conflict resolution?

@samster25 merged commit d5b9a95 into Eventual-Inc:main on Sep 23, 2024 (35 of 36 checks passed).
@samster25 (Member) commented:
Just merged! Thanks @michaelvay for all your hard work :)

Labels: enhancement (New feature or request)
Linked issue: Daft Sparse Tensor Logical Type