-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add Sparse Tensor logical type #2722
[FEAT] Add Sparse Tensor logical type #2722
Conversation
src/daft-core/src/array/ops/cast.rs
Outdated
.unwrap() | ||
.as_arrow() | ||
.value(j); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I assume the type of the values not sure how to both be able to edit the physical memory and keep it dynamically typed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it is the idiomatic way to do it but, you can create a macro that accepts the inner_dtype and the primitive type it is mapped to, initialize the vector with the primitive and downcast the series with the array type, for example:
macro_rules! implement_cast_for_dense_with_inner_dtype {
($dtype:ty, $array_type:ty, $n_values:ident, $non_zero_indices_array:ident, $non_zero_values_array:ident, $offsets:ident) => {{
let mut values = vec![0 as $dtype; $n_values];
for (i, indices) in $non_zero_indices_array.into_iter().enumerate() {
for j in 0..indices.unwrap().len() {
let index = $non_zero_indices_array
.get(i)
.unwrap()
.u64()
.unwrap()
.as_arrow()
.value(j) as usize;
let list_start_offset = $offsets.start_end(i).0;
values[list_start_offset + index] = $non_zero_values_array
.get(i)
.unwrap()
.downcast::<$array_type>()
.unwrap()
.as_arrow()
.value(j);
}
}
Box::new(arrow2::array::PrimitiveArray::from_vec(values))
}};
}
impl COOSparseTensorArray {
pub fn cast(&self, dtype: &DataType) -> DaftResult<Series> {
match dtype {
DataType::Tensor(inner_dtype) => {
let non_zero_values_array = self.values_array();
let non_zero_indices_array = self.indices_array();
let shape_array = self.shape_array();
let mut sizes_vec: Vec<usize> = vec![0; shape_array.len()];
for (i, shape) in shape_array.into_iter().enumerate() {
match shape {
Some(shape) => {
let shape = shape.u64().unwrap().as_arrow();
let num_elements =
shape.values().clone().into_iter().product::<u64>() as usize;
sizes_vec[i] = num_elements;
}
_ => {}
}
}
let offsets: Offsets<i64> = Offsets::try_from_iter(sizes_vec.iter().cloned())?;
let n_values = sizes_vec.iter().sum::<usize>() as usize;
let item: Box<dyn arrow2::array::Array> = match inner_dtype.as_ref() {
DataType::Float32 => implement_cast_for_dense_with_inner_dtype!(f32, Float32Array, n_values, non_zero_indices_array, non_zero_values_array, offsets),
DataType::Int64 => implement_cast_for_dense_with_inner_dtype!(i64, Int64Array, n_values, non_zero_indices_array, non_zero_values_array, offsets),
_ => panic!("Hi")
};
let list_arr = ListArray::new(
Field::new(
"data",
DataType::List(Box::new(inner_dtype.as_ref().clone())),
),
Series::try_from((
"item",
item,
))?,
offsets.into(),
None,
).into_series();
let physical_type = dtype.to_physical();
let struct_array = StructArray::new(
Field::new(self.name(), physical_type),
vec![list_arr, shape_array.clone().into_series()],
None
);
Ok(
TensorArray::new(Field::new(self.name(), dtype.clone()), struct_array)
.into_series(),
)
}
(_) => self.physical.cast(dtype),
}
}
}
CodSpeed Performance ReportMerging #2722 will not alter performanceComparing Summary
Benchmarks breakdown
|
Hi @michaelvay! Great to see the PR up! Is this ready for me to start taking a look? |
Hi @samster25, yes its ready, let me know if anything is missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall looks pretty good, I think we don't need to expose what kind of sparse tensor this is, so I'd prefer dropping the coo
prefix everywhere. I think just adding a doccomment or comment somewhere explaining what kind of sparse tensor this is should suffice.
@universalmind303 Thanks for the quick review! Is there anything else needed for this PR? |
no this looks good to me, I'd like if @samster25 or @jaychia could take a look at it as well though. They are a bit more knowledgeable than me on this part of the codebase. |
src/daft-core/src/array/ops/cast.rs
Outdated
let data_iterator = self.data_array().into_iter(); | ||
let validity = self.data_array().validity(); | ||
let shape_and_data_iter = shape_iterator.zip(data_iterator); | ||
let zero_series = Int64Array::from(( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to do Int64Array::from(("item", [0].as_slice()))
if !is_valid { | ||
// Handle invalid row by populating dummy data. | ||
offsets.push(1); | ||
non_zero_values.push(Series::empty("dummy", inner_dtype.as_ref())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you don't have to push anything here since it doesn't contribute to the final set of values in the series.
let is_valid = validity.map_or(true, |v| v.get_bit(i)); | ||
if !is_valid { | ||
// Handle invalid row by populating dummy data. | ||
offsets.push(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like these offsets are used. What we would do here is push the count_so_far
into this offset vec that can be converted directly into Offsets<i64>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct the offsets and dummy values doesnt contribute anything to the final values in the series. I kept them in order to preserve the validity of the original dense tensor. If I drop it the offsets and validity would not match.
src/daft-core/src/array/ops/cast.rs
Outdated
non_zero_values_array: &ListArray, | ||
offsets: &Offsets<i64>, | ||
) -> DaftResult<Box<dyn arrow2::array::Array>> { | ||
let item: Box<dyn arrow2::array::Array> = match inner_dtype { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for these types we have the with_match_numeric_daft_types
macro.
see
_ => with_match_numeric_daft_types!(comp_type, |$T| { |
src/daft-core/src/array/ops/cast.rs
Outdated
continue; | ||
} | ||
for j in 0..indices.unwrap().len() { | ||
let index = $non_zero_indices_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should move these
let index_array = $non_zero_indices_array
.get(i)
.unwrap()
.u64()
.unwrap();
let values_array = $non_zero_values_array
.get(i)
.unwrap()
.downcast::<$array_type>()
.unwrap()
.as_arrow();
to the outer loop. and then iterate over the values via
for (idx, val) in index_array.into_iter().zip(values_array.into_iter()) {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will let us avoid bounds checks and let this be vectorized.
src/daft-core/src/array/ops/cast.rs
Outdated
let validity = self.physical.validity(); | ||
let zero_series = Int64Array::from(( | ||
"item", | ||
Box::new(arrow2::array::Int64Array::from_iter([Some(0)].iter())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can do Int64Array::from((name, [0].as_slice()))
let is_valid = validity.map_or(true, |v| v.get_bit(i)); | ||
if !is_valid { | ||
// Handle invalid row by populating dummy data. | ||
offsets.push(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same feedback about offsets and elliding the empty Series as above.
hi @michaelvay Great work! Just let some minor feedback but overall looks great! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, Amazing first contribution! (I think the biggest we have seen so far!) @universalmind303 Do you mind helping out to resolve the merge conflicts? For context, we had a clean-up week which might have been the cause for conflicts.
Thanks @michaelvay! Looks like we might be good for merge after conflict resolution? |
Just merged! Thanks @michaelvay for all your hard work :) |
Closes #2494
It's still WIP, wanted to get some feedback on the draft and see if I am in the right path.
Still having issues with: