Add AnyDictionary Abstraction and Take ArrayRef in DictionaryArray::with_values #4707
}

impl<K: ArrowDictionaryKeyType> AnyDictionaryArray for DictionaryArray<K> {
    fn keys(&self) -> &dyn Array {
This will tie in nicely with #4705
@@ -460,21 +461,18 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
     /// }
     /// ```
     ///
-    pub fn with_values(&self, values: &dyn Array) -> Self {
+    pub fn with_values(&self, values: ArrayRef) -> Self {
This is a breaking change. It makes this API consistent with the other constructors and avoids needing to go via ArrayData.
Force-pushed from b10a9af to aaae7b0
Thank you @tustvold -- I think this is getting close. The last missing part in my mind is applying a transform to the values to operate on the array (aka a version of unary that works on dyn Array)
///
/// This can be used to efficiently implement kernels for all possible dictionary
/// keys without needing to create specialized implementations for each key type
pub fn as_any_dictionary_array(array: &dyn Array) -> Option<&dyn AnyDictionaryArray> {
Perhaps we can also add as_any_dictionary and as_any_dictionary_opt to AsArray?
https://docs.rs/arrow/latest/arrow/array/trait.AsArray.html#method.as_dictionary
Moved to AsArray
@@ -82,7 +82,7 @@ where
 {
     let dict_values = array.values().as_any().downcast_ref().unwrap();
     let values = unary::<T, F, T>(dict_values, op);
-    Ok(Arc::new(array.with_values(&values)))
+    Ok(Arc::new(array.with_values(Arc::new(values))))
What do you think about deprecating unary_dict and instead adding something that takes an AnyDictionaryArray?
I feel like the common pattern is:
/// Apply op to the values of `array` and return a new array, with the same keys and type but transformed values
fn unary_dict<T, F>(array: &dyn AnyDictionaryArray, op: F) -> Result<ArrayRef, ArrowError>
where
    T: ArrowPrimitiveType,
    F: Fn(T::Native) -> T::Native,
This would entail generating code that is generic over both the dictionary and all the primitive values, which is precisely what we are trying to avoid doing. This categorically should not be the pattern.
Instead users should write a function with a kernel like
fn my_amazing_kernel(a: &dyn Array) -> Result<ArrayRef> {
    if let Some(a) = as_any_dictionary_array(a) {
        // Recurse on the dictionary values, then reattach the original keys
        let values = my_amazing_kernel(a.values())?;
        return Ok(a.with_values(values));
    }
    downcast_primitive_array! {
        a => ...
    }
}
> What do you think about deprecating unary_dict
Yes, I would eventually hope that we can deprecate and remove unary_dict; it is a deeply problematic function.
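The recursion described above can be sketched without depending on arrow-rs at all. What follows is a minimal, std-only toy model under stated assumptions, not the library's implementation: `Array`, `AnyDictionary`, `Int64Array`, `DictArray`, `negate` and `hydrate` are hypothetical stand-ins whose method names are chosen to mirror `as_any_dictionary_opt`, `values` and `with_values` from this PR. It illustrates how erasing the key type behind a trait object lets a kernel handle dictionaries once, regardless of key type.

```rust
use std::fmt::Debug;
use std::sync::Arc;

// Hypothetical stand-ins for arrow's `Array` / `ArrayRef`; not the real API.
type ArrayRef = Arc<dyn Array>;

trait Array: Debug {
    /// Returns a key-type-erased view if this array is dictionary encoded.
    fn as_any_dictionary_opt(&self) -> Option<&dyn AnyDictionary> {
        None
    }
    /// Returns the raw values if this is a plain i64 array.
    fn as_i64_opt(&self) -> Option<&[i64]> {
        None
    }
}

/// Object-safe view of a dictionary array that hides the key type `K`.
trait AnyDictionary {
    fn values(&self) -> &ArrayRef;
    fn normalized_keys(&self) -> Vec<usize>;
    /// Builds a new dictionary with the same keys and the given values.
    fn with_values(&self, values: ArrayRef) -> ArrayRef;
}

#[derive(Debug)]
struct Int64Array(Vec<i64>);

impl Array for Int64Array {
    fn as_i64_opt(&self) -> Option<&[i64]> {
        Some(self.0.as_slice())
    }
}

#[derive(Debug)]
struct DictArray<K> {
    keys: Vec<K>,
    values: ArrayRef,
}

impl<K: Copy + Debug + Into<usize> + 'static> Array for DictArray<K> {
    fn as_any_dictionary_opt(&self) -> Option<&dyn AnyDictionary> {
        Some(self)
    }
}

impl<K: Copy + Debug + Into<usize> + 'static> AnyDictionary for DictArray<K> {
    fn values(&self) -> &ArrayRef {
        &self.values
    }
    fn normalized_keys(&self) -> Vec<usize> {
        self.keys.iter().map(|&k| -> usize { k.into() }).collect()
    }
    fn with_values(&self, values: ArrayRef) -> ArrayRef {
        Arc::new(DictArray { keys: self.keys.clone(), values })
    }
}

/// Example kernel: the dictionary branch is written once and never names `K`.
fn negate(a: &dyn Array) -> Result<ArrayRef, String> {
    if let Some(d) = a.as_any_dictionary_opt() {
        // Recurse on the values, then reattach the original keys.
        let values = negate(d.values().as_ref())?;
        return Ok(d.with_values(values));
    }
    match a.as_i64_opt() {
        Some(v) => Ok(Arc::new(Int64Array(v.iter().map(|x| -x).collect()))),
        None => Err("unsupported array type".to_string()),
    }
}

/// Inspection helper: resolves dictionary keys against values into plain i64s.
fn hydrate(a: &dyn Array) -> Option<Vec<i64>> {
    if let Some(d) = a.as_any_dictionary_opt() {
        let values = hydrate(d.values().as_ref())?;
        return d
            .normalized_keys()
            .into_iter()
            .map(|k| values.get(k).copied())
            .collect();
    }
    a.as_i64_opt().map(|v| v.to_vec())
}
```

In this sketch, calling `negate` on a `DictArray<u8>` and on a `DictArray<u16>` goes through the same non-generic code path, and the result stays dictionary encoded in both cases.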
b95d762 deprecates unary_dict and adds an example of how to handle this correctly
The integration test failure seems unrelated. I will see if it sorts itself out tomorrow; if not, I'll raise an upstream report.
}

/// Applies an infallible unary function to an array with primitive values.
#[deprecated(note = "Use arrow_array::AnyDictionaryArray")]
@tustvold @alamb I saw this when upgrading to the latest DF 31.0.0, and I think this deprecation note isn't very helpful: should users implement the functionality of unary_dyn themselves using things like as_any_dictionary_opt? Or shall we just change the implementation of unary_dyn to use the new APIs instead?
My major motivation for deprecating it was to move away from these quasi-generic APIs, which are the worst of both worlds. They have to be parameterised on the value type, whilst also not being properly generic. Additionally in this particular case it results in codegen for each type of key type, and returns a PrimitiveArray when it could preserve the dictionary.
There is an example on AnyDictionaryArray of how to handle this better: https://docs.rs/arrow-array/latest/arrow_array/array/trait.AnyDictionaryArray.html. Swapping the with_values for the take kernel will hydrate the dictionary as a PrimitiveArray if that is the desired behaviour; this will be both faster and result in less codegen.
TLDR this function has very peculiar semantics that I think encourage a problematic pattern
> Additionally in this particular case it results in codegen for each type of key type, and returns a PrimitiveArray when it could preserve the dictionary.
I can understand that it has to do case analysis for different key types, but why would it return a PrimitiveArray unnecessarily? It does call unary_dict, which preserves the dictionary, right?
Yea, when seeing the deprecation message, I was expecting that an equivalent function would be provided for users to migrate to, instead of them having to find a way to replicate the logic themselves. I'm thinking we can come up with a new implementation for unary_dyn without the key type enumeration problem?
> it does call unary_dict which preserves the dictionary right?
Aah yeah sorry, I'm confusing this with one of the other recently removed APIs.
> I'm thinking we can come up with a new implementation for unary_dyn without the key type enumeration problem
We could, but I would prefer to encourage users not to implement type dispatch at this level. Why duplicate the any-dictionary dispatch logic for every type of primitive value, rather than handle the dispatch logic once?
Hmm sorry, not sure I understand this. I think it would still be valuable to have a function similar to unary that works for dictionary arrays? unary_dyn bridges the gap by taking an ArrayRef as input, so users don't have to do the special handling of dictionary arrays. Otherwise, I guess they all have to repeat the steps of:
if let Some(d) = a.as_any_dictionary_opt() {
    // Recursively handle dictionary input
    let r = my_kernel(d.values().as_ref())?;
    return Ok(d.with_values(r));
}
by themselves?
Oops sorry, it should be something like:
pub fn my_function(array: ArrayRef, to_type: &DataType) -> ArrayRef {
    match to_type {
        DataType::Int64 => {
            unary_dyn::<_, Int64Type>(&array, |v| v.div_floor(MICROS_PER_SECOND)).unwrap()
        },
        ...
    }
}
Basically, we are doing some additional casting on top of the input array.
I still don't really follow, but this is sounding sufficiently use-case specific that perhaps it doesn't belong upstream?
Yea sure, we can keep this logic in our own repo for now. Just wanted to raise this in case someone else runs into the same issue :)
> Otherwise, I guess they all have to repeat the steps of:
I agree that the pattern "I have a function I want to apply to an array, and I would like it applied to either a non-dictionary or a dictionary-encoded version of that type" is very common.
It would be great to avoid clients having to write stuff like
if let Some(d) = a.as_any_dictionary_opt() {
    // Recursively handle dictionary input
    let r = my_kernel(d.values().as_ref())?;
    return Ok(d.with_values(r));
} else {
    // Apply my kernel to non-dictionary input
    return my_kernel(a.as_ref())
}
This has two drawbacks:
- Users must remember to handle dictionaries
- It won't automatically work for things like REE (or StringViews)
TBC I'm not objecting to what @alamb proposes, just observing that this function doesn't actually provide that interface.
If we can find a way to avoid people needing to duplicate logic, I'm all for it; I've just not been able to devise such an interface.
The challenge is that Rust doesn't support higher-rank types, so you can't pass a generic function as an argument.
Perhaps macros might work... But they have their own issues...
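To illustrate the limitation mentioned above: a closure argument is fixed to a single element type, so one closure cannot be applied to both i32 and f64 values inside the same function. One commonly used workaround, which is not proposed in this thread and is shown here only as a std-only sketch, is to pass a type implementing a trait with a generic method. `ElementOp`, `Negate`, `Values` and `map_values` are hypothetical names invented for this example.

```rust
use std::ops::Neg;

/// A "generic function as a value": a trait with a generic method.
/// The bounds on `T` must be fixed here, up front, for every implementor.
trait ElementOp {
    fn apply<T: Copy + Neg<Output = T>>(&self, v: T) -> T;
}

/// One concrete operation that works for any negatable element type.
struct Negate;

impl ElementOp for Negate {
    fn apply<T: Copy + Neg<Output = T>>(&self, v: T) -> T {
        -v
    }
}

/// Toy type-erased column; a stand-in for dispatching over value types.
#[derive(Debug, PartialEq)]
enum Values {
    I32(Vec<i32>),
    F64(Vec<f64>),
}

/// The same `op` is instantiated at a different `T` in each arm,
/// which a plain `Fn(T) -> T` closure parameter could not do.
fn map_values<O: ElementOp>(values: &Values, op: &O) -> Values {
    match values {
        Values::I32(v) => Values::I32(v.iter().map(|&x| op.apply(x)).collect()),
        Values::F64(v) => Values::F64(v.iter().map(|&x| op.apply(x)).collect()),
    }
}
```

The cost of this approach is that every operation needs its own named type, and the trait's bounds limit what an operation may do with an element, which may be part of why no clean interface was found.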
Which issue does this PR close?
Closes #.
Rationale for this change
apache/datafusion#5471 (comment)
What changes are included in this PR?
Are there any user-facing changes?