Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AnyDictionary Abstraction and Take ArrayRef in DictionaryArray::with_values #4707

Merged
merged 3 commits into from
Aug 17, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 6 additions & 2 deletions arrow-arith/src/arity.rs
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ where
{
let dict_values = array.values().as_any().downcast_ref().unwrap();
let values = unary::<T, F, T>(dict_values, op);
Ok(Arc::new(array.with_values(&values)))
Ok(Arc::new(array.with_values(Arc::new(values))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about deprecating unary_dict and instead adding something that takes an AnyDictionaryArray ?

I feel like the common pattern is:

/// Apply op to the values of `array` and return a new array, with the same keys and type but transformed values 
fn unary_dict(array: &dyn DictionaryArrayAny, op: F) -> Result<ArrayRef, ArrowError>
where:
     F: Fn(T::Native) -> T::Native,
    F: Fn(T::Native) -> T::Native,

Copy link
Contributor Author

@tustvold tustvold Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would entail generating code that is generic over both the dictionary and all the primitive values, which is precisely what we are trying to avoid doing. This categorically should not be the pattern.

Instead users should write a function with a kernel like

fn my_amazing_kernel(a: &dyn Array) -> Result<ArrayRef> {
    if let Some(a) = as_any_dictionary_array(a) {
        let values = my_amazing_kernel(a.values())?;
        Ok(a.with_values(values))
    }

    downcast_primitive_array! {
        a => ...
    }
}

What do you think about deprecating unary_dict

Yes, I eventually would hope that we can deprecate and remove unary_dict, it is a deeply problematic function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b95d762 deprecates unary_dict and adds an example of how to handle this correctly

}

/// A helper function that applies a fallible unary function to a dictionary array with primitive value type.
Expand All @@ -105,10 +105,11 @@ where

let dict_values = array.values().as_any().downcast_ref().unwrap();
let values = try_unary::<T, F, T>(dict_values, op)?;
Ok(Arc::new(array.with_values(&values)))
Ok(Arc::new(array.with_values(Arc::new(values))))
}

/// Applies an infallible unary function to an array with primitive values.
#[deprecated(note = "Use arrow_array::AnyDictionaryArray")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold @alamb saw this when upgrading to the latest DF 31.0.0, and I think this deprecation note doesn't look very helpful: should users implement the functionalities of unary_dyn by themselves using things like as_any_dictionary_opt? shall we just change the implementation of unary_dyn using the new APIs instead?

Copy link
Contributor Author

@tustvold tustvold Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My major motivation for deprecating it was to move away from these quasi-generic APIs, that are the worst of both worlds. They have to parameterised on the value type, whilst also not being properly generic. Additionally in this particular case it results in codegen for each type of key type, and returns a PrimitiveArray when it could preserve the dictionary.

There is an example on AnyDictionaryArray of how to handle this better - https://docs.rs/arrow-array/latest/arrow_array/array/trait.AnyDictionaryArray.html. Swapping the with_values for the take kernel will hydrate the dictionary as a PrimitiveArray if that is the desired behaviour, this will be both faster and result in less codegen

TLDR this function has very peculiar semantics that I think encourage a problematic pattern

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally in this particular case it results in codegen for each type of key type, and returns a PrimitiveArray when it could preserve the dictionary.

I can understand it has to do case analysis for different key types, but why it would return a PrimitiveArray unnecessarily? it does call unary_dict which preserves the dictionary right?

Yea, when seeing the deprecation message, I was expecting that a equivalent function would be provided for users to migrate to, instead of having to find a way to replicate the logic by themselves. I'm thinking we can come up with a new implementation for unary_dyn without the key type enumeration problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does call unary_dict which preserves the dictionary right?

Aah yeah sorry, I'm confusing this with one of the other recently removed APIs

I'm thinking we can come up with a new implementation for unary_dyn without the key type enumeration problem

We could do, I would prefer to encourage users to not implement type dispatch at this level. Why duplicate the any-dictionary dispatch logic for every type of primitive value, and not just handle the dispatch logic once?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm sorry, not sure I understand this. I think it would still be valuable to have a function similar to unary but works for dictionary arrays? unary_dyn bridges the gap by taking an ArrayRef as input so users don't have to do the special handling of dictionary array. Otherwise, I guess they all have to repeat the steps of:

    if let Some(d) = a.as_any_dictionary_opt() {
        // Recursively handle dictionary input
        let r = my_kernel(d.values().as_ref())?;
        return Ok(d.with_values(r));
    }

by themselves?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, it should be something like:

pub fn my_function(array: ArrayRef, to_type: &DataType) -> ArrayRef {
    match to_type {
        DataType::Int64 => {
            unary_dyn::<_, Int64Type>(&array, |v| v.div_floor(MICROS_PER_SECOND)).unwrap()
        },
        ...
    }
}

basically we are doing some additional casting on top of the input array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't really follow, but this is sounding sufficiently use-case specific that perhaps it doesn't belong upstream??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea sure, we can keep this logic in our own repo for now. Just wanna raise this in case someone else run into the same issue :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, I guess they all have to repeat the steps of:

I agree the pattern of I have a function I want to apply to an array and I would like it applied either to a non-dictionary or dictionary encoded version of that type is very common.

It would be great to avoid clients having to write stuff like

    if let Some(d) = a.as_any_dictionary_opt() {
        // Recursively handle dictionary input
        let r = my_kernel(d.values().as_ref())?;
        return Ok(d.with_values(r));
    } else {
       // apply my kernel to dictionary input)
       return my_kernel(a.as_ref())
    }

Which forces users to:

  1. Remember to handle dictionaries
  2. won't automatically work for things like REE (or StringViews)

Copy link
Contributor Author

@tustvold tustvold Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBC I'm not objecting to what @alamb proposes, just observing that this function doesn't actually provide that interface.

If we can find a way to avoid people needing to duplicate logic, I'm all for it, I've just not been able to devise such an interface

The challenge is Rust doesn't support higher rank types, so you can't pass a generic function as an argument.

Perhaps macros might work... But they have their own issues...

pub fn unary_dyn<F, T>(array: &dyn Array, op: F) -> Result<ArrayRef, ArrowError>
where
T: ArrowPrimitiveType,
Expand All @@ -134,6 +135,7 @@ where
}

/// Applies a fallible unary function to an array with primitive values.
#[deprecated(note = "Use arrow_array::AnyDictionaryArray")]
pub fn try_unary_dyn<F, T>(array: &dyn Array, op: F) -> Result<ArrayRef, ArrowError>
where
T: ArrowPrimitiveType,
Expand Down Expand Up @@ -436,6 +438,7 @@ mod tests {
use arrow_array::types::*;

#[test]
#[allow(deprecated)]
fn test_unary_f64_slice() {
let input =
Float64Array::from(vec![Some(5.1f64), None, Some(6.8), None, Some(7.2)]);
Expand All @@ -455,6 +458,7 @@ mod tests {
}

#[test]
#[allow(deprecated)]
fn test_unary_dict_and_unary_dyn() {
let mut builder = PrimitiveDictionaryBuilder::<Int8Type, Int32Type>::new();
builder.append(5).unwrap();
Expand Down
2 changes: 1 addition & 1 deletion arrow-arith/src/temporal.rs
Original file line number Diff line number Diff line change
Expand Up @@ -462,7 +462,7 @@ where
downcast_dictionary_array!(
array => {
let values = time_fraction_dyn(array.values(), name, op)?;
Ok(Arc::new(array.with_values(&values)))
Ok(Arc::new(array.with_values(values)))
}
dt => return_compute_error_with!(format!("{name} does not support"), dt),
)
Expand Down
116 changes: 101 additions & 15 deletions arrow-array/src/array/dictionary_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -434,6 +434,7 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
/// Panics if `values` has a length less than the current values
///
/// ```
/// # use std::sync::Arc;
/// # use arrow_array::builder::PrimitiveDictionaryBuilder;
/// # use arrow_array::{Int8Array, Int64Array, ArrayAccessor};
/// # use arrow_array::types::{Int32Type, Int8Type};
Expand All @@ -451,7 +452,7 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
/// let values: Int64Array = typed_dictionary.values().unary(|x| x as i64);
///
/// // Create a Dict(Int32,
/// let new = dictionary.with_values(&values);
/// let new = dictionary.with_values(Arc::new(values));
///
/// // Verify values are as expected
/// let new_typed = new.downcast_dict::<Int64Array>().unwrap();
Expand All @@ -460,21 +461,18 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
/// }
/// ```
///
pub fn with_values(&self, values: &dyn Array) -> Self {
pub fn with_values(&self, values: ArrayRef) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, and makes this API consistent with the other constructors and avoids needing to go via ArrayData

assert!(values.len() >= self.values.len());

let builder = self
.to_data()
.into_builder()
.data_type(DataType::Dictionary(
Box::new(K::DATA_TYPE),
Box::new(values.data_type().clone()),
))
.child_data(vec![values.to_data()]);

// SAFETY:
// Offsets were valid before and verified length is greater than or equal
Self::from(unsafe { builder.build_unchecked() })
let data_type = DataType::Dictionary(
Box::new(K::DATA_TYPE),
Box::new(values.data_type().clone()),
);
Self {
data_type,
keys: self.keys.clone(),
values,
is_ordered: false,
}
}

/// Returns `PrimitiveDictionaryBuilder` of this dictionary array for mutating
Expand Down Expand Up @@ -930,6 +928,94 @@ where
}
}

/// A [`DictionaryArray`] with the key type erased
///
/// This can be used to efficiently implement kernels for all possible dictionary
/// keys without needing to create specialized implementations for each key type
///
/// For example
///
/// ```
/// # use arrow_array::*;
/// # use arrow_array::cast::AsArray;
/// # use arrow_array::builder::PrimitiveDictionaryBuilder;
/// # use arrow_array::types::*;
/// # use arrow_schema::ArrowError;
/// # use std::sync::Arc;
///
/// fn to_string(a: &dyn Array) -> Result<ArrayRef, ArrowError> {
/// if let Some(d) = a.as_any_dictionary_opt() {
/// // Recursively handle dictionary input
/// let r = to_string(d.values().as_ref())?;
/// return Ok(d.with_values(r));
/// }
/// downcast_primitive_array! {
/// a => Ok(Arc::new(a.iter().map(|x| x.map(|x| x.to_string())).collect::<StringArray>())),
/// d => Err(ArrowError::InvalidArgumentError(format!("{d:?} not supported")))
/// }
/// }
///
/// let result = to_string(&Int32Array::from(vec![1, 2, 3])).unwrap();
/// let actual = result.as_string::<i32>().iter().map(Option::unwrap).collect::<Vec<_>>();
/// assert_eq!(actual, &["1", "2", "3"]);
///
/// let mut dict = PrimitiveDictionaryBuilder::<Int32Type, UInt16Type>::new();
/// dict.extend([Some(1), Some(1), Some(2), Some(3), Some(2)]);
/// let dict = dict.finish();
///
/// let r = to_string(&dict).unwrap();
/// let r = r.as_dictionary::<Int32Type>().downcast_dict::<StringArray>().unwrap();
/// assert_eq!(r.keys(), dict.keys()); // Keys are the same
///
/// let actual = r.into_iter().map(Option::unwrap).collect::<Vec<_>>();
/// assert_eq!(actual, &["1", "1", "2", "3", "2"]);
/// ```
///
/// See [`AsArray::as_any_dictionary_opt`] and [`AsArray::as_any_dictionary`]
pub trait AnyDictionaryArray: Array {
/// Returns the primitive keys of this dictionary as an [`Array`]
fn keys(&self) -> &dyn Array;

/// Returns the values of this dictionary
fn values(&self) -> &ArrayRef;

/// Returns the keys of this dictionary as usize
///
/// The values for nulls will be arbitrary, but are guaranteed
/// to be in the range `0..self.values.len()`
///
/// # Panic
///
/// Panics if `values.len() == 0`
fn normalized_keys(&self) -> Vec<usize>;

/// Create a new [`DictionaryArray`] replacing `values` with the new values
///
/// See [`DictionaryArray::with_values`]
fn with_values(&self, values: ArrayRef) -> ArrayRef;
}

impl<K: ArrowDictionaryKeyType> AnyDictionaryArray for DictionaryArray<K> {
fn keys(&self) -> &dyn Array {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will tie in nicely with #4705

&self.keys
}

fn values(&self) -> &ArrayRef {
self.values()
}

fn normalized_keys(&self) -> Vec<usize> {
let v_len = self.values().len();
assert_ne!(v_len, 0);
let iter = self.keys().values().iter();
iter.map(|x| x.as_usize().min(v_len)).collect()
}

fn with_values(&self, values: ArrayRef) -> ArrayRef {
Arc::new(self.with_values(values))
}
}

#[cfg(test)]
mod tests {
use super::*;
Expand Down
20 changes: 20 additions & 0 deletions arrow-array/src/cast.rs
Original file line number Diff line number Diff line change
Expand Up @@ -833,6 +833,14 @@ pub trait AsArray: private::Sealed {
fn as_dictionary<K: ArrowDictionaryKeyType>(&self) -> &DictionaryArray<K> {
self.as_dictionary_opt().expect("dictionary array")
}

/// Downcasts this to a [`AnyDictionaryArray`] returning `None` if not possible
fn as_any_dictionary_opt(&self) -> Option<&dyn AnyDictionaryArray>;

/// Downcasts this to a [`AnyDictionaryArray`] panicking if not possible
fn as_any_dictionary(&self) -> &dyn AnyDictionaryArray {
self.as_any_dictionary_opt().expect("any dictionary array")
}
}

impl private::Sealed for dyn Array + '_ {}
Expand Down Expand Up @@ -874,6 +882,14 @@ impl AsArray for dyn Array + '_ {
) -> Option<&DictionaryArray<K>> {
self.as_any().downcast_ref()
}

fn as_any_dictionary_opt(&self) -> Option<&dyn AnyDictionaryArray> {
let array = self;
downcast_dictionary_array! {
array => Some(array),
_ => None
}
}
}

impl private::Sealed for ArrayRef {}
Expand Down Expand Up @@ -915,6 +931,10 @@ impl AsArray for ArrayRef {
) -> Option<&DictionaryArray<K>> {
self.as_ref().as_dictionary_opt()
}

fn as_any_dictionary_opt(&self) -> Option<&dyn AnyDictionaryArray> {
self.as_ref().as_any_dictionary_opt()
}
}

#[cfg(test)]
Expand Down
2 changes: 1 addition & 1 deletion arrow-row/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1642,7 +1642,7 @@ mod tests {
// Construct dictionary with a timezone
let dict = a.finish();
let values = TimestampNanosecondArray::from(dict.values().to_data());
let dict_with_tz = dict.with_values(&values.with_timezone("+02:00"));
let dict_with_tz = dict.with_values(Arc::new(values.with_timezone("+02:00")));
let d = DataType::Dictionary(
Box::new(DataType::Int32),
Box::new(DataType::Timestamp(
Expand Down
Loading