GH-39024: [Compute] Allow implicitly casting extension to storage types in compute functions #39200
base: main
|
I am not sure it's a good idea to do this automatically and exhaustively. Some operations will depend on the logical meaning of the types: you probably don't want the same casting rules to apply to This seems reasonable for functions that merely forward one of their arguments, though, such as |
I see. I think there are two things we can do to mitigate this problem:
|
There's also a more passive yet controlled approach: We add a |
That would work, but it means a lot of work for each extension type to keep up with additions to the range of default compute functions. I would suggest we perhaps need a more general semantic description of storage type equivalence. Draft:
class ExtensionType {
 public:
  // Storage equivalence for equality testing and hashing
  static constexpr uint32_t kEquality = 1;
  // Storage equivalence for ordered comparisons
  static constexpr uint32_t kOrdering = 2;
  // Storage equivalence for selections (filter, take, etc.)
  static constexpr uint32_t kSelection = 4;
  // Storage equivalence for arithmetic
  static constexpr uint32_t kArithmetic = 8;
  // Storage equivalence for explicit casts
  static constexpr uint32_t kCasting = 16;
  // Storage equivalence for all operations
  static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();
  // By default, an extension type can be implicitly handled as its storage type
  // for selections, equality testing and hashing.
  virtual uint32_t storage_equivalence() const { return kEquality | kSelection; }
};
@felipecrv WDYT? @alamb @tustvold How do you handle extension types in DataFusion? |
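As a rough illustration of the draft above, here is a minimal sketch of how a dispatcher might consult such flags before falling back to the storage type. Everything in it is an assumption made for illustration: `ExtensionTypeSketch`, `Ipv4TypeSketch` and `CanUseStorage` are hypothetical names and not part of Arrow, and only the flag values and the default return value are taken from the draft.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the draft above; not the real arrow::ExtensionType.
class ExtensionTypeSketch {
 public:
  static constexpr uint32_t kEquality = 1;
  static constexpr uint32_t kOrdering = 2;
  static constexpr uint32_t kSelection = 4;
  static constexpr uint32_t kArithmetic = 8;
  static constexpr uint32_t kCasting = 16;
  static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();

  virtual ~ExtensionTypeSketch() = default;
  // Default from the draft: selections, equality testing and hashing only.
  virtual uint32_t storage_equivalence() const { return kEquality | kSelection; }
};

// Hypothetical IPv4 type over uint32 storage that also opts in to ordering.
class Ipv4TypeSketch : public ExtensionTypeSketch {
 public:
  uint32_t storage_equivalence() const override {
    return kEquality | kSelection | kOrdering;
  }
};

// Assumed dispatcher-side check: a function category (e.g. kOrdering for a
// sort) may operate on storage only if the type declares every required bit.
bool CanUseStorage(const ExtensionTypeSketch& type, uint32_t required) {
  return (type.storage_equivalence() & required) == required;
}
```

With this sketch, a sort over the hypothetical IPv4 type could run on its storage, while arithmetic would still be rejected unless the type author opts in.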
TL;DR: we don't currently handle them. We have discussed how to do so in apache/datafusion#7923 and apache/arrow-rs#4472, but I would say we have not yet reached a consensus. cc @wjones127 and @yukkit, who I think may be interested |
Another way to handle this would be with a property on kernels (or even functions) instead of on ExtensionType:
class ARROW_EXPORT InputType {
 public:
  bool accepts_extension_types_if_storage_matches = false;
  // ...
};
// ...
filter_kernel->signature->in_types()[0].accepts_extension_types_if_storage_matches = true;
This would have the effect of disabling operation on extension arrays' storage by default, but allow enabling it per kernel or function. (The filter for accepted extension types could of course also be more articulated than a simple
I think this would put configurability where it needs to be, since the specific categories (selection, equality comparison, and arithmetic) have differing levels of difficulty in support:
Given that supporting various operations on the stored types is so nuanced, it seems we'd need to reserve handling of it for a system which can express those nuances - like kernel dispatch. Another solution would be to formalize what we (de facto) have done thus far: allow casting to/from storage types, allowing operation on storage only when explicitly requested. |
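To make the kernel-side opt-in concrete, here is a minimal sketch of how an input matcher might honor such a flag. It is only an illustration under stated assumptions: `TypeSketch` and `InputTypeSketch` are hypothetical, simplified stand-ins and not Arrow's `DataType` or `InputType`; only the flag name comes from the comment above.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified type model; not the real arrow::DataType.
struct TypeSketch {
  std::string name;                     // e.g. "uint32" or "ipv4"
  std::shared_ptr<TypeSketch> storage;  // non-null only for extension types
};

// Hypothetical input matcher mirroring the proposed flag.
struct InputTypeSketch {
  std::string expected;  // name of the accepted (storage) type
  bool accepts_extension_types_if_storage_matches = false;

  bool Matches(const TypeSketch& type) const {
    if (type.name == expected) return true;
    // Extension input: accepted only when this kernel has opted in
    // and the extension type's storage is the expected type.
    return accepts_extension_types_if_storage_matches &&
           type.storage != nullptr && type.storage->name == expected;
  }
};
```

Under this sketch, a selection kernel such as filter could set the flag and accept any extension array over matching storage, while an arithmetic kernel that leaves the default would keep rejecting extension inputs.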
+1 on what @bkietz said: the casting behavior should be declared on the operation, not on the type itself. Ideally we would have a declarative way of defining more elaborate type constraints on kernels and a decidable solver that dispatches to the appropriate implementation. If we implement that we have to write a checker that ensures there is always only one solution (ordering of declaration of the kernel implementations shouldn't matter). |
Ok, say I have an IPv4 extension type backed by a UInt32 storage type. I want to declare that ordering on IPv4 extension values is the same as on the storage type (so that, e.g., sorting would work automatically). How does that work with the |
The operation declares that it can receive int32 and all logical types defined in terms of int32 (like
But to be fair, I would have to think more carefully about these things before accepting any magical implicit behavior. See how much pain the C implicit casts (that look inoffensive) have created in the world. |
Yes, but the problem here is another extension type based on |
FWIW the design in arrow-rs currently is that extension types only exist in the metadata. The rationale being that by default kernels should treat them as the storage type. Imo it goes against the whole purpose of extension types if the only way to support them is for every kernel to be aware of all possible extension types. That's not to say there can't be special kernels for certain extension types, potentially selected as part of a query engine's planning, but I can't help feeling if an array can't be correctly treated as its physical type, it isn't really an extension type but something else entirely... That's all to say I agree with coercing extension arrays to their physical types, arrow-rs goes even further by not having an extension array at all. However, this is just my view on this issue, and I suspect others will feel differently |
Mark, to be addressed later |
I quite like @pitrou's description of equivalence between a type and its storage, which lets extension type authors get a lot of mileage out of existing internals for simple cases. For example, you probably want Allowing an implicit or automatic cast to storage seems like an unsafe precedent; however, allowing I can't currently think of an example where an explicit
I think it is up to extension type authors to decide what logical manipulation can and cannot be performed on an array. The extension type purpose (IMO) is that implementations take care of the physical manipulations (filter, take, slice, concatenate, read/write files). What I like about Antoine's suggestion is that it makes opting in to more storage behaviour very easy (since many extension types probably want to opt in to some or all storage behaviour) but is safe by default. |
I think this conversation needs to move to the mailing list. I'll open a thread with a summary |
I think another problem is that it may be hard to define some logic for an extension type, with or without these casts. For example, if I want to define a "cast_extension_to_string", I cannot add my logic to the MetaFunction "Cast". Do we have a solution for this? |
Rationale for this change
Extension types are only logical, i.e. each extension type has an underlying native storage type. We can allow implicit casting from extension types to their storage types. I believe this can make ExtensionTypes much easier to use.
What changes are included in this PR?
DispatchBest, if no matching kernels exist, allow extension types to be cast to their storage types and match again. The kernel with the minimum number of casts is selected. If there are multiple best kernels, report an error. The dispatching algorithm is similar to C++'s overload resolution.
DispatchBest, call EnsureExtensionToStorage to simply replace extension types with their storage types, because these functions do not provide native kernels for extension types, so we don't need to count the number of casts needed.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.
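The dispatch rule described under "What changes are included in this PR?" (pick the kernel that needs the fewest extension-to-storage casts, and report an error on a tie) could look roughly like the sketch below. This is not the PR's code: `ArgTypeSketch`, `SignatureSketch`, `CastsNeeded` and `DispatchSketch` are hypothetical names, types are reduced to plain strings, and exceptions stand in for whatever error reporting the real implementation uses.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical argument type: `storage` is empty for non-extension types.
struct ArgTypeSketch {
  std::string name;
  std::string storage;
};

// Hypothetical kernel signature: one expected type name per argument.
using SignatureSketch = std::vector<std::string>;

// Number of extension-to-storage casts needed for `args` to match `sig`,
// or -1 if the signature cannot match even with casts.
int CastsNeeded(const SignatureSketch& sig, const std::vector<ArgTypeSketch>& args) {
  if (sig.size() != args.size()) return -1;
  int casts = 0;
  for (std::size_t i = 0; i < sig.size(); ++i) {
    if (args[i].name == sig[i]) continue;  // exact match, no cast
    if (!args[i].storage.empty() && args[i].storage == sig[i]) {
      ++casts;  // matches after casting the extension type to its storage
      continue;
    }
    return -1;
  }
  return casts;
}

// Index of the unique kernel needing the fewest casts. Throws if no kernel
// matches or if several kernels tie for the minimum.
std::size_t DispatchSketch(const std::vector<SignatureSketch>& kernels,
                           const std::vector<ArgTypeSketch>& args) {
  int best = -1;
  std::size_t best_index = 0;
  bool ambiguous = false;
  for (std::size_t i = 0; i < kernels.size(); ++i) {
    const int casts = CastsNeeded(kernels[i], args);
    if (casts < 0) continue;
    if (best < 0 || casts < best) {
      best = casts;
      best_index = i;
      ambiguous = false;
    } else if (casts == best) {
      ambiguous = true;
    }
  }
  if (best < 0) throw std::invalid_argument("no matching kernel");
  if (ambiguous) throw std::invalid_argument("ambiguous kernel dispatch");
  return best_index;
}
```

An exact match needs zero casts, so it always wins over any kernel that requires a cast, which is consistent with only falling back to storage when no matching kernel exists.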