Add an ExtensionType to DataType enum #4472
I might be missing something here, but why would it be lost? Schema metadata should round-trip over the C data interface.
My major objection to this approach is that it undermines the transparency of extension types. I feel quite strongly that only codepaths explicitly concerned with extension types should need to concern themselves with them; for example, the take or arithmetic kernels should not need to know about extension types. However, adding an extension variant to DataType would force every such codepath to account for it.
This works well for RecordBatch, but not for an individual array transported independently of any batch. Basically, arrays themselves have no way to be tagged as an extension array, since they don't contain a field where that metadata is stored; they are only extension arrays in the context of a batch.
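As a concrete illustration of the status quo (a minimal sketch; the ARROW:extension:name key comes from the Arrow spec, and arrow.uuid is the canonical UUID extension):

```rust
use std::collections::HashMap;

use arrow_array::new_null_array;
use arrow_schema::{DataType, Field};

fn main() {
    // Extension info lives on the Field, under the metadata keys the Arrow
    // spec reserves for extension types.
    let field = Field::new("id", DataType::FixedSizeBinary(16), false).with_metadata(
        HashMap::from([("ARROW:extension:name".to_string(), "arrow.uuid".to_string())]),
    );

    // A bare array only knows its physical DataType; hand it off without the
    // Field and the extension tag is gone.
    let array = new_null_array(field.data_type(), 3);
    assert_eq!(array.data_type(), &DataType::FixedSizeBinary(16));
}
```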
I definitely agree, and don't want to make these operations more complex than they ought to be. If we can think of another place to put this information, I'm open to that. (A bit of a tangent, but...) In my ideal world, there would be a logical type enum and a physical type enum. Physical types would be the current DataType enum, with logical types layered on top to carry things like extension semantics.
FWIW my workaround for now is to just wrap it in a record batch and unwrap on the other side. But it would be nice to find a way to not have to do that.
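A sketch of that workaround (the helper names here are illustrative, not an arrow-rs API):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{ArrowError, Field, Schema};

// Wrap: the Field (and its extension metadata) rides along in the Schema of
// a single-column batch.
fn wrap(array: ArrayRef, field: Field) -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![field]));
    RecordBatch::try_new(schema, vec![array])
}

// Unwrap: recover both the array and its Field on the other side.
fn unwrap(batch: &RecordBatch) -> (ArrayRef, Field) {
    (batch.column(0).clone(), batch.schema().field(0).clone())
}
```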
I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level. Whilst it isn't a true logical vs physical separation, I think that is a helpful way to conceptualize it.
This feels like a limitation of the way the Python conversion is implemented. In particular, the schema is inferred from the array's data type. If instead there were a way to provide a Field explicitly, the extension metadata could survive the conversion.
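For example, exporting the Field itself over the C data interface keeps the metadata. A sketch, assuming a recent arrow-rs with the ffi feature enabled, where FFI_ArrowSchema::try_from(&Field) carries field-level metadata (older versions may drop it):

```rust
use std::collections::HashMap;

use arrow_schema::ffi::FFI_ArrowSchema;
use arrow_schema::{ArrowError, DataType, Field};

fn main() -> Result<(), ArrowError> {
    let field = Field::new("id", DataType::FixedSizeBinary(16), false).with_metadata(
        HashMap::from([("ARROW:extension:name".to_string(), "arrow.uuid".to_string())]),
    );

    // Export the Field, not just its DataType, so the metadata travels too.
    let _ffi_schema = FFI_ArrowSchema::try_from(&field)?;
    Ok(())
}
```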
I think that could be a decent approach, although I'm still trying to understand what it would look like. It sounds like the arrow-rs type system is closed but can be wrapped in a higher-level type system (whereas the C++ kernels are extension-aware). So it sounds like the point to add extension types is when building extensions in DataFusion. Eventually, I think it would be nice to have an example in arrow-datafusion showing how to add support for a simple extension type (such as UUID) in the engine. Basically, the end result would be showing that SELECT gen_random_uuid() outputs a UUID-typed column.
Showing that you can add methods that output extension types (like the gen_random_uuid example above) would make for a compelling end-to-end demonstration.
I encountered the same trouble when trying to add a custom type. Although the extension type can be marked through the Field's metadata, that metadata is lost during array processing, for example in a DataFusion UDF.
I'm taking a stab at migrating my geospatial code from arrow2 to arrow-rs. In particular, a geospatial algorithm would have to return a Field alongside each output array just to keep the extension metadata. Edit: If I'm understanding correctly, it's also impossible to implement impl TryFrom<&dyn Array> for GeometryArray like I could in arrow2, because the extension tag lives on the Field, not on the array's DataType.
@yukkit is contemplating User Defined Types in DataFusion, and the arrow extension type mechanism is the obvious implementation I think -- see apache/datafusion#7923
@tustvold are you proposing something like the following?

```rust
enum DataType {
    // ....
    List(FieldRef),
    /// Extension type, with potentially embedded metadata in the field reference
    Extension(FieldRef),
}
```
This proposal runs afoul of the transparency objection above, but I think this structure would allow @kylebarron to implement:

```rust
impl TryFrom<&dyn Array> for GeometryArray {
    fn try_from(arr: &dyn Array) -> Result<Self> {
        match arr.data_type() {
            DataType::Extension(field) if is_geo_type(field.metadata()) => {
                // ... do the conversion ...
            }
            dt => Err(ArrowError::InvalidArgumentError(format!(
                "Unsupported datatype: {dt}"
            ))),
        }
    }
}
```
No, I'm proposing not making changes to DataType, and instead using the Field metadata that already exists. This way we avoid conflating physical and logical type information. This is the same mechanism we use in IOx to encode the logical notion of tag vs field columns.
So how would we implement @kylebarron's use case? Perhaps via a RecordBatch (with a single column)?
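One way that could look while sticking with Field metadata (a sketch; GeometryArray and is_geo_type are stand-ins from the discussion above, and the geoarrow. prefix is illustrative):

```rust
use std::collections::HashMap;

use arrow_array::ArrayRef;
use arrow_schema::FieldRef;

struct GeometryArray(ArrayRef);

// Decide extension-ness from the Field metadata rather than the DataType.
fn is_geo_type(metadata: &HashMap<String, String>) -> bool {
    metadata
        .get("ARROW:extension:name")
        .is_some_and(|name| name.starts_with("geoarrow."))
}

// The conversion takes the (array, field) pair, since the array alone does
// not carry the extension tag.
impl TryFrom<(ArrayRef, FieldRef)> for GeometryArray {
    type Error = String;

    fn try_from((array, field): (ArrayRef, FieldRef)) -> Result<Self, Self::Error> {
        if is_geo_type(field.metadata()) {
            Ok(GeometryArray(array))
        } else {
            Err(format!("unsupported field: {field:?}"))
        }
    }
}
```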
My interpretation of this is that it's a "zero-sum" architecture decision, in the sense that if you don't want to conflate logical and physical types in the DataType enum, then the logical type information has to travel outside the array, and any codepath that only sees the array will drop it.
You would need whatever performs the kernel selection to have the Field, most likely via the Schema. I'm not sure you necessarily need this information simultaneously with the Array? For example, the DF PhysicalExpr could have already extracted the necessary metadata at plan time (although I think it has access to the schema anyway).
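A sketch of that plan-time extraction (UuidExpr is an illustrative name, not a DataFusion API):

```rust
use arrow_schema::Schema;

// Resolve extension-ness from the Schema once, during planning; execution
// then only needs the column's Array.
struct UuidExpr {
    column: usize,
    is_uuid: bool,
}

impl UuidExpr {
    fn plan(schema: &Schema, column: usize) -> Self {
        let is_uuid = schema
            .field(column)
            .metadata()
            .get("ARROW:extension:name")
            .is_some_and(|name| name == "arrow.uuid");
        Self { column, is_uuid }
    }
}
```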
@tustvold do you think extension types are something end users should see and work with directly, or only an internal concern for query engines? If it's the latter, I think I can understand the position to keep extension types separate. But if it's the former, it's hard to see how we can provide a decent UX without bringing the extension type into the array itself. For example, if we return RecordBatches with a UUID column, a user might reasonably be surprised that the UUID column prints as the raw bytes and not a hyphenated UUID string.
RecordBatch/Schema is but one way that users might choose to expose logical type information; they might also define their own array abstractions that wrap arrow arrays, or their own schema abstraction at plan time, etc. As @kylebarron rightly states, it's a zero-sum API challenge: either all of arrow must become aware of extension types, or they are confined to the areas that actually care. It seems odd to me to optimize the design here for things that are not present in the specification at the expense of everything else; further, it seems unfortunate to optimize for one particular way of encoding logical type information.

I don't have all the answers here. I don't know what a general-purpose logical type abstraction looks like, or if such a thing even exists, but it does seem that the core library shouldn't be opinionated in this regard.
For Python conversion specifically, this might be solved with the new PyCapsule interface (ref #5067), because the protocol exports a schema capsule alongside the array, and that schema can describe a full Field, extension metadata included.
FYI @yukkit has created a PR showing how user-defined types could work in DataFusion.
### What

Basic type conversions from TransportChunk to RecordBatch and back.

Adding the round-trip test turned up an interesting issue: TransportChunk <-> RecordBatch fails to round-trip because we lose the ExtensionType encapsulation that used to be encoded by arrow2. While on the surface this isn't immediately problematic, since we don't care about ExtensionTypes, the discussion indicates there are in fact going to be very real pain points when it comes to writing semantic data processing engines on arrow-rs. This is because the metadata is attached to the Field, not the DataType, and there are many processing contexts where the context of that field is lost. apache/arrow-rs#4472
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Extension types are annotated in field metadata. This works well with record batches, but when exporting/importing an array over the C data interface, the extension type metadata is lost.
The C++ implementation solves this by having an ExtensionType class, and it always exports that metadata over the C data interface here:
https://github.com/apache/arrow/blob/b9aec9ad2b655817b8925462e4e2dd6973807e23/cpp/src/arrow/c/bridge.cc#L243-L252
Describe the solution you'd like
I'd propose adding a new enum variant to DataType:
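As a hedged sketch (the exact shape here is an assumption, loosely modeled on the C++ ExtensionType mechanism), such a variant might look like:

```rust
// Hypothetical shape of the proposed variant; not an actual arrow-rs API.
enum DataType {
    // ... existing variants ...
    Extension {
        /// Extension name, e.g. "arrow.uuid"
        name: String,
        /// Physical layout of the underlying data
        storage_type: Box<DataType>,
        /// Optional serialized extension metadata
        metadata: Option<String>,
    },
}
```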
Then make sure the C data interface implementation handles exporting and importing this type.
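Continuing the hypothetical sketch above, export could peel the Extension layer off into the standard ARROW:extension:* metadata keys and hand the storage type to the existing code path:

```rust
use std::collections::HashMap;

// Sketch only: flatten the hypothetical Extension variant into the metadata
// keys defined by the C data interface spec.
fn export_metadata(dt: &DataType) -> (&DataType, HashMap<String, String>) {
    match dt {
        DataType::Extension { name, storage_type, metadata } => {
            let mut md =
                HashMap::from([("ARROW:extension:name".to_string(), name.clone())]);
            if let Some(m) = metadata {
                md.insert("ARROW:extension:metadata".to_string(), m.clone());
            }
            (storage_type, md)
        }
        other => (other, HashMap::new()),
    }
}
```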
Describe alternatives you've considered
We could add an extension type registry like C++, but that seems heavier than we really need.
Additional context
https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays
Previous discussions: