WIP: [Docs] Update extension type examples to not use UUID #43849

Closed
wants to merge 130 commits into from
Changes from 3 commits
130 commits
4db2ca0
UuidType -> RationalType in the docs
khwilson Aug 27, 2024
4d95f22
fix json formatting
khwilson Aug 27, 2024
fde3215
fix some typos
khwilson Aug 27, 2024
d875d2c
Update docs/source/python/extending_types.rst
khwilson Aug 28, 2024
eb69055
Update docs/source/python/extending_types.rst
khwilson Aug 28, 2024
3fc6842
response to ianmcook
khwilson Aug 28, 2024
0e051ea
define parameters
khwilson Aug 31, 2024
8614fd5
Update python/pyarrow/types.pxi
khwilson Sep 8, 2024
03b3fd9
Update docs/source/python/extending_types.rst
khwilson Sep 8, 2024
2abdcb8
more edits
khwilson Sep 8, 2024
4c66e4d
missed one formatting
khwilson Sep 8, 2024
253156b
import pyarrow in doctests examples
khwilson Sep 8, 2024
da6cdf2
import pyarrow in more doctests examples
khwilson Sep 8, 2024
d31d59b
MINOR: [Java] Bump dep.slf4j.version from 2.0.13 to 2.0.16 in /java (…
dependabot[bot] Aug 26, 2024
8855c59
MINOR: [R] Add missing PR num to news.md item (#43811)
amoeba Aug 26, 2024
7bc2e01
MINOR: [Java] Bump dep.junit.jupiter.version from 5.10.3 to 5.11.0 in…
dependabot[bot] Aug 26, 2024
20f8357
GH-15058: [C++][Python] Native support for UUID (#37298)
rok Aug 26, 2024
95bce2e
MINOR: [Go] Bump github.com/hamba/avro/v2 from 2.24.1 to 2.25.0 in /g…
dependabot[bot] Aug 27, 2024
e8912c9
GH-43667: [Java] Keeping Flight default header size consistent betwee…
PANKAJ9768 Aug 27, 2024
aa8950f
MINOR: [Go] Bump github.com/substrait-io/substrait-go from 0.6.0 to 0…
dependabot[bot] Aug 27, 2024
db0029f
MINOR: [Java] Downgrade gRPC to 1.65 (#43839)
lidavidm Aug 27, 2024
5b98125
MINOR: [Java] Bump org.apache.commons:commons-compress from 1.27.0 to…
dependabot[bot] Aug 27, 2024
ca4c756
MINOR: [C#] Bump Microsoft.NET.Test.Sdk from 17.10.0 to 17.11.0 in /c…
dependabot[bot] Aug 27, 2024
909ae17
GH-41056: [GLib][FlightRPC] Add gaflight_client_do_put() and related …
kou Aug 27, 2024
ef33625
GH-43815: [CI][Packaging][Python] Avoid uploading wheel to gemfury if…
raulcd Aug 27, 2024
ab38581
GH-43790: [Go][Parquet] Add support for LZ4_RAW compression codec (#4…
joellubi Aug 27, 2024
581a6db
MINOR: [CI] Use `docker compose` on self-hosted ARM builds (#43844)
pitrou Aug 27, 2024
5dcd5eb
GH-43805: [C++] Enable filesystem automatically when one of ARROW_{AZ…
kou Aug 27, 2024
fd3df37
MINOR: [Java] Logback dependency upgrade (#43842)
vibhatha Aug 28, 2024
18a8670
MINOR: [Java] Bump commons-cli:commons-cli from 1.8.0 to 1.9.0 in /ja…
dependabot[bot] Aug 28, 2024
a48f69b
MINOR: [Java] Bump com.google.api.grpc:proto-google-common-protos fro…
dependabot[bot] Aug 28, 2024
2fc8423
GH-43860: [Go][Parquet] Handle the error correctly (#43861)
bigsheeper Aug 28, 2024
4b9434e
GH-43854: [C++] Expose the set of device types where a ChunkedArray i…
felipecrv Aug 28, 2024
f460bcf
GH-38183: [CI][Python] Use pipx to install GCS testbench (#43852)
pitrou Aug 29, 2024
97d5b25
GH-43877: [Ruby] Add support for 0 decimal value (#43882)
kou Aug 29, 2024
ac4a714
GH-43870: [C++][Acero] Fix typos in join benchmark (#43871)
zanmato1984 Aug 29, 2024
4a7a421
GH-41696: [Python][Packaging] Bump MACOSX_DEPLOYMENT_TARGET to 12 ins…
raulcd Aug 29, 2024
962f98e
GH-43732: [Go] Require Go 1.22 or above (#43864)
haoxins Aug 29, 2024
8a79889
GH-43759: [C++] Acero: Minor code enhancement for Join (#43760)
mapleFU Aug 29, 2024
3446117
GH-43885: [C++][CI] Catch potential integer overflow in PoolBuffer (#…
pitrou Aug 29, 2024
b6c2237
GH-43869: [Java][CI] Flight related failure in the AMD64 Windows Serv…
vibhatha Aug 30, 2024
37d54cb
GH-43837: [Go][IPC] Consolidate StreamWriter and FileWriter, ensuring…
joellubi Aug 30, 2024
9ef2e74
MINOR: [JS] Bump @swc/helpers from 0.5.11 to 0.5.12 in /js (#43901)
dependabot[bot] Sep 1, 2024
1679f3c
GH-43665: [R] Remove references to bindings vignette (#43889)
nealrichardson Sep 1, 2024
373de67
MINOR: [JS] Bump ix from 6.0.0 to 7.0.0 in /js (#43898)
dependabot[bot] Sep 2, 2024
aa006c3
MINOR: [JS] Bump @typescript-eslint/eslint-plugin from 7.12.0 to 7.18…
dependabot[bot] Sep 2, 2024
576c8cc
MINOR: [R] Fix monospace formatting in dplyr-funcs-doc (#43461)
feinleib Sep 2, 2024
37909ba
GH-43894: [R] format_aggregation() should print options too (#43896)
nealrichardson Sep 2, 2024
732b104
GH-25118: [Python] Make NumPy an optional runtime dependency (#41904)
raulcd Sep 2, 2024
a16021a
GH-43758: [C++] Compute: More comment in RowEncoder (#43763)
mapleFU Sep 2, 2024
9ed05af
GH-43768: [C++] Fix the case when boolean_{any|all} meets constant in…
mapleFU Sep 2, 2024
7ed01ee
GH-43883: [CI] Remove Python version guard when installing GCS testbe…
pitrou Sep 2, 2024
319a37f
MINOR: [C#] Bump Grpc.Tools from 2.65.0 to 2.66.0 in /csharp (#43913)
dependabot[bot] Sep 2, 2024
4103604
MINOR: [C#] Bump Google.Protobuf from 3.27.3 to 3.28.0 in /csharp (#4…
dependabot[bot] Sep 2, 2024
d5b873a
GH-40216: [CI][Packaging][Python] Upload pyarrow nightly wheels to sc…
raulcd Sep 2, 2024
2c502aa
GH-43746: [C++] Add support for Boost 1.86 (#43766)
kou Sep 3, 2024
11bee6e
MINOR: [CI] Bump actions/setup-python from 5.1.1 to 5.2.0 (#43917)
dependabot[bot] Sep 3, 2024
6111912
GH-43797: [C++] Attach `arrow::ArrayStatistics` to `arrow::ArrayData`…
kou Sep 3, 2024
4ee3f3a
MINOR: [Java] Bump org.mockito:mockito-junit-jupiter from 5.12.0 to 5…
dependabot[bot] Sep 3, 2024
6b7ed7a
MINOR: [Java] Bump com.github.luben:zstd-jni from 1.5.6-4 to 1.5.6-5 …
dependabot[bot] Sep 3, 2024
6686afa
MINOR: [Java] Bump org.apache.orc:orc-core from 1.9.2 to 1.9.4 in /ja…
dependabot[bot] Sep 3, 2024
9657077
MINOR: [Java] Bump parquet.version from 1.14.1 to 1.14.2 in /java (#4…
dependabot[bot] Sep 3, 2024
34eaeb5
MINOR: [Java] Bump error_prone_core.version from 2.30.0 to 2.31.0 in …
dependabot[bot] Sep 3, 2024
4050087
GH-40216: [Python][CI][Packaging] Upload nightly wheels to main label…
jorisvandenbossche Sep 3, 2024
908b509
GH-43907: [C#][FlightRPC] Add Grpc Call Options support on Flight Cli…
qmmk Sep 3, 2024
22f16df
GH-43927: [C++] Make ChunkResolver::ResolveMany output a list of Chun…
felipecrv Sep 3, 2024
f4d3c56
GH-43719: [C++] Clarify the way SIMD-enabled agg kernels come from th…
felipecrv Sep 3, 2024
3ac2a29
GH-43933: [CI] Remove docker-compose warnings (#43934)
lysnikolaou Sep 3, 2024
b50ad30
GH-43902: [Java] Support for Long memory addresses (#43903)
vibhatha Sep 4, 2024
96b91c2
GH-43727: [Python] RecordBatch fails gracefully on non-cpu devices (#…
danepitkin Sep 4, 2024
ab6ddac
GH-40216: [Python][CI][Packaging] Don't upload sdist to scientific-py…
jorisvandenbossche Sep 4, 2024
bb646a3
GH-43669: [Docs][Dev] Document archery --debug flag in section about …
jorisvandenbossche Sep 4, 2024
63d3992
GH-43672: [C#] Schema should be optional on FlightInfo (#43673)
ndglover Sep 4, 2024
02c2c8a
GH-43728: [Python] ChunkedArray fails gracefully on non-cpu devices (…
danepitkin Sep 4, 2024
3ff350b
GH-38255: [Java] Implement Flight SQL Bulk Ingestion (#43551)
eramitmittal Sep 5, 2024
f6c58ec
GH-43952: [CI] Bump actions/{upload|download}-artifact from 3 to late…
dependabot[bot] Sep 5, 2024
caa368e
GH-43299: [Release][Packaging] Only include pyarrow folder when find…
raulcd Sep 5, 2024
790f892
GH-43967: [C++] Enhance error message for URI parsing (#43938)
CrystalZhou0529 Sep 5, 2024
981b841
GH-43946: [C++][Parquet] Guard against use of cleared decryptor/encry…
pitrou Sep 5, 2024
97172b6
GH-43969: [CI][Dev] Prune .dockerignore (#43971)
pitrou Sep 5, 2024
0b4066f
GH-40154: [C++][Parquet] Separate encoders and decoder (#43972)
pitrou Sep 5, 2024
f8d2f8f
MINOR: [CI][C++] Add C++ example builds to "cpp" Crossbow task group …
pitrou Sep 5, 2024
a95e612
GH-43944: [C++][Parquet] Add support for arrow::ArrayStatistics: non …
kou Sep 5, 2024
efb1551
GH-43796: [C++] Indent preprocessor directives (#43798)
kou Sep 6, 2024
9489c10
MINOR: [Java] Bump com.puppycrawl.tools:checkstyle from 10.17.0 to 10…
vibhatha Sep 6, 2024
b7959c1
GH-43979: [CI][C++][Dev] Add cpplint to pre-commit (#43982)
kou Sep 6, 2024
319efc3
GH-43712: [C++][Parquet] Dataset: Handle num-nulls in Parquet correct…
mapleFU Sep 6, 2024
df4d396
GH-43992: [C++] Add missing std::move() in array_nested.cc (#43993)
mapleFU Sep 6, 2024
e073859
GH-43814: [GLib][FlightRPC] Add `GAFlightServerClass::do_put` (#43999)
kou Sep 7, 2024
6b17f49
GH-43983: [C++][Parquet] Add support for arrow::ArrayStatistics: zero…
kou Sep 8, 2024
5ca26d7
GH-40860: [GLib][Parquet] Add `gparquet_arrow_file_writer_write_recor…
kou Sep 9, 2024
38b3770
GH-43966: [Java] Check for nullabilities when comparing StructVector …
hellishfire Sep 9, 2024
55130fa
GH-43986: [C++][Acero] Some code cleanup to `Grouper` (#43988)
zanmato1984 Sep 9, 2024
41d66d8
GH-43301: [C++][Parquet] Enhance the comment for ColumnReader/Decoder…
mapleFU Sep 9, 2024
5db9d64
GH-43536: [Python][CI] Add a Crossbow job with the free-threaded buil…
lysnikolaou Sep 9, 2024
1a949a1
MINOR: [CI] Bump actions/download-artifact from 4.1.7 to 4.1.8 (#44020)
dependabot[bot] Sep 9, 2024
cc5f90a
GH-44013: [Java] Consider warnings as errors for Dataset Module (#44014)
vibhatha Sep 10, 2024
e6e8585
GH-44011: [Java] Consider warnings as errors for C Module (#44012)
vibhatha Sep 10, 2024
8508c76
GH-43576: [Java] Gandiva Tests are failing due to linking issues (#43…
vibhatha Sep 10, 2024
ab90f74
GH-44036: [C++] IPC: ipc reader/writer code enhancement (#44019)
mapleFU Sep 10, 2024
6561bc0
GH-43996: [Java] Mark new allocated ArrowSchema as released (#43997)
viirya Sep 10, 2024
3f63b20
MINOR: [C#] Bump Microsoft.NET.Test.Sdk from 17.11.0 to 17.11.1 in /c…
dependabot[bot] Sep 10, 2024
3fb6e58
GH-44016: [Java] Consider warnings as errors for Format Module (#44017)
vibhatha Sep 10, 2024
3fe07d6
MINOR: [Java] Bump logback.version from 1.5.7 to 1.5.8 in /java (#44023)
dependabot[bot] Sep 10, 2024
8da5134
MINOR: [Java] Bump io.netty:netty-bom from 4.1.112.Final to 4.1.113.F…
dependabot[bot] Sep 10, 2024
9a36873
GH-43187: [C++] Support basic is_in predicate simplification (#43761)
larry98 Sep 10, 2024
b28d202
GH-43956: [Format] Allow Decimal32/Decimal64 in format (#43976)
zeroshade Sep 10, 2024
b1cf8b6
MINOR: [Java] Bump com.google.guava:guava-bom from 33.2.1-jre to 33.3…
dependabot[bot] Sep 10, 2024
2fc9dc1
MINOR: [Java] Bump checker.framework.version from 3.46.0 to 3.47.0 in…
dependabot[bot] Sep 10, 2024
d658f64
MINOR: [CI][C++] Enable core dumps and stack traces in Linux/macOS jo…
pitrou Sep 11, 2024
395ce07
GH-44044: [Java] Consider warnings as errors for Vector Module (#44045)
vibhatha Sep 11, 2024
0a4d5c1
GH-43962: [Java] Consider warnings as errors for Adapter Module (#43963)
vibhatha Sep 11, 2024
c53f430
GH-44006: [GLib][Parquet] Add `gparquet_arrow_file_writer_new_row_gro…
kou Sep 11, 2024
e4a6f1e
GH-44050: [CI][Integration] Execute integration test again (#44051)
kou Sep 11, 2024
8d5a775
GH-43973: [Python] Table fails gracefully on non-cpu devices (#43974)
danepitkin Sep 11, 2024
d4b38fd
GH-32538: [C++][Parquet] Add JSON canonical extension type (#13901)
progger-dev Sep 11, 2024
89c08a4
GH-36412: [Python][CI] Fix deprecation warning about day freq alias w…
jorisvandenbossche Sep 11, 2024
7c6c42d
MINOR: [Java] Bump com.gradle:common-custom-user-data-maven-extension…
dependabot[bot] Sep 12, 2024
837a3e2
GH-43748: [R] Handle package_version in safe_r_metadata (#43895)
nealrichardson Sep 12, 2024
0f9ed84
GH-44063: [Python] Deprecate the no longer used serialize/deserialize…
jorisvandenbossche Sep 12, 2024
002b301
GH-44072: [C++][Parquet] Add Float16 reading benchmarks (#44073)
pitrou Sep 12, 2024
a76ab32
GH-44081: [C++][Parquet] Fix reported metrics in parquet-arrow-reader…
pitrou Sep 12, 2024
5fd9d74
GH-44076: [CI] Remove verify-rc-binaries-wheel-macos-11 which is now …
raulcd Sep 12, 2024
1fe30d3
GH-44046: [Python] Fix threading issues with borrowed refs and pandas…
lysnikolaou Sep 12, 2024
d2dd352
MINOR: [CI] Bump actions/{download,upload}-artifact version (#44086)
pitrou Sep 12, 2024
ed8585e
GH-43840: [CI] Add cuda group to tasks.yml and minor updates for new …
raulcd Sep 12, 2024
a6b718e
GH-42247: [C++] Support casting to and from utf8_view/binary_view (#4…
felipecrv Sep 12, 2024
bd8866a
GH-44079: [C++][Parquet] Remove deprecated APIs (#44080)
pitrou Sep 12, 2024
5779318
MINOR: [Docs] Remove mention of JIRA issues in the contributing PR ch…
jorisvandenbossche Sep 13, 2024
0940ae8
fix spacing and version issues
khwilson Sep 14, 2024
31 changes: 24 additions & 7 deletions docs/source/format/Integration.rst
@@ -390,20 +390,37 @@ but can be of any type.

Extension types are, as in the IPC format, represented as their underlying
storage type plus some dedicated field metadata to reconstruct the extension
type. For example, assuming a "uuid" extension type backed by a
FixedSizeBinary(16) storage, here is how a "uuid" field would be represented::
type. For example, assuming a "rational" extension type backed by a
``struct<numer: int32, denom: int32>`` storage, here is how a "rational" field
would be represented::

{
"name" : "name_of_the_field",
"nullable" : /* boolean */,
"type" : {
"name" : "fixedsizebinary",
"byteWidth" : 16
"name" : "struct"
},
"children" : [],
"children" : [
{
"name": "numer",
"type": {
"name": "int",
"bitWidth": 32,
"isSigned": true
}
},
{
"name": "denom",
"type": {
"name": "int",
"bitWidth": 32,
"isSigned": true
}
}
],
"metadata" : [
{"key": "ARROW:extension:name", "value": "uuid"},
{"key": "ARROW:extension:metadata", "value": "uuid-serialized"}
{"key": "ARROW:extension:name", "value": "rational"},
{"key": "ARROW:extension:metadata", "value": "rational-serialized"}
]
}
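The representation above can also be produced programmatically. The following is a minimal sketch using only the standard library; the helper names `int32_field` and `rational_field` are hypothetical, and `nullable=True` is an assumed value for the boolean left open in the example.

```python
import json


def int32_field(name):
    """Child field as shown above: a signed 32-bit integer."""
    return {
        "name": name,
        "type": {"name": "int", "bitWidth": 32, "isSigned": True},
    }


def rational_field(field_name, nullable=True):
    """Build the JSON field for a "rational" extension type.

    The storage type is a struct with two int32 children; the two
    metadata entries are what allow a reader to rebuild the extension type.
    """
    return {
        "name": field_name,
        "nullable": nullable,  # assumed value; the example leaves it open
        "type": {"name": "struct"},
        "children": [int32_field("numer"), int32_field("denom")],
        "metadata": [
            {"key": "ARROW:extension:name", "value": "rational"},
            {"key": "ARROW:extension:metadata", "value": "rational-serialized"},
        ],
    }


field = rational_field("name_of_the_field")
print(json.dumps(field, indent=2))
```

Note that nothing in the `type` or `children` entries mentions the extension type; only the `metadata` list does, which is why a reader that ignores the metadata still sees a plain struct.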

88 changes: 56 additions & 32 deletions docs/source/python/extending_types.rst
@@ -131,58 +131,82 @@ and serialization mechanism. The extension name and serialized metadata
can potentially be recognized by other (non-Python) Arrow implementations
such as PySpark.

For example, we could define a custom UUID type for 128-bit numbers which can
be represented as ``FixedSizeBinary`` type with 16 bytes::
For example, we could define a custom rational type for fractions which can
be represented as a pair of integers::

class UuidType(pa.ExtensionType):
import pyarrow as pa
import pyarrow.types as pt

class RationalType(pa.ExtensionType):

def __init__(self):
super().__init__(pa.binary(16), "my_package.uuid")

def __arrow_ext_serialize__(self):
# Since we don't have a parameterized type, we don't need extra
# metadata to be deserialized
return b''
super().__init__(
pa.struct(
[
("numer", pa.int32()),
("denom", pa.int32()),
],
),
"my_package.rational",
)

def __arrow_ext_serialize__(self) -> bytes:
# No serialized metadata necessary
return b""

@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
def __arrow_ext_deserialize__(self, storage_type, serialized):
# Sanity checks, not required but illustrate the method signature.
assert storage_type == pa.binary(16)
assert pt.is_struct(storage_type)
assert pt.is_int32(storage_type[0].type)
assert serialized == b''
# Return an instance of this subclass given the serialized
# metadata.
return UuidType()

# return an instance of this subclass given the serialized
# metadata
return RationalType()


The special methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``
define the serialization of an extension type instance. For non-parametric
types such as the above, the serialization payload can be left empty.
define the serialization of an extension type instance.
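For contrast with the empty payload used by ``RationalType``, here is a rough sketch of what the bytes exchanged by these two methods might look like for a parameterized type. The helper names `serialize_params` and `deserialize_params`, the `bit_width` parameter, and the choice of JSON are all assumptions made for illustration; the only requirement taken from the source is that the serialized form is a ``bytes`` object.

```python
import json


def serialize_params(bit_width):
    # What a parameterized __arrow_ext_serialize__ could return:
    # an opaque bytes payload describing the type's parameters.
    return json.dumps({"bit_width": bit_width}).encode("utf-8")


def deserialize_params(serialized):
    # What the matching __arrow_ext_deserialize__ could do with the
    # payload before constructing the type instance.
    return json.loads(serialized.decode("utf-8"))["bit_width"]


payload = serialize_params(32)
print(payload)
print(deserialize_params(payload))
```

The non-parametric ``RationalType`` defined above needs none of this and simply returns ``b""``.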

This can now be used to create arrays and tables holding the extension type::

>>> uuid_type = UuidType()
>>> uuid_type.extension_name
'my_package.uuid'
>>> uuid_type.storage_type
FixedSizeBinaryType(fixed_size_binary[16])

>>> import uuid
>>> storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
>>> arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
>>> rational_type = RationalType()
>>> rational_type.extension_name
'my_package.rational'
>>> rational_type.storage_type
StructType(struct<numer: int32, denom: int32>)

>>> storage_array = pa.array(
... [
... {"numer": 10, "denom": 17},
... {"numer": 20, "denom": 13},
... ],
... type=rational_type.storage_type
... )
>>> arr = rational_type.wrap_array(storage_array)
>>> arr = pa.ExtensionArray.from_storage(rational_type, storage_array)
>>> arr
<pyarrow.lib.ExtensionArray object at 0x7f75c2f300a0>
<pyarrow.lib.ExtensionArray object at 0x1067f5420>
-- is_valid: all not null
-- child 0 type: int32
[
10,
20
]
-- child 1 type: int32
[
A6861959108644B797664AEEE686B682,
718747F48E5F4058A7261E2B6B228BE8,
7FE201227D624D96A5CD8639DEF2A68B,
C6CA8C7F95744BFD9462A40B3F57A86C
17,
13
]
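The storage only holds pairs of integers; interpreting them as fractions is left to the application. As a small sketch of that step, the rows below are written out by hand in the list-of-dicts shape that ``arr.storage.to_pylist()`` is assumed to return for the struct storage, and the helper `to_fractions` is a hypothetical name.

```python
from fractions import Fraction

# Hand-written stand-in for arr.storage.to_pylist() (assumed shape).
rows = [
    {"numer": 10, "denom": 17},
    {"numer": 20, "denom": 13},
]


def to_fractions(rows):
    """Convert storage rows to Fraction objects, passing nulls through."""
    return [
        None if row is None else Fraction(row["numer"], row["denom"])
        for row in rows
    ]


fractions = to_fractions(rows)
print(fractions)
print(sum(fractions))
```

Nothing in the extension type itself enforces a non-zero denominator, so a real conversion would probably want to validate ``denom`` before building the ``Fraction``.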

This array can be included in RecordBatches, sent over IPC and received in
another Python process. The receiving process must explicitly register the
extension type for deserialization, otherwise it will fall back to the
storage type::

>>> pa.register_extension_type(UuidType())
>>> pa.register_extension_type(RationalType())

For example, creating a RecordBatch and writing it to a stream using the
IPC protocol::
@@ -198,10 +222,10 @@ and then reading it back yields the proper type::
>>> with pa.ipc.open_stream(buf) as reader:
... result = reader.read_all()
>>> result.column('ext').type
UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
RationalType(StructType(struct<numer: int32, denom: int32>))

The receiving application doesn't need to be Python but can still recognize
the extension type as a "my_package.uuid" type, if it has implemented its own
the extension type as a "my_package.rational" type, if it has implemented its own
extension type to receive it. If the type is not registered in the receiving
application, it will fall back to the storage type.
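The fallback rule can be pictured with a toy model. This is not pyarrow's implementation: the `registry` dict, the `register` and `resolve` helpers, and the use of plain strings for types are all hypothetical stand-ins. It only illustrates that lookup is keyed by the extension name carried in the field metadata, and that an unknown name yields the storage type unchanged.

```python
# Toy registry: extension name -> function rebuilding the extension type.
registry = {}


def register(extension_name, deserialize):
    registry[extension_name] = deserialize


def resolve(storage_type, metadata):
    """Return the extension type if its name is registered, else the storage type."""
    deserialize = registry.get(metadata.get("ARROW:extension:name"))
    if deserialize is None:
        return storage_type  # fall back to the storage type
    return deserialize(storage_type, metadata.get("ARROW:extension:metadata", ""))


def rational_from_storage(storage_type, serialized):
    # Stand-in for __arrow_ext_deserialize__: the result depends on the
    # storage type it is handed, not on the instance used at registration.
    return f"RationalType({storage_type})"


metadata = {
    "ARROW:extension:name": "my_package.rational",
    "ARROW:extension:metadata": "",
}
int32_storage = "struct<numer: int32, denom: int32>"
int64_storage = "struct<numer: int64, denom: int64>"

unregistered = resolve(int32_storage, metadata)
register("my_package.rational", rational_from_storage)
registered_32 = resolve(int32_storage, metadata)
registered_64 = resolve(int64_storage, metadata)
print(unregistered, registered_32, registered_64, sep="\n")
```

Because the lookup uses only the name, a single registration covers every storage type the deserializer can handle.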

166 changes: 104 additions & 62 deletions python/pyarrow/types.pxi
@@ -1618,59 +1618,79 @@ cdef class ExtensionType(BaseExtensionType):

Examples
--------
Define a UuidType extension type subclassing ExtensionType:
Define a RationalType extension type subclassing ExtensionType:

>>> import pyarrow as pa
>>> class UuidType(pa.ExtensionType):
... def __init__(self):
... pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
... def __arrow_ext_serialize__(self):
... # since we don't have a parameterized type, we don't need extra
... # metadata to be deserialized
... return b''
... @classmethod
... def __arrow_ext_deserialize__(self, storage_type, serialized):
... # return an instance of this subclass given the serialized
... # metadata.
... return UuidType()
...
>>> import pyarrow.types as pt
>>> class RationalType(pa.ExtensionType):
... def __init__(self, data_type: pa.DataType):
... if not pt.is_integer(data_type):
... raise TypeError(f"data_type must be an integer type not {data_type}")
... super().__init__(
... pa.struct(
... [
... ("numer", data_type),
... ("denom", data_type),
... ],
... ),
... # N.B. This name does _not_ reference `data_type` so deserialization
... # will work for _any_ integer `data_type` after registration
... "my_package.rational",
... )
... def __arrow_ext_serialize__(self) -> bytes:
... # No serialized metadata necessary
... return b""
... @classmethod
... def __arrow_ext_deserialize__(cls, storage_type, serialized):
... # return an instance of this subclass given the serialized
... # metadata
... return RationalType(storage_type[0].type)

Register the extension type:

>>> pa.register_extension_type(UuidType())
>>> pa.register_extension_type(RationalType(pa.int64()))

Create an instance of UuidType extension type:
Create an instance of RationalType extension type:

>>> uuid_type = UuidType()
>>> rational_type = RationalType(pa.int32())

Inspect the extension type:

>>> uuid_type.extension_name
'my_package.uuid'
>>> uuid_type.storage_type
FixedSizeBinaryType(fixed_size_binary[16])
>>> rational_type.extension_name
'my_package.rational'
>>> rational_type.storage_type
StructType(struct<numer: int32, denom: int32>)

Wrap an array as an extension array:

>>> import uuid
>>> storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
>>> uuid_type.wrap_array(storage_array)
>>> storage_array = pa.array(
... [
... {"numer": 10, "denom": 17},
... {"numer": 20, "denom": 13},
... ],
... type=rational_type.storage_type
... )
>>> rational_type.wrap_array(storage_array)
<pyarrow.lib.ExtensionArray object at ...>
[
-- is_valid: all not null
...
@ianmcook ianmcook Sep 13, 2024

It looks like this failure is happening because this shows the ellipsis instead of the full output: https://github.com/apache/arrow/actions/runs/10762310741/job/29849821115?pr=43849#step:6:9145
(as mentioned in your other comment)

]

Or do the same with creating an ExtensionArray:

>>> pa.ExtensionArray.from_storage(uuid_type, storage_array)
>>> pa.ExtensionArray.from_storage(rational_type, storage_array)
<pyarrow.lib.ExtensionArray object at ...>
[
-- is_valid: all not null
...
]

Unregister the extension type:

>>> pa.unregister_extension_type("my_package.uuid")
>>> pa.unregister_extension_type("my_package.rational")

Note that even though we registered the concrete type
``RationalType(pa.int64())``, pyarrow will be able to deserialize
``RationalType(integer_type)`` for any ``integer_type`` as the deserializer
will reference the name ``my_package.rational`` and the ``@classmethod``
``__arrow_ext_deserialize__``.
"""

def __cinit__(self):
@@ -2039,30 +2059,41 @@ def register_extension_type(ext_type):

Examples
--------
Define a UuidType extension type subclassing ExtensionType:
Define a RationalType extension type subclassing ExtensionType:

>>> import pyarrow as pa
>>> class UuidType(pa.ExtensionType):
... def __init__(self):
... pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
... def __arrow_ext_serialize__(self):
... # since we don't have a parameterized type, we don't need extra
... # metadata to be deserialized
... return b''
... @classmethod
... def __arrow_ext_deserialize__(self, storage_type, serialized):
... # return an instance of this subclass given the serialized
... # metadata.
... return UuidType()
...
>>> import pyarrow.types as pt
>>> class RationalType(pa.ExtensionType):
... def __init__(self, data_type: pa.DataType):
... if not pt.is_integer(data_type):
... raise TypeError(f"data_type must be an integer type not {data_type}")
... super().__init__(
... pa.struct(
... [
... ("numer", data_type),
... ("denom", data_type),
... ],
... ),
... # N.B. This name does _not_ reference `data_type` so deserialization
... # will work for _any_ integer `data_type` after registration
... "my_package.rational",
... )
... def __arrow_ext_serialize__(self) -> bytes:
... # No serialized metadata necessary
... return b""
... @classmethod
... def __arrow_ext_deserialize__(cls, storage_type, serialized):
... # return an instance of this subclass given the serialized
... # metadata
... return RationalType(storage_type[0].type)

Register the extension type:

>>> pa.register_extension_type(UuidType())
>>> pa.register_extension_type(RationalType(pa.int64()))

Unregister the extension type:

>>> pa.unregister_extension_type("my_package.uuid")
>>> pa.unregister_extension_type("my_package.rational")
"""
cdef:
DataType _type = ensure_type(ext_type, allow_none=False)
@@ -2089,30 +2120,41 @@ def unregister_extension_type(type_name):

Examples
--------
Define a UuidType extension type subclassing ExtensionType:
Define a RationalType extension type subclassing ExtensionType:

>>> import pyarrow as pa
>>> class UuidType(pa.ExtensionType):
... def __init__(self):
... pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
... def __arrow_ext_serialize__(self):
... # since we don't have a parameterized type, we don't need extra
... # metadata to be deserialized
... return b''
... @classmethod
... def __arrow_ext_deserialize__(self, storage_type, serialized):
... # return an instance of this subclass given the serialized
... # metadata.
... return UuidType()
...
>>> import pyarrow.types as pt
>>> class RationalType(pa.ExtensionType):
... def __init__(self, data_type: pa.DataType):
... if not pt.is_integer(data_type):
... raise TypeError(f"data_type must be an integer type not {data_type}")
... super().__init__(
... pa.struct(
... [
... ("numer", data_type),
... ("denom", data_type),
... ],
... ),
... # N.B. This name does _not_ reference `data_type` so deserialization
... # will work for _any_ integer `data_type` after registration
... "my_package.rational",
... )
... def __arrow_ext_serialize__(self) -> bytes:
... # No serialized metadata necessary
... return b""
... @classmethod
... def __arrow_ext_deserialize__(cls, storage_type, serialized):
... # return an instance of this subclass given the serialized
... # metadata
... return RationalType(storage_type[0].type)

Register the extension type:

>>> pa.register_extension_type(UuidType())
>>> pa.register_extension_type(RationalType(pa.int64()))

Unregister the extension type:

>>> pa.unregister_extension_type("my_package.uuid")
>>> pa.unregister_extension_type("my_package.rational")
"""
cdef:
c_string c_type_name = tobytes(type_name)