WIP: [Docs] Update extension type examples to not use UUID #43849

Closed
wants to merge 130 commits into from
Changes from 13 commits
130 commits
4db2ca0
UuidType -> RationalType in the docs
khwilson Aug 27, 2024
4d95f22
fix json formatting
khwilson Aug 27, 2024
fde3215
fix some typos
khwilson Aug 27, 2024
d875d2c
Update docs/source/python/extending_types.rst
khwilson Aug 28, 2024
eb69055
Update docs/source/python/extending_types.rst
khwilson Aug 28, 2024
3fc6842
response to ianmcook
khwilson Aug 28, 2024
0e051ea
define parameters
khwilson Aug 31, 2024
8614fd5
Update python/pyarrow/types.pxi
khwilson Sep 8, 2024
03b3fd9
Update docs/source/python/extending_types.rst
khwilson Sep 8, 2024
2abdcb8
more edits
khwilson Sep 8, 2024
4c66e4d
missed one formatting
khwilson Sep 8, 2024
253156b
import pyarrow in doctests examples
khwilson Sep 8, 2024
da6cdf2
import pyarrow in more doctests examples
khwilson Sep 8, 2024
d31d59b
MINOR: [Java] Bump dep.slf4j.version from 2.0.13 to 2.0.16 in /java (…
dependabot[bot] Aug 26, 2024
8855c59
MINOR: [R] Add missing PR num to news.md item (#43811)
amoeba Aug 26, 2024
7bc2e01
MINOR: [Java] Bump dep.junit.jupiter.version from 5.10.3 to 5.11.0 in…
dependabot[bot] Aug 26, 2024
20f8357
GH-15058: [C++][Python] Native support for UUID (#37298)
rok Aug 26, 2024
95bce2e
MINOR: [Go] Bump github.com/hamba/avro/v2 from 2.24.1 to 2.25.0 in /g…
dependabot[bot] Aug 27, 2024
e8912c9
GH-43667: [Java] Keeping Flight default header size consistent betwee…
PANKAJ9768 Aug 27, 2024
aa8950f
MINOR: [Go] Bump github.com/substrait-io/substrait-go from 0.6.0 to 0…
dependabot[bot] Aug 27, 2024
db0029f
MINOR: [Java] Downgrade gRPC to 1.65 (#43839)
lidavidm Aug 27, 2024
5b98125
MINOR: [Java] Bump org.apache.commons:commons-compress from 1.27.0 to…
dependabot[bot] Aug 27, 2024
ca4c756
MINOR: [C#] Bump Microsoft.NET.Test.Sdk from 17.10.0 to 17.11.0 in /c…
dependabot[bot] Aug 27, 2024
909ae17
GH-41056: [GLib][FlightRPC] Add gaflight_client_do_put() and related …
kou Aug 27, 2024
ef33625
GH-43815: [CI][Packaging][Python] Avoid uploading wheel to gemfury if…
raulcd Aug 27, 2024
ab38581
GH-43790: [Go][Parquet] Add support for LZ4_RAW compression codec (#4…
joellubi Aug 27, 2024
581a6db
MINOR: [CI] Use `docker compose` on self-hosted ARM builds (#43844)
pitrou Aug 27, 2024
5dcd5eb
GH-43805: [C++] Enable filesystem automatically when one of ARROW_{AZ…
kou Aug 27, 2024
fd3df37
MINOR: [Java] Logback dependency upgrade (#43842)
vibhatha Aug 28, 2024
18a8670
MINOR: [Java] Bump commons-cli:commons-cli from 1.8.0 to 1.9.0 in /ja…
dependabot[bot] Aug 28, 2024
a48f69b
MINOR: [Java] Bump com.google.api.grpc:proto-google-common-protos fro…
dependabot[bot] Aug 28, 2024
2fc8423
GH-43860: [Go][Parquet] Handle the error correctly (#43861)
bigsheeper Aug 28, 2024
4b9434e
GH-43854: [C++] Expose the set of device types where a ChunkedArray i…
felipecrv Aug 28, 2024
f460bcf
GH-38183: [CI][Python] Use pipx to install GCS testbench (#43852)
pitrou Aug 29, 2024
97d5b25
GH-43877: [Ruby] Add support for 0 decimal value (#43882)
kou Aug 29, 2024
ac4a714
GH-43870: [C++][Acero] Fix typos in join benchmark (#43871)
zanmato1984 Aug 29, 2024
4a7a421
GH-41696: [Python][Packaging] Bump MACOSX_DEPLOYMENT_TARGET to 12 ins…
raulcd Aug 29, 2024
962f98e
GH-43732: [Go] Require Go 1.22 or above (#43864)
haoxins Aug 29, 2024
8a79889
GH-43759: [C++] Acero: Minor code enhancement for Join (#43760)
mapleFU Aug 29, 2024
3446117
GH-43885: [C++][CI] Catch potential integer overflow in PoolBuffer (#…
pitrou Aug 29, 2024
b6c2237
GH-43869: [Java][CI] Flight related failure in the AMD64 Windows Serv…
vibhatha Aug 30, 2024
37d54cb
GH-43837: [Go][IPC] Consolidate StreamWriter and FileWriter, ensuring…
joellubi Aug 30, 2024
9ef2e74
MINOR: [JS] Bump @swc/helpers from 0.5.11 to 0.5.12 in /js (#43901)
dependabot[bot] Sep 1, 2024
1679f3c
GH-43665: [R] Remove references to bindings vignette (#43889)
nealrichardson Sep 1, 2024
373de67
MINOR: [JS] Bump ix from 6.0.0 to 7.0.0 in /js (#43898)
dependabot[bot] Sep 2, 2024
aa006c3
MINOR: [JS] Bump @typescript-eslint/eslint-plugin from 7.12.0 to 7.18…
dependabot[bot] Sep 2, 2024
576c8cc
MINOR: [R] Fix monospace formatting in dplyr-funcs-doc (#43461)
feinleib Sep 2, 2024
37909ba
GH-43894: [R] format_aggregation() should print options too (#43896)
nealrichardson Sep 2, 2024
732b104
GH-25118: [Python] Make NumPy an optional runtime dependency (#41904)
raulcd Sep 2, 2024
a16021a
GH-43758: [C++] Compute: More comment in RowEncoder (#43763)
mapleFU Sep 2, 2024
9ed05af
GH-43768: [C++] Fix the case when boolean_{any|all} meets constant in…
mapleFU Sep 2, 2024
7ed01ee
GH-43883: [CI] Remove Python version guard when installing GCS testbe…
pitrou Sep 2, 2024
319a37f
MINOR: [C#] Bump Grpc.Tools from 2.65.0 to 2.66.0 in /csharp (#43913)
dependabot[bot] Sep 2, 2024
4103604
MINOR: [C#] Bump Google.Protobuf from 3.27.3 to 3.28.0 in /csharp (#4…
dependabot[bot] Sep 2, 2024
d5b873a
GH-40216: [CI][Packaging][Python] Upload pyarrow nightly wheels to sc…
raulcd Sep 2, 2024
2c502aa
GH-43746: [C++] Add support for Boost 1.86 (#43766)
kou Sep 3, 2024
11bee6e
MINOR: [CI] Bump actions/setup-python from 5.1.1 to 5.2.0 (#43917)
dependabot[bot] Sep 3, 2024
6111912
GH-43797: [C++] Attach `arrow::ArrayStatistics` to `arrow::ArrayData`…
kou Sep 3, 2024
4ee3f3a
MINOR: [Java] Bump org.mockito:mockito-junit-jupiter from 5.12.0 to 5…
dependabot[bot] Sep 3, 2024
6b7ed7a
MINOR: [Java] Bump com.github.luben:zstd-jni from 1.5.6-4 to 1.5.6-5 …
dependabot[bot] Sep 3, 2024
6686afa
MINOR: [Java] Bump org.apache.orc:orc-core from 1.9.2 to 1.9.4 in /ja…
dependabot[bot] Sep 3, 2024
9657077
MINOR: [Java] Bump parquet.version from 1.14.1 to 1.14.2 in /java (#4…
dependabot[bot] Sep 3, 2024
34eaeb5
MINOR: [Java] Bump error_prone_core.version from 2.30.0 to 2.31.0 in …
dependabot[bot] Sep 3, 2024
4050087
GH-40216: [Python][CI][Packaging] Upload nightly wheels to main label…
jorisvandenbossche Sep 3, 2024
908b509
GH-43907: [C#][FlightRPC] Add Grpc Call Options support on Flight Cli…
qmmk Sep 3, 2024
22f16df
GH-43927: [C++] Make ChunkResolver::ResolveMany output a list of Chun…
felipecrv Sep 3, 2024
f4d3c56
GH-43719: [C++] Clarify the way SIMD-enabled agg kernels come from th…
felipecrv Sep 3, 2024
3ac2a29
GH-43933: [CI] Remove docker-compose warnings (#43934)
lysnikolaou Sep 3, 2024
b50ad30
GH-43902: [Java] Support for Long memory addresses (#43903)
vibhatha Sep 4, 2024
96b91c2
GH-43727: [Python] RecordBatch fails gracefully on non-cpu devices (#…
danepitkin Sep 4, 2024
ab6ddac
GH-40216: [Python][CI][Packaging] Don't upload sdist to scientific-py…
jorisvandenbossche Sep 4, 2024
bb646a3
GH-43669: [Docs][Dev] Document archery --debug flag in section about …
jorisvandenbossche Sep 4, 2024
63d3992
GH-43672: [C#] Schema should be optional on FlightInfo (#43673)
ndglover Sep 4, 2024
02c2c8a
GH-43728: [Python] ChunkedArray fails gracefully on non-cpu devices (…
danepitkin Sep 4, 2024
3ff350b
GH-38255: [Java] Implement Flight SQL Bulk Ingestion (#43551)
eramitmittal Sep 5, 2024
f6c58ec
GH-43952: [CI] Bump actions/{upload|download}-artifact from 3 to late…
dependabot[bot] Sep 5, 2024
caa368e
GH-43299: [Release][Packaging] Only include pyarrow folder when find…
raulcd Sep 5, 2024
790f892
GH-43967: [C++] Enhance error message for URI parsing (#43938)
CrystalZhou0529 Sep 5, 2024
981b841
GH-43946: [C++][Parquet] Guard against use of cleared decryptor/encry…
pitrou Sep 5, 2024
97172b6
GH-43969: [CI][Dev] Prune .dockerignore (#43971)
pitrou Sep 5, 2024
0b4066f
GH-40154: [C++][Parquet] Separate encoders and decoder (#43972)
pitrou Sep 5, 2024
f8d2f8f
MINOR: [CI][C++] Add C++ example builds to "cpp" Crossbow task group …
pitrou Sep 5, 2024
a95e612
GH-43944: [C++][Parquet] Add support for arrow::ArrayStatistics: non …
kou Sep 5, 2024
efb1551
GH-43796: [C++] Indent preprocessor directives (#43798)
kou Sep 6, 2024
9489c10
MINOR: [Java] Bump com.puppycrawl.tools:checkstyle from 10.17.0 to 10…
vibhatha Sep 6, 2024
b7959c1
GH-43979: [CI][C++][Dev] Add cpplint to pre-commit (#43982)
kou Sep 6, 2024
319efc3
GH-43712: [C++][Parquet] Dataset: Handle num-nulls in Parquet correct…
mapleFU Sep 6, 2024
df4d396
GH-43992: [C++] Add missing std::move() in array_nested.cc (#43993)
mapleFU Sep 6, 2024
e073859
GH-43814: [GLib][FlightRPC] Add `GAFlightServerClass::do_put` (#43999)
kou Sep 7, 2024
6b17f49
GH-43983: [C++][Parquet] Add support for arrow::ArrayStatistics: zero…
kou Sep 8, 2024
5ca26d7
GH-40860: [GLib][Parquet] Add `gparquet_arrow_file_writer_write_recor…
kou Sep 9, 2024
38b3770
GH-43966: [Java] Check for nullabilities when comparing StructVector …
hellishfire Sep 9, 2024
55130fa
GH-43986: [C++][Acero] Some code cleanup to `Grouper` (#43988)
zanmato1984 Sep 9, 2024
41d66d8
GH-43301: [C++][Parquet] Enhance the comment for ColumnReader/Decoder…
mapleFU Sep 9, 2024
5db9d64
GH-43536: [Python][CI] Add a Crossbow job with the free-threaded buil…
lysnikolaou Sep 9, 2024
1a949a1
MINOR: [CI] Bump actions/download-artifact from 4.1.7 to 4.1.8 (#44020)
dependabot[bot] Sep 9, 2024
cc5f90a
GH-44013: [Java] Consider warnings as errors for Dataset Module (#44014)
vibhatha Sep 10, 2024
e6e8585
GH-44011: [Java] Consider warnings as errors for C Module (#44012)
vibhatha Sep 10, 2024
8508c76
GH-43576: [Java] Gandiva Tests are failing due to linking issues (#43…
vibhatha Sep 10, 2024
ab90f74
GH-44036: [C++] IPC: ipc reader/writer code enhancement (#44019)
mapleFU Sep 10, 2024
6561bc0
GH-43996: [Java] Mark new allocated ArrowSchema as released (#43997)
viirya Sep 10, 2024
3f63b20
MINOR: [C#] Bump Microsoft.NET.Test.Sdk from 17.11.0 to 17.11.1 in /c…
dependabot[bot] Sep 10, 2024
3fb6e58
GH-44016: [Java] Consider warnings as errors for Format Module (#44017)
vibhatha Sep 10, 2024
3fe07d6
MINOR: [Java] Bump logback.version from 1.5.7 to 1.5.8 in /java (#44023)
dependabot[bot] Sep 10, 2024
8da5134
MINOR: [Java] Bump io.netty:netty-bom from 4.1.112.Final to 4.1.113.F…
dependabot[bot] Sep 10, 2024
9a36873
GH-43187: [C++] Support basic is_in predicate simplification (#43761)
larry98 Sep 10, 2024
b28d202
GH-43956: [Format] Allow Decimal32/Decimal64 in format (#43976)
zeroshade Sep 10, 2024
b1cf8b6
MINOR: [Java] Bump com.google.guava:guava-bom from 33.2.1-jre to 33.3…
dependabot[bot] Sep 10, 2024
2fc9dc1
MINOR: [Java] Bump checker.framework.version from 3.46.0 to 3.47.0 in…
dependabot[bot] Sep 10, 2024
d658f64
MINOR: [CI][C++] Enable core dumps and stack traces in Linux/macOS jo…
pitrou Sep 11, 2024
395ce07
GH-44044: [Java] Consider warnings as errors for Vector Module (#44045)
vibhatha Sep 11, 2024
0a4d5c1
GH-43962: [Java] Consider warnings as errors for Adapter Module (#43963)
vibhatha Sep 11, 2024
c53f430
GH-44006: [GLib][Parquet] Add `gparquet_arrow_file_writer_new_row_gro…
kou Sep 11, 2024
e4a6f1e
GH-44050: [CI][Integration] Execute integration test again (#44051)
kou Sep 11, 2024
8d5a775
GH-43973: [Python] Table fails gracefully on non-cpu devices (#43974)
danepitkin Sep 11, 2024
d4b38fd
GH-32538: [C++][Parquet] Add JSON canonical extension type (#13901)
progger-dev Sep 11, 2024
89c08a4
GH-36412: [Python][CI] Fix deprecation warning about day freq alias w…
jorisvandenbossche Sep 11, 2024
7c6c42d
MINOR: [Java] Bump com.gradle:common-custom-user-data-maven-extension…
dependabot[bot] Sep 12, 2024
837a3e2
GH-43748: [R] Handle package_version in safe_r_metadata (#43895)
nealrichardson Sep 12, 2024
0f9ed84
GH-44063: [Python] Deprecate the no longer used serialize/deserialize…
jorisvandenbossche Sep 12, 2024
002b301
GH-44072: [C++][Parquet] Add Float16 reading benchmarks (#44073)
pitrou Sep 12, 2024
a76ab32
GH-44081: [C++][Parquet] Fix reported metrics in parquet-arrow-reader…
pitrou Sep 12, 2024
5fd9d74
GH-44076: [CI] Remove verify-rc-binaries-wheel-macos-11 which is now …
raulcd Sep 12, 2024
1fe30d3
GH-44046: [Python] Fix threading issues with borrowed refs and pandas…
lysnikolaou Sep 12, 2024
d2dd352
MINOR: [CI] Bump actions/{download,upload}-artifact version (#44086)
pitrou Sep 12, 2024
ed8585e
GH-43840: [CI] Add cuda group to tasks.yml and minor updates for new …
raulcd Sep 12, 2024
a6b718e
GH-42247: [C++] Support casting to and from utf8_view/binary_view (#4…
felipecrv Sep 12, 2024
bd8866a
GH-44079: [C++][Parquet] Remove deprecated APIs (#44080)
pitrou Sep 12, 2024
5779318
MINOR: [Docs] Remove mention of JIRA issues in the contributing PR ch…
jorisvandenbossche Sep 13, 2024
0940ae8
fix spacing and version issues
khwilson Sep 14, 2024
12 changes: 6 additions & 6 deletions docs/source/format/Columnar.rst
@@ -1596,12 +1596,12 @@ structure. These extension keys are:
they should not be used for third-party extension types.

This extension metadata can annotate any of the built-in Arrow logical
types. The intent is that an implementation that does not support an
extension type can still handle the underlying data. For example a
16-byte UUID value could be embedded in ``FixedSizeBinary(16)``, and
implementations that do not have this extension type can still work
with the underlying binary values and pass along the
``custom_metadata`` in subsequent Arrow protocol messages.
types. For example, Arrow specifies a canonical extension type that
represents a UUID as a ``FixedSizeBinary(16)``. Arrow implementations are
not required to support canonical extensions, so an implementation that
does not support this UUID type will simply interpret it as a
``FixedSizeBinary(16)`` and pass along the ``custom_metadata`` in
subsequent Arrow protocol messages.
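
The fallback described above can be illustrated without any Arrow library. The following is a minimal sketch, assuming a producer that annotates a ``FixedSizeBinary(16)`` field with the canonical ``arrow.uuid`` extension name; the ``field_metadata`` dictionary and variable names are hypothetical and only stand in for the field's ``custom_metadata``.

```python
import uuid

# Hypothetical stand-in for the Field's custom_metadata (illustration only).
field_metadata = {"ARROW:extension:name": "arrow.uuid"}

value = uuid.UUID("12345678-1234-5678-1234-567812345678")

# This is what a FixedSizeBinary(16) slot would hold: 16 raw bytes.
storage = value.bytes

# A consumer that does not know the extension still has valid 16-byte values
# and can pass field_metadata along unchanged.
storage_width = len(storage)

# A consumer that recognizes the extension name can rebuild the UUID.
restored = None
if field_metadata.get("ARROW:extension:name") == "arrow.uuid":
    restored = uuid.UUID(bytes=storage)
```

Either consumer sees the same bytes; only the interpretation differs.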

Extension types may or may not use the
``'ARROW:extension:metadata'`` field. Let's consider some example
31 changes: 24 additions & 7 deletions docs/source/format/Integration.rst
@@ -390,20 +390,37 @@ but can be of any type.

Extension types are, as in the IPC format, represented as their underlying
storage type plus some dedicated field metadata to reconstruct the extension
type. For example, assuming a "uuid" extension type backed by a
FixedSizeBinary(16) storage, here is how a "uuid" field would be represented::
type. For example, assuming a "rational" extension type backed by a
``struct<numer: int32, denom: int32>`` storage, here is how a "rational" field
would be represented::

{
"name" : "name_of_the_field",
"nullable" : /* boolean */,
"type" : {
"name" : "fixedsizebinary",
"byteWidth" : 16
"name" : "struct"
},
"children" : [],
"children" : [
{
"name": "numer",
"type": {
"name": "int",
"bitWidth": 32,
"isSigned": true
}
},
{
"name": "denom",
"type": {
"name": "int",
"bitWidth": 32,
"isSigned": true
}
}
],
"metadata" : [
{"key": "ARROW:extension:name", "value": "uuid"},
{"key": "ARROW:extension:metadata", "value": "uuid-serialized"}
{"key": "ARROW:extension:name", "value": "rational"},
{"key": "ARROW:extension:metadata", "value": "rational-serialized"}
]
}
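
As a rough illustration of the new JSON layout, the sketch below builds the same "rational" field as a Python dictionary and round-trips it through the standard ``json`` module. The helper name ``rational_field`` is hypothetical, and the dictionary mirrors only the keys shown in the snippet above, so it may not be a complete integration-format field.

```python
import json


def rational_field(name, nullable=True, bit_width=32):
    """Build a dict mirroring the 'rational' field snippet (hypothetical helper)."""

    def int_child(child_name):
        # Each child is a signed integer of the requested width.
        return {
            "name": child_name,
            "type": {"name": "int", "bitWidth": bit_width, "isSigned": True},
        }

    return {
        "name": name,
        "nullable": nullable,
        "type": {"name": "struct"},
        "children": [int_child("numer"), int_child("denom")],
        "metadata": [
            {"key": "ARROW:extension:name", "value": "rational"},
            {"key": "ARROW:extension:metadata", "value": "rational-serialized"},
        ],
    }


field = rational_field("name_of_the_field")

# Serializing and parsing again confirms the structure is valid JSON.
decoded = json.loads(json.dumps(field))
extension_name = next(
    item["value"]
    for item in decoded["metadata"]
    if item["key"] == "ARROW:extension:name"
)
```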

185 changes: 122 additions & 63 deletions docs/source/python/extending_types.rst
@@ -116,73 +116,103 @@ a :class:`~pyarrow.Array` or a :class:`~pyarrow.ChunkedArray`.
Defining extension types ("user-defined types")
-----------------------------------------------

Arrow has the notion of extension types in the metadata specification as a
possibility to extend the built-in types. This is done by annotating any of the
built-in Arrow data types (the "storage type") with a custom type name and
optional serialized representation ("ARROW:extension:name" and
"ARROW:extension:metadata" keys in the Field’s custom_metadata of an IPC
message).
See the :ref:`format_metadata_extension_types` section of the metadata
specification for more details.

Pyarrow allows you to define such extension types from Python by subclassing
:class:`ExtensionType` and giving the derived class its own extension name
and serialization mechanism. The extension name and serialized metadata
can potentially be recognized by other (non-Python) Arrow implementations
Arrow affords a notion of extension types which allow users to annotate data
types with additional semantics. This allows developers both to
specify custom serialization and deserialization routines (for example,
to :ref:`Python scalars <custom-scalar-conversion>` and
:ref:`pandas <conversion-to-pandas>`) and to more easily interpret data.

In Arrow, :ref:`extension types <format_metadata_extension_types>`
are specified by annotating any of the built-in Arrow data types
(the "storage type") with a custom type name and, optionally, a byte
array that can be used to provide additional metadata (referred to as
"parameters" in this documentation). These appear as the
``ARROW:extension:name`` and ``ARROW:extension:metadata`` keys in the
Field's ``custom_metadata``.

Note that since these annotations are part of the Arrow specification,
they can potentially be recognized by other (non-Python) Arrow consumers
such as PySpark.

For example, we could define a custom UUID type for 128-bit numbers which can
be represented as ``FixedSizeBinary`` type with 16 bytes::

class UuidType(pa.ExtensionType):

def __init__(self):
super().__init__(pa.binary(16), "my_package.uuid")

def __arrow_ext_serialize__(self):
# Since we don't have a parameterized type, we don't need extra
# metadata to be deserialized
return b''
PyArrow allows you to define extension types from Python by subclassing
:class:`ExtensionType` and giving the derived class its own extension name
and mechanism to (de)serialize any parameters. For example, we could define
a custom rational type for fractions which can be represented as a pair of
integers::

class RationalType(pa.ExtensionType):

def __init__(self, data_type: pa.DataType):
if not pa.types.is_integer(data_type):
raise TypeError(f"data_type must be an integer type not {data_type}")

super().__init__(
pa.struct(
[
("numer", data_type),
("denom", data_type),
],
),
"my_package.rational",
)

def __arrow_ext_serialize__(self) -> bytes:
# No parameters are necessary
return b""

@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
# Sanity checks, not required but illustrate the method signature.
assert storage_type == pa.binary(16)
assert serialized == b''
# Return an instance of this subclass given the serialized
# metadata.
return UuidType()
assert pa.types.is_struct(storage_type)
assert pa.types.is_integer(storage_type[0].type)
assert storage_type[0].type == storage_type[1].type
assert serialized == b""

# return an instance of this subclass
return RationalType(storage_type[0].type)


The special methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``
define the serialization of an extension type instance. For non-parametric
types such as the above, the serialization payload can be left empty.
define the serialization and deserialization of an extension type instance.

This can now be used to create arrays and tables holding the extension type::

>>> uuid_type = UuidType()
>>> uuid_type.extension_name
'my_package.uuid'
>>> uuid_type.storage_type
FixedSizeBinaryType(fixed_size_binary[16])

>>> import uuid
>>> storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
>>> arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
>>> rational_type = RationalType(pa.int32())
>>> rational_type.extension_name
'my_package.rational'
>>> rational_type.storage_type
StructType(struct<numer: int32, denom: int32>)

>>> storage_array = pa.array(
... [
... {"numer": 10, "denom": 17},
... {"numer": 20, "denom": 13},
... ],
... type=rational_type.storage_type,
... )
>>> arr = rational_type.wrap_array(storage_array)
>>> # or equivalently
>>> arr = pa.ExtensionArray.from_storage(rational_type, storage_array)
>>> arr
<pyarrow.lib.ExtensionArray object at 0x7f75c2f300a0>
<pyarrow.lib.ExtensionArray object at 0x1067f5420>
-- is_valid: all not null
-- child 0 type: int32
[
10,
20
]
-- child 1 type: int32
[
A6861959108644B797664AEEE686B682,
718747F48E5F4058A7261E2B6B228BE8,
7FE201227D624D96A5CD8639DEF2A68B,
C6CA8C7F95744BFD9462A40B3F57A86C
17,
13
]

This array can be included in RecordBatches, sent over IPC and received in
another Python process. The receiving process must explicitly register the
extension type for deserialization, otherwise it will fall back to the
storage type::

>>> pa.register_extension_type(UuidType())
>>> pa.register_extension_type(RationalType(pa.int32()))

For example, creating a RecordBatch and writing it to a stream using the
IPC protocol::
@@ -197,19 +227,45 @@ and then reading it back yields the proper type::

>>> with pa.ipc.open_stream(buf) as reader:
... result = reader.read_all()
>>> result.column('ext').type
UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
>>> result.column("ext").type
RationalType(StructType(struct<numer: int32, denom: int32>))

Further, note that while we registered the concrete type
``RationalType(pa.int32())``, the same extension name
(``"my_package.rational"``) is used by ``RationalType(integer_type)``
for *all* Arrow integer types. As such, the above code also allows users to
(de)serialize these data types::

>>> big_rational_type = RationalType(pa.int64())
>>> storage_array = pa.array(
... [
... {"numer": 10, "denom": 17},
... {"numer": 20, "denom": 13},
... ],
... type=big_rational_type.storage_type,
... )
>>> arr = big_rational_type.wrap_array(storage_array)
>>> batch = pa.RecordBatch.from_arrays([arr], ["ext"])
>>> sink = pa.BufferOutputStream()
>>> with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
... writer.write_batch(batch)
>>> buf = sink.getvalue()
>>> with pa.ipc.open_stream(buf) as reader:
... result = reader.read_all()
>>> result.column("ext").type
RationalType(StructType(struct<numer: int64, denom: int64>))

The receiving application doesn't need to be Python but can still recognize
the extension type as a "my_package.uuid" type, if it has implemented its own
the extension type as a "my_package.rational" type if it has implemented its own
extension type to receive it. If the type is not registered in the receiving
application, it will fall back to the storage type.

Parameterized extension type
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The above example used a fixed storage type with no further metadata. But
more flexible, parameterized extension types are also possible.
The above example illustrated how to construct an extension type that requires
no additional metadata beyond its storage type. But Arrow also provides more
flexible, parameterized extension types.

The example given here implements an extension type for the `pandas "period"
data type <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-span-representation>`__,
@@ -225,7 +281,7 @@ of the given frequency since 1970.
# attributes need to be set first before calling
# super init (as that calls serialize)
self._freq = freq
super().__init__(pa.int64(), 'my_package.period')
super().__init__(pa.int64(), "my_package.period")

@property
def freq(self):
@@ -240,7 +296,7 @@ of the given frequency since 1970.
# metadata.
serialized = serialized.decode()
assert serialized.startswith("freq=")
freq = serialized.split('=')[1]
freq = serialized.split("=")[1]
return PeriodType(freq)
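
The parameter encoding used by ``PeriodType`` is just a small byte string. The sketch below isolates that step in two hypothetical helpers, ``serialize_freq`` and ``deserialize_freq``, which are not part of PyArrow. Unlike the snippet above, it splits on the first ``=`` only, so a value that itself contains ``=`` would survive the round trip.

```python
def serialize_freq(freq: str) -> bytes:
    """Encode the frequency parameter as bytes (hypothetical helper)."""
    return f"freq={freq}".encode()


def deserialize_freq(serialized: bytes) -> str:
    """Recover the frequency from the serialized bytes (hypothetical helper)."""
    text = serialized.decode()
    if not text.startswith("freq="):
        raise ValueError(f"unexpected serialized metadata: {text!r}")
    # maxsplit=1 keeps any later '=' characters inside the value.
    return text.split("=", 1)[1]


payload = serialize_freq("D")
recovered = deserialize_freq(payload)
```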

Here, we ensure to store all information in the serialized metadata that is
@@ -274,7 +330,7 @@ the data as a 2-D Numpy array ``(N, 3)`` without any copy::
super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")

def __arrow_ext_serialize__(self):
return b''
return b""

@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
@@ -313,6 +369,8 @@ This array can be sent over IPC, received in another Python process, and the cus
extension array class will be preserved (as long as the receiving process registers
the extension type using :func:`register_extension_type` before reading the IPC data).

.. _custom-scalar-conversion:

Custom scalar conversion
~~~~~~~~~~~~~~~~~~~~~~~~

@@ -335,7 +393,7 @@ For example, if we wanted the above example 3D point type to return a custom
super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")

def __arrow_ext_serialize__(self):
return b''
return b""

@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
@@ -354,6 +412,7 @@ Arrays built using this extension type now provide scalars that convert to our `
>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]

.. _conversion-to-pandas:

Conversion to pandas
~~~~~~~~~~~~~~~~~~~~
@@ -436,16 +495,16 @@ Extension arrays can be used as columns in ``pyarrow.Table`` or

>>> data = [
... pa.array([1, 2, 3]),
... pa.array(['foo', 'bar', None]),
... pa.array(["foo", "bar", None]),
... pa.array([True, None, True]),
... tensor_array,
... tensor_array_2
... ]
>>> my_schema = pa.schema([('f0', pa.int8()),
... ('f1', pa.string()),
... ('f2', pa.bool_()),
... ('tensors_int', tensor_type),
... ('tensors_float', tensor_type_2)])
>>> my_schema = pa.schema([("f0", pa.int8()),
... ("f1", pa.string()),
... ("f2", pa.bool_()),
... ("tensors_int", tensor_type),
... ("tensors_float", tensor_type_2)])
>>> table = pa.Table.from_arrays(data, schema=my_schema)
>>> table
pyarrow.Table
@@ -541,7 +600,7 @@ or

.. code-block:: python

>>> tensor_type = pa.fixed_shape_tensor(pa.bool_(), [2, 2, 3], dim_names=['C', 'H', 'W'])
>>> tensor_type = pa.fixed_shape_tensor(pa.bool_(), [2, 2, 3], dim_names=["C", "H", "W"])

for ``NCHW`` format where:
