DRAFT: Extension types #451

Draft: wants to merge 1 commit into base: master

66 changes: 66 additions & 0 deletions LogicalTypes.md
@@ -767,6 +767,72 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

## EXTENSION

Extension types allow the Parquet type system to be open-ended. An extension
type can be used to signal a third-party type that has no equivalent in the
core Parquet type system.

Extension types will typically be specified by third-party communities, or
be vendor-specific. An extension type specification will typically contain
the following elements:

1. The extension type must be identified by a dotted name with the first name
component clearly denoting the authority that defined the type. The
`parquet.` namespace is reserved for use by the Parquet community and
must not be used for third-party extension types.

2. The extension type must define which parameters it takes, if any. It must
define a binary serialization to store those parameters in the Parquet schema.
It is recommended (but not required) that the serialization is a UTF-8 encoding
of a JSON object.

3. The extension type must define which kind of node it annotates: leaf
or non-leaf. If non-leaf, the allowed subtree shape must be defined.

4. If the extension type annotates leaf nodes, it must define the allowed
physical type(s).

5. If the extension type annotates leaf nodes, it may also define its sort
order (see the `ColumnOrder` definition in the Thrift format). If it does
not, then the extension type is unordered.
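
As an illustration only, the five elements above could be captured in a small record, together with a check of the naming rule. The class name `ExtensionTypeSpec`, its field names, and the helper `is_valid_third_party_name` are hypothetical and are not part of this proposal; the check assumes a "dotted name" means at least two non-empty components.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Namespace reserved for the Parquet community (from the text above).
RESERVED_NAMESPACE = "parquet."


@dataclass(frozen=True)
class ExtensionTypeSpec:
    """Hypothetical container for the five elements of a specification."""
    name: str                              # 1. dotted, authority-prefixed name
    parameter_format: str                  # 2. how parameters are serialized
    annotates_leaf: bool                   # 3. leaf or non-leaf node
    physical_types: Tuple[str, ...] = ()   # 4. allowed physical types (leaf only)
    sort_order: Optional[str] = None       # 5. None means unordered


def is_valid_third_party_name(name: str) -> bool:
    """Check the naming rule for a third-party extension type.

    Assumption: a dotted name has at least two non-empty components.
    """
    parts = name.split(".")
    if len(parts) < 2 or any(not part for part in parts):
        return False
    # Third-party types must stay out of the reserved namespace.
    return not name.startswith(RESERVED_NAMESPACE)
```

Note that `parquetnet.ipv6` passes this check because only the exact `parquet.` prefix is reserved.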

### Reading extension types

An extension type is identified by its name. A reader will typically have
a collection of extension types that it knows about; it may also offer a way
for the user to register additional extension types.

When a reader encounters an extension type in a Parquet schema, it should try
to match it by name to its known extension types. If it does not recognize
the extension type, then it should read it as the underlying physical type
and should not try to interpret the column's statistics. It may however
Contributor

min/max statistics, others should be valid?

Member Author

Oops, yes, you're right.

Member

perhaps including column index?

preserve the extension type information when transmitting the data to other
systems, or for round-tripping purposes.
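
The lookup-with-fallback behavior described in this section might look roughly like the sketch below. The registry, `register_extension_type`, and `read_value` are hypothetical names chosen for illustration; no existing Parquet reader is claimed to expose this API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping an extension type name to a decoder that
# turns the raw physical value into an application-level value.
Decoder = Callable[[bytes], object]
_REGISTRY: Dict[str, Decoder] = {}


def register_extension_type(name: str, decoder: Decoder) -> None:
    """Let the user make an additional extension type known to the reader."""
    _REGISTRY[name] = decoder


def read_value(extension_name: str, raw: bytes) -> Tuple[object, bool]:
    """Return (value, recognized).

    When the name is not recognized, the raw physical value is returned
    unchanged. The caller can use the `recognized` flag to skip interpreting
    statistics, and can keep `extension_name` around for round-tripping.
    """
    decoder = _REGISTRY.get(extension_name)
    if decoder is None:
        return raw, False
    return decoder(raw), True
```

A caller would register the types it understands up front, then call `read_value` with the name found in the schema; unknown names degrade to the physical bytes rather than failing.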

### Examples

The fictional ParquetNet community defines an IPv6 extension type
with the following characteristics:

1. Name: `parquetnet.ipv6`
2. Parameters: none, the serialization is always empty
3. Node type: only leaf
4. Physical type: only FIXED_LEN_BYTE_ARRAY(16)
5. Sort order: binary lexicographic order (the IP addresses use big-endian encoding)
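
To see why big-endian encoding makes binary lexicographic order a sensible sort order here, the following sketch encodes a few addresses with Python's standard `ipaddress` module and checks that sorting the 16-byte values agrees with sorting by numeric address value. The helper name `encode_ipv6` and the sample addresses are illustrative assumptions.

```python
import ipaddress


def encode_ipv6(address: str) -> bytes:
    """Encode a textual IPv6 address as 16 bytes in big-endian order."""
    return ipaddress.IPv6Address(address).packed


addresses = ["2001:db8::2", "::1", "2001:db8::1"]
encoded = [encode_ipv6(a) for a in addresses]

# Python compares bytes objects lexicographically as unsigned bytes.
by_bytes = sorted(encoded)
# Numeric order of the addresses, interpreted as 128-bit integers.
by_value = sorted(encoded, key=lambda b: int.from_bytes(b, "big"))

assert by_bytes == by_value
```

With a little-endian encoding the two orders would generally differ, which is why the specification of such a type needs to state the byte order.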

The fictional ParquetScience community defines a double-precision fixed-shaped
tensor type with the following characteristics:

1. Name: `parquetsci.f64tensor`
2. Parameters: the number of dimensions `ndim` (an integer), and the shape of the
tensor elements (a tuple of `ndim` integers). They are serialized as a JSON
object with the shape as an array, for example: `{"ndim": 3, "shape": [4, 5, 6]}`
3. Node type: only leaf
4. Physical type: only FIXED_LEN_BYTE_ARRAY(nbytes) where `nbytes` is 8 times
the product of the shape's dimensions
5. Sort order: unordered
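
A minimal sketch of how the parameters of this fictional tensor type could be serialized and how the physical type length follows from them. The function names are hypothetical, and the sketch writes the shape as a JSON array, since JSON has no tuple literal.

```python
import json
from math import prod
from typing import Sequence


def serialize_params(shape: Sequence[int]) -> bytes:
    """Serialize the tensor parameters as a UTF-8 encoded JSON object."""
    params = {"ndim": len(shape), "shape": list(shape)}
    return json.dumps(params).encode("utf-8")


def fixed_len_nbytes(serialization: bytes) -> int:
    """Compute the FIXED_LEN_BYTE_ARRAY length implied by the parameters."""
    params = json.loads(serialization.decode("utf-8"))
    if params["ndim"] != len(params["shape"]):
        raise ValueError("ndim does not match the length of shape")
    # 8 bytes per double-precision element, times the number of elements.
    return 8 * prod(params["shape"])
```

For the shape (4, 5, 6) this gives 8 * 4 * 5 * 6 = 960 bytes per value.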

## UNKNOWN (always null)

Sometimes, when discovering the schema of existing data, values are always null
46 changes: 45 additions & 1 deletion src/main/thrift/parquet.thrift
@@ -288,6 +288,27 @@ struct Statistics {
8: optional bool is_min_value_exact;
}

/**
* An extension type description
*
* Extension types allow for third-party semantics not provided by the core
* Parquet type system.
*
* `name` is a dotted name reliably identifying the extension type.
* Names beginning with "parquet." are reserved for standardization within
* the Parquet project.
*
* If the extension type is parametric, then `serialization` is an encoding
* of the extension type's parameters. It is recommended (but not required)
* that the parameters are serialized as a JSON object in UTF-8 encoding.
*
* If the extension type is not parametric, then `serialization` is empty.
*/
struct ExtensionTypeDescription {
Member

Why choose a dedicated ExtensionTypeDescription struct over list<KeyValue>? I'm afraid that a binary-typed field may invite misuse by users.

Member Author

What would the list<KeyValue> contain and where would it reside? I'm not following you.

Member

struct ExtensionTypeDescription {
  1: optional list<KeyValue> metadata
}

And specify the required keys for each extension type, pretty much like what Arrow does.

Member Author

This does not make sense, does it? The keys will always be the same, so why not reify them in the Thrift spec as the PR currently does?

Member Author

Or are you thinking about extension-specific parameter keys as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html ?

Member Author

Note we would still need the extension name, so this would be:

struct ExtensionTypeDescription {
  1: required string name
  2: optional list<KeyValue> parameters
}

Member

> Or are you thinking about extension-specific parameter keys as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html ?

Yes, I mean something like this.

  1: required string name
  2: optional binary serialization
}

/** Empty structs to use as logical type annotations */
struct StringType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct UUIDType {} // allowed for FIXED[16], must encode raw UUID bytes
@@ -380,6 +401,21 @@ struct JsonType {
struct BsonType {
}

/**
* Extension type annotation
*
* `type_index` is an index into `FileMetaData.extension_types`. This
* indirection allows for efficient representation of schemas with many
* columns of a given extension type.
*
* Each extension type specification will define the set of allowed physical
* types (for example, a hypothetical IPv6 extension type would require
* FIXED_LEN_BYTE_ARRAY(16)).
*/
struct ExtensionType {
  1: required i32 type_index
}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -410,6 +446,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: ExtensionType EXTENSION // no compatible ConvertedType
}

/**
@@ -956,7 +993,6 @@ struct TypeDefinedOrder {}
* for this column should be ignored.
*/
union ColumnOrder {

/**
* The sort orders for logical types are:
* UTF8 - unsigned byte-wise comparison
@@ -980,6 +1016,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* EXTENSION - extension type-specific
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true
@@ -1211,6 +1248,13 @@ struct FileMetaData {
* Used only in encrypted files with plaintext footer.
*/
9: optional binary footer_signing_key_metadata

/**
* A list of all extension types used in the Parquet schema, if any.
* The entries in this list are referenced through the `ExtensionType.type_index`
* of each ExtensionType field.
*/
10: optional list<ExtensionTypeDescription> extension_types
}

/** Crypto metadata for files with encrypted footer **/
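
For readers following the `type_index` indirection in the Thrift changes, here is a rough sketch of how a reader might resolve an `ExtensionType` annotation against `FileMetaData.extension_types`. The Python classes below are plain stand-ins that mirror the proposed Thrift structs; they are not generated Thrift code, and `resolve_extension_type` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtensionTypeDescription:
    name: str                    # dotted name identifying the extension type
    serialization: bytes = b""   # empty when the type is not parametric


@dataclass
class ExtensionType:
    type_index: int              # index into FileMetaData.extension_types


@dataclass
class FileMetaData:
    extension_types: List[ExtensionTypeDescription] = field(default_factory=list)


def resolve_extension_type(
    metadata: FileMetaData, annotation: ExtensionType
) -> ExtensionTypeDescription:
    """Follow the type_index indirection, rejecting out-of-range indices."""
    index = annotation.type_index
    if not 0 <= index < len(metadata.extension_types):
        raise IndexError(f"type_index {index} is out of range")
    return metadata.extension_types[index]
```

Because many columns can share one `type_index`, the name and serialized parameters are stored once per file rather than once per column.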