
Spec: add variant type #10831

Open · wants to merge 5 commits into base: main
Conversation

Contributor

@aihuaxu aihuaxu commented Jul 31, 2024

Help: #10392

Spec: add variant type

Proposal: https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit

This is to lay out the spec for the variant type. The specs are placed in the Parquet project (see the variant spec and the shredding spec).

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jul 31, 2024
@aihuaxu
Contributor Author

aihuaxu commented Jul 31, 2024

cc @rdblue, @RussellSpitzer and @flyrain

@RussellSpitzer
Member

I do want to make sure we don't do a hostile fork of the spec from Spark here, so let's make sure we get support from them to move the spec here before we merge. At the same time we should start going through the wording and continue to discuss the specs. I still think that would be easier to do in a public Google Doc than in GitHub, though, IMHO.

@sfc-gh-aixu

I do want to make sure we don't do a hostile fork of the spec from Spark here, so let's make sure we get support from them to move the spec here before we merge. At the same time we should start going through the wording and continue to discuss the specs. I still think that would be easier to do in a public Google Doc than in GitHub, though, IMHO.

Definitely. It's not for merge yet; I'm mostly trying to get the comments in place. Makes sense to move that to a Google doc and link it here.

@aihuaxu aihuaxu marked this pull request as ready for review October 9, 2024 20:59
@RussellSpitzer
Member

This needs some notes in Partition Transforms; I think we should explicitly disallow identity.

For Appendix B, we should define something or state explicitly that we don't define it for variant.

Appendix C: we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.

Under Sort Orders we should probably note that you cannot sort on a variant?

Appendix D: Single Value Serialization needs an entry; we can probably write "Not Supported" for now. JSON needs an entry.

@aihuaxu
Contributor Author

aihuaxu commented Oct 18, 2024

This needs some notes in Partition Transforms; I think we should explicitly disallow identity.

For Appendix B, we should define something or state explicitly that we don't define it for variant.

Appendix C: we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.

Under Sort Orders we should probably note that you cannot sort on a variant?

Appendix D: Single Value Serialization needs an entry; we can probably write "Not Supported" for now. JSON needs an entry.

Thanks @RussellSpitzer, I missed those sections and just updated.

I marked Partition Transforms, sorting, and hashing as not supported/allowed for now.
For Appendix C, I think it should be just `variant`, similar to a primitive type, since the section covers the Iceberg schema as I understand it.

format/spec.md Outdated
@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

Note:

1. `variant` columns are not valid for sorting.
Contributor

I thought whether a variant is orderable is determined by engines, per previous discussion. Are we explicitly saying that all variants are not orderable?

Contributor Author

@flyrain Yeah, the engines can actually support ordering. Here we basically will not define the sort order. I'm wondering if we should define one in the future.

So I will update to "The ability to sort variant columns and the specific sort order is determined by the engines."

Contributor

We are only saying that Variant values cannot be present in an Iceberg sort order. Engines can sort if they choose.

We could also define a sort order if we wanted, but this seems like a good place to start.

Contributor Author

Do we need to explicitly say "Variant values cannot be present in an Iceberg sort order. " in the spec?

Member

I think we should specifically write that. Don't we do so for Maps and Arrays?

Contributor Author

I don't see us calling anything out for Maps and Arrays in the sorting section, but the code has a check that errors out with "Cannot sort by non-primitive source field".

I didn't add an explicit callout for Variant (just like Maps and Arrays) either, but let me know if that makes sense.

format/spec.md Outdated
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`void`** | Always produces `null` | Any | Source type or `int` |
Contributor

Let's revert the whitespace changes, please. It makes these tables hard to maintain and less readable.

format/spec.md Outdated
@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

Note:

1. The ability to sort `variant` columns and the specific sort order is determined by the engines.
Contributor

Do we need this? I think anything we don't specify is up to engines already.

Contributor Author

OK. I will remove that then. Do we need to call out "Variant values cannot be present in an Iceberg sort order"?

Member

I think we should specifically forbid sort orders containing a variant. I think the spec is actually underdetermined here.

We have the following checks in the Reference Implementation

```java
ValidationException.check(
    sourceType != null, "Cannot find source column for sort field: %s", field);
ValidationException.check(
    sourceType.isPrimitiveType(),
    "Cannot sort by non-primitive source field: %s",
    sourceType);
ValidationException.check(
    field.transform().canTransform(sourceType),
    "Invalid source type %s for transform: %s",
    sourceType,
    field.transform());
```

So currently, even though we don't specify this here, you cannot make a sort order with array or map. I think we should explicitly call this out and add variant as well. My real concern here is that we add the ability to sort on something but don't define what that sorting actually looks like.
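The checks quoted above can be mirrored in a small Python sketch (hypothetical names and type strings; the real implementation is the Java snippet above) to show variant being rejected the same way as struct, list, and map:

```python
# Hypothetical sketch of the sort-field validation, extended to reject
# variant alongside the other non-primitive source types.
PRIMITIVE_TYPES = {"boolean", "int", "long", "float", "double", "string",
                   "date", "timestamp", "uuid", "binary", "decimal"}

def check_sort_field(source_type: str) -> None:
    """Raise if a sort order references a column that cannot be sorted."""
    if source_type is None:
        raise ValueError("Cannot find source column for sort field")
    # struct, list, map (and now variant) are not primitive, so they
    # cannot appear as the source of a sort field.
    if source_type not in PRIMITIVE_TYPES:
        raise ValueError(f"Cannot sort by non-primitive source field: {source_type}")

check_sort_field("long")         # primitive: accepted
try:
    check_sort_field("variant")  # rejected, like struct/list/map
except ValueError as e:
    print(e)                     # Cannot sort by non-primitive source field: variant
```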

format/spec.md Outdated
@@ -178,6 +178,11 @@ A **`list`** is a collection of values with some element type. The element field

A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

### Semi-structured Types

A **`variant`** is a type to represent semi-structured data. A variant value can store a value of any other type, including `null`, any primitive, struct, list, or map value. The variant encoding is defined in the [Apache Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). The variant type is added in [v3](#version-3).
Contributor

@rdblue rdblue Oct 24, 2024

The other cases are more specific about what is present, rather than what is being represented. Also, I don't think that the description is accurate. A variant cannot store maps. I would rather state clearly what a variant stores so that there is no ambiguity.

How about this instead?

A variant is a binary value that encodes semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the Parquet project. Support for Variant is added in Iceberg v3.

Variants are similar to JSON with a wider set of primitive values including date, timestamp, timestamptz,
binary, and floating points.

Variant values may contain nested types:

  • An array is an ordered collection of variant values
  • An object is a collection of fields that are a string key and a variant value

As a semi-structured type, there are important differences between variant and Iceberg's other types:

  • Variant arrays are similar to lists, but may contain any variant value rather than a fixed element type
  • Variant objects are similar to structs, but may contain variable fields identified by name and field values may be any variant value rather than a fixed field type
  • Variant primitives are narrower than Iceberg's primitive types: uuid, time, fixed(L), and nanosecond precision timestamp(tz) are not supported
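The row-to-row heterogeneity described above can be illustrated with plain Python values standing in for decoded variants (illustrative only; this is not the binary encoding):

```python
# One variant column: the structure and types may differ on every row.
rows = [
    {"event": "click", "x": 10, "y": 20},        # object with int fields
    {"event": "purchase", "items": ["a", "b"]},  # object with an array field
    ["free", "form", 42],                        # bare array of mixed values
    "just a string",                             # bare primitive
    None,                                        # null
]

# A fixed Iceberg struct/list/map schema could not hold all of these.
for v in rows:
    print(type(v).__name__, "->", v)
```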

Contributor Author

@aihuaxu aihuaxu Oct 25, 2024

I see. We need to define exactly how and what can be stored in a Variant.

The variant actually uses two binary values. Should we mention it like that, rather than as a single binary value?

I changed it to "A variant is a value that stores semi-structured data". As I understand from the other definitions, we don't need to mention how values are stored, only what can be stored, so I removed "binary".
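A minimal sketch of the two-buffer shape being discussed (placeholder bytes; the actual buffer layouts are defined by the Parquet VariantEncoding spec, not reproduced here):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariantValue:
    # Both buffers are required; their internal layout is defined by the
    # Parquet VariantEncoding spec, which this sketch does not implement.
    metadata: bytes  # key dictionary shared by objects in this value
    value: bytes     # the encoded value itself

v = VariantValue(metadata=b"\x01\x00", value=b"\x0c\x2a")  # placeholder bytes
print(len(v.metadata), len(v.value))
```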

format/spec.md Outdated
@@ -1133,6 +1142,7 @@ Hash results are not dependent on decimal scale, which is part of the type, not
4. UUIDs are encoded using big endian. The test UUID for the example above is: `f79c3e09-677c-4bbd-a479-3f349cb785e7`. This UUID encoded as a byte array is:
`F7 9C 3E 09 67 7C 4B BD A4 79 3F 34 9C B7 85 E7`
5. `doubleToLongBits` must give the IEEE 754 compliant bit representation of the double value. All `NaN` bit patterns must be canonicalized to `0x7ff8000000000000L`. Negative zero (`-0.0`) must be canonicalized to positive zero (`0.0`). Float hash values are the result of hashing the float cast to double to ensure that schema evolution does not change hash values if float types are promoted.
6. `variant` values are currently not valid for bucketing and so they are not hashed.
Contributor

I don't think Notes is a good place for this. Can you start a new paragraph about unhashed types?

A 32-bit hash is not defined for variant because there are multiple representations for equivalent values.
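A hypothetical illustration of why byte-level hashing fails here: the same logical value can be encoded at different widths, so equal values need not hash equally (SHA-256 stands in for Iceberg's 32-bit Murmur3; the width-variant encodings are an assumption for illustration):

```python
import hashlib

def bucket(raw: bytes, n: int = 16) -> int:
    # Stand-in byte-level hash; any hash over the raw encoding has the
    # same problem when one value has several valid encodings.
    return int.from_bytes(hashlib.sha256(raw).digest()[:4], "big") % n

# The integer 1 could be stored as a 1-byte or 2-byte little-endian int
# inside a variant value: same logical value, different bytes.
as_int8 = (1).to_bytes(1, "little")
as_int16 = (1).to_bytes(2, "little")

# Equal values, but the raw encodings differ, so the buckets need not agree.
print(bucket(as_int8), bucket(as_int16))
```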

format/spec.md Outdated
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`variant`** | Not supported |
Contributor

@RussellSpitzer, didn't we want to say that this should be a Variant value that contains a Variant of a value for each shredded column? I don't want to miss this in v3 or else we won't be able to do file skipping.

Contributor

We will need to specify how to handle the metadata (concat?)

Member

Yes, I agree, but currently we don't have any info about shredding, so I didn't want to include it yet.

format/spec.md Outdated
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |
| **`variant`** | **`Same JSON representation in this table for stored type`** | `null`, `true`, `{"1": 1, "2": "bar"}` | The JSON representation matches the format shown in this table for the type stored in the Variant. |
Contributor

I think this is insufficient because it loses type information. For example, if there is a Variant that contains a timestamp, the timestamp's type is lost. This also appears to use the struct representation for an object, but a Variant has no field IDs.

I think that we probably do want to use a JSON representation, but we will need to make sure that it aligns with the JSON conversion defined in the Variant spec (https://github.com/apache/parquet-format/pull/461/files#diff-80a56bf0d841087ab5038020e4b78119d6ceb44684719ba4d3a6e22effb36eb9R459) and that we have defined requirements for recovering the types.

Note that the JSON conversion in the Variant spec differs from this section:

  • Binary values are hexadecimal strings here (:facepalm:) and base64 in Variant
  • Decimal values are strings here and numbers in Variant

We may want to not allow Variant in JSON because of the type loss. Or we may want to specify a subset that can be recovered, like booleans, integers, floats, strings, and arrays. There are two places where this section is used: in the REST protocol for sending lower/upper bounds and in default values. We don't really need default values for variants (but would have to disallow them) and lower/upper bounds would generally work with a subset.

The other option is to encode variant as a base64 binary value. That is a bit ugly in JSON, but it is not lossy.
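The base64 option could look roughly like this (the field names `metadata` and `value` are assumptions; the actual representation is undecided in the discussion above). It is opaque, but the round trip is exact:

```python
import base64
import json

metadata = b"\x01\x00\x00"  # placeholder variant metadata bytes
value = b"\x0c\x2a"         # placeholder variant value bytes

# Lossless but opaque: ship both buffers as base64 strings inside JSON.
doc = json.dumps({
    "metadata": base64.b64encode(metadata).decode("ascii"),
    "value": base64.b64encode(value).decode("ascii"),
})
print(doc)

# Decoding recovers the exact bytes, so no type information is lost.
back = json.loads(doc)
assert base64.b64decode(back["metadata"]) == metadata
assert base64.b64decode(back["value"]) == value
```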

Contributor Author

I don't exactly understand how this gets used. I will make changes to say "The JSON representation with the encoding defined in https://github.com/apache/parquet-format/pull/461/files#diff-80a56bf0d841087ab5038020e4b78119d6ceb44684719ba4d3a6e22effb36eb9R459 of the Variant spec". Would that work for the lower/upper bounds usage?

I notice that `SingleValueParser.fromJson(Type type, JsonNode defaultValue)` and `toJson(Type type, Object defaultValue, JsonGenerator generator)` would error out for Variant. Is that sufficient to disallow `defaultValue`?

@rdblue
Contributor

rdblue commented Oct 24, 2024

@aihuaxu, I think there are a couple of things missing:

  • The Avro appendix should be updated to state that a Variant is stored as a Record with two fields, a required binary metadata and a required binary value. Shredding is not supported in Avro.
  • The ORC appendix should be updated to state that a Variant is stored as a struct with two fields, a required binary metadata and a required binary value. Type attribute should be iceberg.struct-type=variant. Shredding is not supported in ORC.
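The Avro mapping proposed above would amount to a record schema along these lines (the record name is illustrative; the two required `bytes` fields are the substance of the proposal):

```python
import json

# Sketch of an Avro schema for an unshredded variant column: a record
# with two required binary fields, per the mapping proposed above.
variant_avro_schema = {
    "type": "record",
    "name": "variant",  # illustrative name
    "fields": [
        {"name": "metadata", "type": "bytes"},
        {"name": "value", "type": "bytes"},
    ],
}
print(json.dumps(variant_avro_schema, indent=2))
```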

@rdblue rdblue closed this Oct 24, 2024
@rdblue rdblue reopened this Oct 24, 2024
@rdblue
Contributor

rdblue commented Oct 24, 2024

Oops. I didn't mean to close this.

@aihuaxu
Contributor Author

aihuaxu commented Oct 25, 2024

@aihuaxu, I think there are a couple of things missing:

  • The Avro appendix should be updated to state that a Variant is stored as a Record with two fields, a required binary metadata and a required binary value. Shredding is not supported in Avro.
  • The ORC appendix should be updated to state that a Variant is stored as a struct with two fields, a required binary metadata and a required binary value. Type attribute should be iceberg.struct-type=variant. Shredding is not supported in ORC.

Thanks @rdblue, I thought we would make those changes when we start to work on Avro/ORC. I added that.

I don't have much context for the JSON conversion. Not sure if we need to add more info.

@@ -1110,6 +1125,7 @@ Maps with non-string keys must use an array representation with the `map` logica
|**`struct`**|`record`||
|**`list`**|`array`||
|**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.|
|**`variant`**|`record` with `metadata` and `value` fields|Shredding is not supported in Avro.|


I'd leave out the shredding note until we define it and support it.

Member

Sorry if this was requested somewhere else, but I'd keep any mention of "shredding" out until we add it in another PR.

@@ -1287,6 +1307,7 @@ Types are serialized according to this table:
|**`struct`**|`JSON object: {`<br />&nbsp;&nbsp;`"type": "struct",`<br />&nbsp;&nbsp;`"fields": [ {`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"id": <field id int>,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"required": <boolean>,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"type": <type JSON>,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"doc": <comment string>,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"initial-default": <JSON encoding of default value>,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"write-default": <JSON encoding of default value>`<br />&nbsp;&nbsp;&nbsp;&nbsp;`}, ...`<br />&nbsp;&nbsp;`] }`|`{`<br />&nbsp;&nbsp;`"type": "struct",`<br />&nbsp;&nbsp;`"fields": [ {`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"id": 1,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"name": "id",`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"required": true,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"type": "uuid",`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"initial-default": "0db3e2a8-9d1d-42b9-aa7b-74ebe558dceb",`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"write-default": "ec5911be-b0a7-458c-8438-c9a3e53cffae"`<br />&nbsp;&nbsp;`}, {`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"id": 2,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"name": "data",`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"required": false,`<br />&nbsp;&nbsp;&nbsp;&nbsp;`"type": {`<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"type": "list",`<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`...`<br />&nbsp;&nbsp;&nbsp;&nbsp;`}`<br />&nbsp;&nbsp;`} ]`<br />`}`|
|**`list`**|`JSON object: {`<br />&nbsp;&nbsp;`"type": "list",`<br />&nbsp;&nbsp;`"element-id": <id int>,`<br />&nbsp;&nbsp;`"element-required": <bool>`<br />&nbsp;&nbsp;`"element": <type JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"type": "list",`<br />&nbsp;&nbsp;`"element-id": 3,`<br />&nbsp;&nbsp;`"element-required": true,`<br />&nbsp;&nbsp;`"element": "string"`<br />`}`|
|**`map`**|`JSON object: {`<br />&nbsp;&nbsp;`"type": "map",`<br />&nbsp;&nbsp;`"key-id": <key id int>,`<br />&nbsp;&nbsp;`"key": <type JSON>,`<br />&nbsp;&nbsp;`"value-id": <val id int>,`<br />&nbsp;&nbsp;`"value-required": <bool>`<br />&nbsp;&nbsp;`"value": <type JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"type": "map",`<br />&nbsp;&nbsp;`"key-id": 4,`<br />&nbsp;&nbsp;`"key": "string",`<br />&nbsp;&nbsp;`"value-id": 5,`<br />&nbsp;&nbsp;`"value-required": false,`<br />&nbsp;&nbsp;`"value": "double"`<br />`}`|
| **`variant`**| `JSON string: "variant"`|`"variant"`|
Member

This should link to a well-defined JSON translation of the Variant type, which I think should live in the Parquet spec.

@@ -1436,6 +1457,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`variant`** | Not supported |
Member

@RussellSpitzer RussellSpitzer Nov 1, 2024

I do agree this should be not supported for now. Then, when shredding is included, say something like: for shredded variants only, binary concatenation of metadata and value plus a separator byte, or something. We can figure that out with the shredding addition, though.
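One hedged sketch of the concatenation idea, using a length prefix instead of a separator byte so arbitrary metadata bytes stay unambiguous (nothing here is specified yet; names and layout are assumptions):

```python
import struct

def pack_variant(metadata: bytes, value: bytes) -> bytes:
    # 4-byte little-endian length prefix for metadata, then both buffers.
    # Purely illustrative; the spec has not defined this serialization.
    return struct.pack("<I", len(metadata)) + metadata + value

def unpack_variant(buf: bytes) -> tuple[bytes, bytes]:
    (n,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + n], buf[4 + n:]

packed = pack_variant(b"\x01\x00meta", b"\x0c\x2a")
assert unpack_variant(packed) == (b"\x01\x00meta", b"\x0c\x2a")
print(len(packed))  # 4 + 6 + 2 = 12
```

A length prefix avoids the escaping problem a separator byte would have, since the metadata buffer can contain any byte value.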

@@ -1462,6 +1484,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |
| **`variant`** | **`JSON string`** | `"rO0ABXVyAANbW0JL/RkVZ2fbNwIAAHhwAAAAAnVyAAJbQqzzF/gGCFTgAgAAeHAAAAAMAQIABAdu"` | Stores base64-encoded variant value |
Member

Don't we have to explain how we combine the metadata and value here as well?

Labels: Specification Issues that may introduce spec changes.

6 participants