Spec: add variant type #10831
base: main
b868ea6 to 1a0404b
cc @rdblue, @RussellSpitzer and @flyrain
I do want to make sure we don't do a hostile fork of the spec from Spark here, so let's make sure we get support from them to move the spec here before we merge. At the same time, we should start going through the wording and continue to discuss the specs. I still think that would be easier to do in a public Google Doc than in GitHub, IMHO.
Definitely. It's not for merge yet. I'm mostly trying to get the comments in place. It makes sense to move that to a Google Doc and link it here.
e51c8e6 to 5a8acf1
5a8acf1 to 408ad2d
285d009 to f7adbbc
f7adbbc to 3e91ce9
This needs some notes in:
- Appendix B: We should define something or state explicitly that we don't define it for variant.
- Appendix C: We'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.
- Sort Orders: We should probably note that you cannot sort on a Variant?
- Appendix D: Single-value serialization needs an entry; we can probably write "Not supported" for now. JSON needs an entry.
3e91ce9 to 0bc975e
Thanks @RussellSpitzer, I missed those sections and just updated them. I marked partition transforms, sorting, and hashing as not supported/allowed for now.
format/spec.md
Outdated
@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

Note:

1. `variant` columns are not valid for sorting.
I thought whether a variant is orderable is determined by engines, per the previous discussion. Are we explicitly saying that variants are not orderable at all?
@flyrain Yeah. The engines can actually support ordering. Here we basically will not define the sort order. I'm wondering if we should define one in the future.
So I will update to "The ability to sort variant columns and the specific sort order is determined by the engines."
We are only saying that Variant values cannot be present in an Iceberg sort order. Engines can sort if they choose.
We could also define a sort order if we wanted, but this seems like a good place to start.
Do we need to explicitly say "Variant values cannot be present in an Iceberg sort order" in the spec?
I think we should specifically write that. Don't we do so for Maps and Arrays?
I don't see that we call out anything for Maps and Arrays in the sorting section, but the code has a check that errors out with "Cannot sort by non-primitive source field".
So I haven't added anything explicit for Variant either (just like Maps and Arrays), but let me know if that makes sense.
6673520 to 3aabac4
format/spec.md
Outdated
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`void`** | Always produces `null` | Any | Source type or `int` |
Let's revert the whitespace changes, please. It makes these tables hard to maintain and less readable.
format/spec.md
Outdated
@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

Note:

1. The ability to sort `variant` columns and the specific sort order is determined by the engines.
Do we need this? I think anything we don't specify is up to engines already.
OK. I will remove that then. Do we need to call out "Variant values cannot be present in an Iceberg sort order"?
I think we should specifically forbid sort orders containing a variant. I think we actually are underdetermined in the spec here.
We have the following checks in the Reference Implementation
iceberg/api/src/main/java/org/apache/iceberg/SortOrder.java
Lines 301 to 311 in 8a16a41
ValidationException.check(
    sourceType != null, "Cannot find source column for sort field: %s", field);
ValidationException.check(
    sourceType.isPrimitiveType(),
    "Cannot sort by non-primitive source field: %s",
    sourceType);
ValidationException.check(
    field.transform().canTransform(sourceType),
    "Invalid source type %s for transform: %s",
    sourceType,
    field.transform());
So currently, even though we don't specify this here, you cannot make a sort order with array or map. I think we should explicitly call this out and add variant as well. My real concern here is that we add the ability to sort on something but don't define what that sorting actually looks like.
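To make the concern concrete, here is a minimal, self-contained sketch of the kind of check described above. It is not the Iceberg implementation: `TypeKind`, `checkSortField`, and `isValidSortField` are hypothetical names, and treating `variant` as non-primitive is an assumption made for illustration.

```java
final class SortOrderCheckSketch {
  // Hypothetical stand-in for a type hierarchy; only the "is primitive" flag matters here.
  // Marking VARIANT as non-primitive is an assumption for this sketch.
  enum TypeKind {
    INT(true), STRING(true), STRUCT(false), LIST(false), MAP(false), VARIANT(false);

    final boolean primitive;

    TypeKind(boolean primitive) {
      this.primitive = primitive;
    }
  }

  // Mirrors the first two checks quoted above: the source column must exist
  // and must be a primitive type.
  static void checkSortField(String fieldName, TypeKind sourceType) {
    if (sourceType == null) {
      throw new IllegalArgumentException(
          "Cannot find source column for sort field: " + fieldName);
    }
    if (!sourceType.primitive) {
      throw new IllegalArgumentException(
          "Cannot sort by non-primitive source field: " + sourceType);
    }
  }

  // Convenience wrapper that reports validity instead of throwing.
  static boolean isValidSortField(String fieldName, TypeKind sourceType) {
    try {
      checkSortField(fieldName, sourceType);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isValidSortField("id", TypeKind.INT));
    System.out.println(isValidSortField("payload", TypeKind.VARIANT));
  }
}
```

Under these assumptions a variant column is rejected by the same rule that already rejects lists and maps, which is why stating the restriction in the spec would only document existing behavior.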
format/spec.md
Outdated
@@ -178,6 +178,11 @@ A **`list`** is a collection of values with some element type. The element field

A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

### Semi-structured Types

A **`variant`** is a type to represent semi-structured data. A variant value can store a value of any other type, including `null`, any primitive, struct, list or map value. The variant encoding is defined the [Apache Parquet Project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Variant type is added in [v3](#version-3).
The other cases are more specific about what is present, rather than what is being represented. Also, I don't think that the description is accurate. A variant cannot store maps. I would rather state clearly what a variant stores so that there is no ambiguity.
How about this instead?
A `variant` is a binary value that encodes semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the Parquet project. Support for Variant is added in Iceberg v3.
Variants are similar to JSON with a wider set of primitive values including date, timestamp, timestamptz, binary, and floating points.
Variant values may contain nested types:
- An array is an ordered collection of variant values
- An object is a collection of fields that are a string key and a variant value
As a semi-structured type, there are important differences between variant and Iceberg's other types:
- Variant arrays are similar to lists, but may contain any variant value rather than a fixed element type
- Variant objects are similar to structs, but may contain variable fields identified by name and field values may be any variant value rather than a fixed field type
- Variant primitives are narrower than Iceberg's primitive types: uuid, time, fixed(L), and nanosecond precision timestamp(tz) are not supported
I see. We need to define exactly how and what can be stored in a Variant.
The variant actually uses two binary values. Should we mention that rather than "a binary value"?
I changed it to "A `variant` is a value that stores semi-structured data". As I understand from the other definitions, we don't need to mention how values are stored, only what can be stored, so I removed "binary".
format/spec.md
Outdated
@@ -1133,6 +1142,7 @@ Hash results are not dependent on decimal scale, which is part of the type, not
4. UUIDs are encoded using big endian. The test UUID for the example above is: `f79c3e09-677c-4bbd-a479-3f349cb785e7`. This UUID encoded as a byte array is:
`F7 9C 3E 09 67 7C 4B BD A4 79 3F 34 9C B7 85 E7`
5. `doubleToLongBits` must give the IEEE 754 compliant bit representation of the double value. All `NaN` bit patterns must be canonicalized to `0x7ff8000000000000L`. Negative zero (`-0.0`) must be canonicalized to positive zero (`0.0`). Float hash values are the result of hashing the float cast to double to ensure that schema evolution does not change hash values if float types are promoted.
6. `variant` values are currently not valid for bucketing and so they are not hashed.
I don't think Notes is a good place for this. Can you start a new paragraph about unhashed types?
A 32-bit hash is not defined for `variant` because there are multiple representations for equivalent values.
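The "multiple representations" point can be illustrated with a small sketch. Everything in it is an assumption for illustration only: the byte arrays are made-up encodings of the integer 1 at two widths (one header byte followed by a little-endian payload) and are not quoted from the Variant encoding spec, `decode` is a hypothetical helper, and `Arrays.hashCode` stands in for the spec's 32-bit hash.

```java
import java.util.Arrays;

final class VariantHashSketch {
  // Illustrative (assumed) encodings of the integer 1: header byte, then little-endian payload.
  static final byte[] ONE_AS_INT8 = {0x0C, 0x01};
  static final byte[] ONE_AS_INT16 = {0x10, 0x01, 0x00};

  // Hypothetical decoder: skips the header byte and reads a signed little-endian integer.
  static long decode(byte[] encoded) {
    long value = 0;
    for (int i = encoded.length - 1; i >= 1; i--) {
      value = (value << 8) | (encoded[i] & 0xFF);
    }
    int shift = 64 - 8 * (encoded.length - 1);
    return (value << shift) >> shift; // sign-extend from the payload width
  }

  // Stand-in for a byte-based 32-bit hash; not the hash function used by the spec.
  static int byteHash(byte[] encoded) {
    return Arrays.hashCode(encoded);
  }

  public static void main(String[] args) {
    System.out.println(decode(ONE_AS_INT8) == decode(ONE_AS_INT16));
    System.out.println(byteHash(ONE_AS_INT8) == byteHash(ONE_AS_INT16));
  }
}
```

In this sketch the two encodings decode to the same logical value but hash differently, so a hash over the raw bytes would place equal values in different buckets unless a canonical form were defined first.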
format/spec.md
Outdated
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`variant`** | Not supported |
@RussellSpitzer, didn't we want to say that this should be a Variant value that contains a Variant of a value for each shredded column? I don't want to miss this in v3 or else we won't be able to do file skipping.
We will need to specify how to handle the metadata (concat?)
Yes, I agree, but currently we don't have any info about shredding, so I didn't want to include it yet.
format/spec.md
Outdated
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |
| **`variant`** | **`Same JSON representation in this table for stored type`** | `null`, `true`, `{"1": 1, "2": "bar"}` | The JSON representation matches the format shown in this table for the type stored in the Variant. |
I think this is insufficient because it loses type information. For example, if there is a Variant that contains a timestamp, the timestamp's type is lost. This also appears to use the struct representation for an object, but a Variant has no field IDs.
I think that we probably do want to use a JSON representation, but we will need to make sure that it aligns with the JSON conversion defined in the Variant spec (https://github.com/apache/parquet-format/pull/461/files#diff-80a56bf0d841087ab5038020e4b78119d6ceb44684719ba4d3a6e22effb36eb9R459) and that we have defined requirements for recovering the types.
Note that the JSON conversion in the Variant spec differs from this section:
- Binary values are hexadecimal strings here (:facepalm:) and base64 in Variant
- Decimal values are strings here and numbers in Variant
We may want to not allow Variant in JSON because of the type loss. Or we may want to specify a subset that can be recovered, like booleans, integers, floats, strings, and arrays. There are two places where this section is used: in the REST protocol for sending lower/upper bounds and in default values. We don't really need default values for variants (but would have to disallow them) and lower/upper bounds would generally work with a subset.
The other option is to encode `variant` as a base64 binary value. That is a bit ugly in JSON, but it is not lossy.
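As a rough sketch of the base64 option, the following shows a lossless round trip through a JSON string using `java.util.Base64`. The class and method names are hypothetical, and the input is treated as an opaque byte array; how the variant metadata and value would be combined into those bytes is left open here, as it is in the discussion.

```java
import java.util.Arrays;
import java.util.Base64;

final class VariantJsonBase64Sketch {
  // Wraps the base64 text in JSON string quotes. The standard base64 alphabet
  // contains no quote or backslash characters, so no JSON escaping is needed.
  static String toJsonString(byte[] variantBytes) {
    return "\"" + Base64.getEncoder().encodeToString(variantBytes) + "\"";
  }

  // Strips the surrounding quotes and decodes back to the original bytes.
  static byte[] fromJsonString(String json) {
    String body = json.substring(1, json.length() - 1);
    return Base64.getDecoder().decode(body);
  }

  public static void main(String[] args) {
    // Arbitrary placeholder bytes; not a real variant encoding.
    byte[] variantBytes = {0x01, 0x00, 0x0C, 0x2A};
    String json = toJsonString(variantBytes);
    System.out.println(json);
    System.out.println(Arrays.equals(variantBytes, fromJsonString(json)));
  }
}
```

Because the bytes survive the round trip unchanged, any type information carried in the binary encoding (for example, that a value is a timestamp rather than a string) is preserved, at the cost of the JSON no longer being human-readable.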
I don't exactly understand how this gets used. I will change it to say "The JSON representation uses the encoding defined in the Variant spec (https://github.com/apache/parquet-format/pull/461/files#diff-80a56bf0d841087ab5038020e4b78119d6ceb44684719ba4d3a6e22effb36eb9R459)". Would that work for lower/upper bounds usage?
I notice that SingleValueParser.fromJson(Type type, JsonNode defaultValue) and toJson(Type type, Object defaultValue, JsonGenerator generator) would error out for Variant. Is that sufficient to disallow defaultValue?
@aihuaxu, I think there are a couple of things missing:
Oops. I didn't mean to close this.
636611f to 7023543
Thanks @rdblue. I thought we would make those changes when we start to work on Avro/ORC. I added that. I don't have much context on the JSON conversion, so I'm not sure if we need to add more info.
7023543 to d953b6e
d953b6e to 67df611
@@ -1110,6 +1125,7 @@ Maps with non-string keys must use an array representation with the `map` logica
|**`struct`**|`record`||
|**`list`**|`array`||
|**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.|
|**`variant`**|`record with `metadata` and `value` fields`|Shredding is not supported in Avro.|
I'd leave out the shredding note til we define it and support it.
Sorry if this was requested somewhere else, but I'd keep any mention of "shredding" out till we add it in another PR.
@@ -1287,6 +1307,7 @@ Types are serialized according to this table:
|**`struct`**|`JSON object: {`<br /> `"type": "struct",`<br /> `"fields": [ {`<br /> `"id": <field id int>,`<br /> `"name": <name string>,`<br /> `"required": <boolean>,`<br /> `"type": <type JSON>,`<br /> `"doc": <comment string>,`<br /> `"initial-default": <JSON encoding of default value>,`<br /> `"write-default": <JSON encoding of default value>`<br /> `}, ...`<br /> `] }`|`{`<br /> `"type": "struct",`<br /> `"fields": [ {`<br /> `"id": 1,`<br /> `"name": "id",`<br /> `"required": true,`<br /> `"type": "uuid",`<br /> `"initial-default": "0db3e2a8-9d1d-42b9-aa7b-74ebe558dceb",`<br /> `"write-default": "ec5911be-b0a7-458c-8438-c9a3e53cffae"`<br /> `}, {`<br /> `"id": 2,`<br /> `"name": "data",`<br /> `"required": false,`<br /> `"type": {`<br /> `"type": "list",`<br /> `...`<br /> `}`<br /> `} ]`<br />`}`|
|**`list`**|`JSON object: {`<br /> `"type": "list",`<br /> `"element-id": <id int>,`<br /> `"element-required": <bool>`<br /> `"element": <type JSON>`<br />`}`|`{`<br /> `"type": "list",`<br /> `"element-id": 3,`<br /> `"element-required": true,`<br /> `"element": "string"`<br />`}`|
|**`map`**|`JSON object: {`<br /> `"type": "map",`<br /> `"key-id": <key id int>,`<br /> `"key": <type JSON>,`<br /> `"value-id": <val id int>,`<br /> `"value-required": <bool>`<br /> `"value": <type JSON>`<br />`}`|`{`<br /> `"type": "map",`<br /> `"key-id": 4,`<br /> `"key": "string",`<br /> `"value-id": 5,`<br /> `"value-required": false,`<br /> `"value": "double"`<br />`}`|
| **`variant`**| `JSON string: "variant"`|`"variant"`|
This should link to a well-defined JSON translation of the Variant type, which I think should live in the Parquet spec.
@@ -1436,6 +1457,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`variant`** | Not supported |
I do agree this should be "Not supported" for now. Then, when shredding is included, we can say something like: for shredded variants only, a binary value that is the concatenation of metadata and value plus a separator byte, or something. We can figure that out with the shredding addition, though.
@@ -1462,6 +1484,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |
| **`variant`** | **`JSON string`** | `"rO0ABXVyAANbW0JL/RkVZ2fbNwIAAHhwAAAAAnVyAAJbQqzzF/gGCFTgAgAAeHAAAAAMAQIABAdu"` | Stores base64-encoded variant value |
Don't we have to explain how we combine the metadata and value here as well?
Help: #10392
Spec: add variant type
Proposal: https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
This is to lay out the spec for the variant type. The specs are placed in the Parquet project (see the variant spec and the shredding spec).