Simplify Variant shredding and refactor for clarity #461
base: master
Conversation
@@ -94,7 +112,7 @@ Each `offset` is a little-endian value of `offset_size` bytes, and represents th
The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
The last part of the metadata is `bytes`, which stores all the string values in the dictionary.

-## Metadata encoding grammar
+### Metadata encoding grammar
I've updated these to not use more than one H1, which can cause issues with TOC. Pages should have just one H1.
TIL as well
VariantShredding.md (Outdated)
We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.
I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?
@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names, so that readers do not need to rewrite the metadata (and values) to add them back.
For example, consider an event that looks like this:
{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}
And a shredding schema:
optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}
The top-level `event_type` and `event_timestamp` fields are shredded. But this is saying that the Variant `metadata` must include those field names. That ensures that the existing binary metadata can be returned to the engine without adding `event_type` and `event_timestamp` fields when merging those fields into the top-level Variant `value` when the entire Variant is projected.
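As an illustration of that point (this sketch and the `collect_field_names` helper are hypothetical, not from the spec or this PR), the metadata dictionary written for the example event still carries the shredded names:

```python
# Hypothetical sketch: collect every field name that appears in the example event.
# The Variant metadata dictionary must contain all of these names, including the
# shredded "event_type" and "event_timestamp".
def collect_field_names(obj):
    names = set()
    if isinstance(obj, dict):
        for key, nested in obj.items():
            names.add(key)
            names |= collect_field_names(nested)
    return names

event = {
    "id": 102,
    "event_type": "signup",
    "event_timestamp": "2024-10-21T20:06:34.198724",
    "payload": {"a": 1, "b": 2},
}

print(sorted(collect_field_names(event)))
# ['a', 'b', 'event_timestamp', 'event_type', 'id', 'payload']
```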
VariantShredding.md (Outdated)
Similarly the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value` or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.
Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.
The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.
In addition, every level in Parquet that is optional introduces another repetition/definition level. That adds up quickly with nested structures and ends up taking unnecessary space.
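To make the definition-level point concrete, here is a rough sketch (the repetition types assume the example schema above; this is an illustration, not spec text):

```python
# In Parquet, the maximum definition level of a column equals the number of
# optional (or repeated) fields on its path. Wrapping each shredded field in an
# optional group instead of a required one adds one more level per field.
def max_definition_level(repetitions):
    return sum(1 for r in repetitions if r in ("optional", "repeated"))

# Path to event_type's typed_value with a *required* per-field group:
# event (optional) -> typed_value (optional) -> event_type (required) -> typed_value (optional)
print(max_definition_level(["optional", "optional", "required", "optional"]))  # 3

# Same path if the per-field group were optional instead:
print(max_definition_level(["optional", "optional", "optional", "optional"]))  # 4
```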
VariantShredding.md (Outdated)
@@ -33,176 +33,239 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
Another place I'd like to just remove some of the text. My main goal here is to reduce the amount of text in the spec.
I reduced this, but I don't think it's a problem to have a bit of context that answers the question "What is shredding and why do I care?"
The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of the `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
@RussellSpitzer and @gene-db, this could use some attention.
Here, if both `value` and `typed_value` are non-null, I initially thought it made more sense to prefer `value` because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.
However, this conflicts with object fields, where the value of `typed_value` is preferred so that data skipping is correct. If the object's `value` could contain a field that conflicts with a sub-field's `typed_value`, there is no way of knowing from field stats. If we preferred the field value stored in the object's `value`, then data skipping could be out of sync with the value returned in the case of a conflict.
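For concreteness, a reader's per-value handling under the rule as quoted might look like this (an illustrative Python sketch, not a reference implementation; the object-merge case is left out):

```python
# Illustrative only: resolving a single shredded (value, typed_value) pair.
def resolve(value, typed_value, strict=False):
    if value is not None and typed_value is not None:
        # Invalid per the quoted rule when either side is not an object.
        if strict:
            raise ValueError("both value and typed_value are non-null")
        return typed_value  # readers that do not fail must return typed_value
    if typed_value is not None:
        return typed_value
    return value  # may be None, meaning the value is missing
```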
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
For exact numerics, we should allow truncating trailing zeros. For example, `int8` value `1` and `decimal(5,2)` value `100` can both be represented as a JSON value `1`.

Also, should the example be quoted to stay consistent?
Suggested change:
-| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
+| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |
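To make the scale rule concrete, a small illustrative snippet (not spec text) that renders an exact numeric with exactly `scale` fraction digits and no exponent:

```python
from decimal import Decimal

# Illustrative: format an unscaled integer and its decimal scale as a JSON number
# whose fraction digits match the scale, without using exponent notation.
def to_json_number(unscaled: int, scale: int) -> str:
    value = Decimal(unscaled).scaleb(-scale)
    return f"{value:.{scale}f}"

print(to_json_number(34, 0))    # "34"    e.g. an int8 value
print(to_json_number(3400, 2))  # "34.00" e.g. a decimal(5,2) value
```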
This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.
At this point, isn't non-shredded just a special case of shredded with no `typed_value` in the top-level struct? I think it's automatically backwards compatible.
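A sketch of how a reader might apply the reserved-name rule quoted above (the helper and the field-name set are illustrative, not from the spec):

```python
# Illustrative check: underscore-prefixed names are reserved for compatible
# extensions, while any other unrecognized name is assumed to be backwards
# incompatible and must fail the read.
KNOWN_FIELDS = {"metadata", "value", "typed_value"}

def check_variant_group_fields(field_names):
    for name in field_names:
        if name in KNOWN_FIELDS or name.startswith("_"):
            continue
        raise ValueError(f"unrecognized field in Variant group: {name}")

check_variant_group_fields(["metadata", "value", "_metadata_key_paths"])  # accepted
# check_variant_group_fields(["metadata", "value", "extra"])              # would raise
```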
Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |
Just to be clear, this is only allowed for object fields, right? You mention in the array section that array elements must have one of them non-null, but I think that's also true for the top-level `value`/`typed_value`, right?
Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).
As mentioned in my previous comment, I think it would be invalid for `measurement` to have both `value` and `typed_value` be null, and it should be an error. I don't understand why we're recommending returning Variant null as an option.
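For reference, the single byte mentioned here can be built as follows (the bit layout, with the low two bits for the basic type and the remaining bits for the primitive type, is my reading of the Variant encoding; treat this as a sketch):

```python
# Variant null: basic type 0 (primitive) in the low 2 bits, primitive type 0 (null)
# in the remaining bits, giving the single header byte 0x00.
BASIC_PRIMITIVE = 0
PRIMITIVE_NULL = 0

variant_null = bytes([(PRIMITIVE_NULL << 2) | BASIC_PRIMITIVE])
assert variant_null == b"\x00"
```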
It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `ConstructVariant` with the top-level fields, which are assumed to be null if they are not present in the schema.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.
I think that here and in the array case, it would be good to clarify whether `typed_value` can be omitted from the schema entirely. E.g. if there's no consistent type for a field, I think we'd still want to shred the field, but put all values in `value`, and not require that a `typed_value` type be specified.
Conversely, is `value` always required? Would it be valid for a writer to only create a `typed_value` column if it knows that all values have a predictable type that can be shredded?
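For readers following this thread, a very rough sketch of the recursion shape being discussed; it works on plain Python values rather than the binary encoding, and the names are illustrative, not the spec's `ConstructVariant` definition:

```python
# Illustrative reconstruction: `typed_value` is either a shredded scalar or, for a
# shredded object, a dict of field name -> (value, typed_value) pairs.
def construct_variant(value, typed_value):
    if typed_value is None:
        return value                      # may be None: the Variant is missing
    if isinstance(typed_value, dict):     # shredded object
        result = dict(value) if isinstance(value, dict) else {}
        for name, (fv, ftv) in typed_value.items():
            field = construct_variant(fv, ftv)
            if field is not None:         # null/null means the field is missing
                result[name] = field
        return result
    return typed_value                    # shredded scalar: the typed column wins

# Using the earlier event example, with event_type and event_timestamp shredded:
event = construct_variant(
    {"id": 102, "payload": {"a": 1, "b": 2}},
    {"event_type": (None, "signup"),
     "event_timestamp": (None, "2024-10-21T20:06:34.198724")},
)
print(event)
```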
Rationale for this change
Updating the Variant and shredding specs from a thorough review.
What changes are included in this PR?
Spec updates, mostly to the shredding spec to minimize it and make it clear. This also attempts to make the variant spec more consistent (for example, by using `value` in both).
- `object` and `array` are dropped in favor of always using `typed_value`
- shredded field groups are `required` to avoid unnecessary null cases
- `metadata` must be valid for all variant values without modification

Do these changes have PoC implementations?
No.