
Simplify Variant shredding and refactor for clarity #461

Open

rdblue wants to merge 9 commits into master
Conversation

@rdblue (Contributor) commented Oct 20, 2024

Rationale for this change

Updates to the Variant and shredding specs based on a thorough review.

What changes are included in this PR?

Spec updates, mostly to the shredding spec to minimize it and make it clear. This also attempts to make the variant spec more consistent (for example, by using `value` in both).

  • Removes object and array in favor of always using typed_value
  • Makes list element and object field groups required to avoid unnecessary null cases
  • Separates cases for primitives, arrays, and objects
  • Adds individual examples for primitives, arrays, and objects
  • Adds Variant to Parquet type mapping for shredded columns
  • Clarifies that metadata must be valid for all variant values without modification
  • Updates reconstruction algorithm to be more pythonic

Do these changes have PoC implementations?

No.

@@ -94,7 +112,7 @@ Each `offset` is a little-endian value of `offset_size` bytes, and represents th
The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
The last part of the metadata is `bytes`, which stores all the string values in the dictionary.
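To make that layout concrete, here is a minimal Python sketch of decoding the dictionary. The one-byte header (with `offset_size - 1` in bits 6-7) and the `dictionary_size` field are assumptions based on the surrounding encoding spec, and the function name is illustrative:

```python
def decode_metadata_dictionary(metadata: bytes) -> list[str]:
    """Sketch: decode the metadata string dictionary described above."""
    # Assumed layout: 1-byte header, then dictionary_size, then
    # dictionary_size + 1 offsets, then the concatenated string bytes.
    offset_size = ((metadata[0] >> 6) & 0b11) + 1  # 1..4 bytes per offset

    def read_le(pos: int) -> int:
        return int.from_bytes(metadata[pos:pos + offset_size], "little")

    dictionary_size = read_le(1)
    offsets_pos = 1 + offset_size
    # The first offset is always 0; the last is the total length of bytes.
    offsets = [read_le(offsets_pos + i * offset_size)
               for i in range(dictionary_size + 1)]
    bytes_pos = offsets_pos + (dictionary_size + 1) * offset_size
    return [metadata[bytes_pos + offsets[i]:bytes_pos + offsets[i + 1]].decode("utf-8")
            for i in range(dictionary_size)]
```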

```diff
-## Metadata encoding grammar
+### Metadata encoding grammar
```
@rdblue (Contributor, Author) commented:

I've updated these to not use more than one H1, which can cause issues with TOC. Pages should have just one H1.

Member replied:
TIL as well

We extract all homogeneous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
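As an illustration of that split, here is a hedged Python sketch of a writer routing each data item of a path into `typed_value` or `variant_value`. The helper name and the use of decoded Python values (rather than binary Variant data) are assumptions, not spec API:

```python
def split_for_shredding(items, shredded_type=int):
    """Sketch: route homogeneous items to typed_value, the rest to variant_value."""
    typed_column, variant_column = [], []
    for item in items:
        if isinstance(item, shredded_type):
            typed_column.append(item)
            variant_column.append(None)
        else:
            typed_column.append(None)
            variant_column.append(item)  # stored as binary Variant in practice
    return typed_column, variant_column

# split_for_shredding([34, "n/a", 35]) ->
#   typed_value:   [34, None, 35]
#   variant_value: [None, "n/a", None]
```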
@rdblue (Contributor, Author) commented:

This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.

@sfc-gh-aixu replied:

I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?

@rdblue (Contributor, Author) replied:

@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names, so that readers do not need to rewrite the metadata (and values) to add them back.

For example, consider an event that looks like this:

```json
{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}
```

And a shredding schema:

```
optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}
```

The top-level event_type and event_timestamp fields are shredded. But this is saying that the Variant metadata must still include those field names. That ensures the existing binary metadata can be returned to the engine without adding event_type and event_timestamp entries when those fields are merged back into the top-level Variant value for a full projection.
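Concretely, the metadata dictionary written for the event above would still contain every field name, shredded or not (the ordering below is illustrative):

```python
# Shredded names (event_type, event_timestamp) stay in the dictionary so
# the stored metadata can be reused unmodified when the full Variant is
# reconstructed.
metadata_dictionary = ["id", "event_type", "event_timestamp", "payload", "a", "b"]
```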


Similarly, the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value`, or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.

A reviewer commented:

Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.

@rdblue (Contributor, Author) replied:

The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.

In addition, every optional level in Parquet introduces another definition level. That adds up quickly with nested structures and ends up taking unnecessary space.

@@ -33,176 +33,239 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
Member commented:

Another place where I'd like to remove some of the text. My main goal is to reduce the amount of text in the spec.

@rdblue (Contributor, Author) replied:

I reduced this, but I don't think it's a problem to have a bit of context that answers the question "What is shredding and why do I care?"

The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
@rdblue (Contributor, Author) commented Oct 24, 2024:

@RussellSpitzer and @gene-db, this could use some attention.

Here, if both value and typed_value are non-null, I initially thought it made more sense to prefer value because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.

However, this conflicts with object fields, where typed_value is preferred so that data skipping is correct. If the object's value contains a field that conflicts with a sub-field's typed_value, there is no way of knowing from field stats. If we preferred the field value stored in the object's value, then data skipping could be out of sync with the returned value in the case of a conflict.
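A minimal Python sketch of the resulting reader rule (the strictness flag and function name are assumptions):

```python
def resolve_field(value, typed_value, strict=True):
    """Sketch: resolve a shredded field when both columns could be non-null."""
    if value is not None and typed_value is not None:
        if strict:
            raise ValueError("invalid: value and typed_value are both non-null")
        return typed_value  # prefer typed_value so data skipping stays consistent
    return typed_value if typed_value is not None else value
```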

@rdblue changed the title from "WIP: Current work on Variant specs" to "Simplify Variant shredding and refactor for clarity" on Oct 24, 2024
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
A Contributor commented:

For exact numerics, we should allow truncating trailing zeros. For example, int8 value 1 and decimal(5,2) value 100 can both be represented as a JSON value 1.

Also, should the example be quoted to stay consistent?

Suggested change:

```diff
-| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
+| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |
```
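As a quick check, trailing-zero truncation preserves the numeric value, which Python's decimal module confirms:

```python
from decimal import Decimal

# decimal(5,2) value 34.00 and int8 value 34 denote the same number, so a
# JSON representation may drop the trailing zeros without loss.
assert Decimal("34.00") == Decimal("34") == 34
```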

This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.
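A hedged sketch of the reader-side compatibility check those rules imply (the names are illustrative):

```python
KNOWN_FIELDS = {"metadata", "value", "typed_value"}

def check_shredding_fields(field_names):
    """Sketch: fail on extra fields that are not reserved extensions."""
    for name in field_names:
        if name in KNOWN_FIELDS or name.startswith("_"):
            continue  # reserved names such as _metadata_key_paths are skipped
        raise ValueError(f"backwards-incompatible shredding field: {name}")
```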

A reviewer commented:

At this point, isn't non-shredded just a special case of shredded with no typed_value in the top-level struct? I think it's automatically backwards compatible.

Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |

A reviewer commented:

Just to be clear, this is only allowed for object fields, right? You mention in the array section that array elements must have one of them non-null, but I think that's also true for the top-level value/typed_value, right?


Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).
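A minimal sketch of that required-value rule, with failing versus substituting Variant null left as a caller choice (the flag name is an assumption):

```python
VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)

def required_variant(value, typed_value, fail_on_missing=True):
    """Sketch: handle a required Variant whose columns are both null."""
    if value is None and typed_value is None:
        if fail_on_missing:
            raise ValueError("required Variant value is missing")
        return VARIANT_NULL  # e.g. `measurement` above returns 00
    return typed_value if typed_value is not None else value
```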

A reviewer commented:

As mentioned in my previous comment, I think it would be invalid for measurement to have both value and typed_value be null, and that should be an error. I don't understand why we recommend returning Variant null as an option.


It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `ConstructVariant` with the top-level fields, which are assumed to be null if they are not present in the schema.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.
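To sketch the shape of that recursion, here is an illustrative Python version over decoded values rather than binary Variant data; it is a simplification under those assumptions, not the spec's `ConstructVariant`:

```python
def construct_variant(value, typed_value):
    """Sketch: recursively rebuild a full value from shredded columns."""
    if typed_value is None:
        return value  # may itself be None, meaning the value is missing
    if isinstance(typed_value, dict):  # a shredded object
        result = dict(value or {})  # start from residual, non-shredded fields
        for name, group in typed_value.items():
            field = construct_variant(group.get("value"), group.get("typed_value"))
            if field is not None:
                result[name] = field
        return result
    if isinstance(typed_value, list):  # a shredded array
        return [construct_variant(e.get("value"), e.get("typed_value"))
                for e in typed_value]
    return typed_value  # a shredded primitive
```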

A reviewer commented:

I think that here and in the array case, it would be good to clarify whether typed_value can be omitted from the schema entirely. E.g. if there's no consistent type for a field, I think we'd still want to shred the field, but put all values in value, and not require that a typed_value type be specified.

A reviewer replied:

Conversely, is value always required? Would it be valid for a writer to only create a typed_value column if it knows that all values have a predictable type that can be shredded?
