
Simplify Variant shredding and refactor for clarity #461

Open

rdblue wants to merge 9 commits into master
Conversation

@rdblue (Contributor) commented Oct 20, 2024

Rationale for this change

Updates to the Variant and shredding specs based on a thorough review.

What changes are included in this PR?

Spec updates, mostly to the shredding spec to minimize it and make it clear. This also attempts to make the variant spec more consistent (for example, by using `value` in both).

  • Removes object and array in favor of always using typed_value
  • Makes list element and object field groups required to avoid unnecessary null cases
  • Separates cases for primitives, arrays, and objects
  • Adds individual examples for primitives, arrays, and objects
  • Adds Variant to Parquet type mapping for shredded columns
  • Clarifies that metadata must be valid for all variant values without modification
  • Updates reconstruction algorithm to be more pythonic

Do these changes have PoC implementations?

No.

@@ -94,7 +112,7 @@ Each `offset` is a little-endian value of `offset_size` bytes, and represents th
The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
The last part of the metadata is `bytes`, which stores all the string values in the dictionary.
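To make that layout concrete, here is a minimal Python sketch of decoding the dictionary. The one-byte header (with `offset_size - 1` in bits 6-7) and the `dictionary_size` field are assumptions based on the surrounding encoding spec, and the function name is illustrative:

```python
def decode_metadata_dictionary(metadata: bytes) -> list[str]:
    """Sketch: decode the metadata string dictionary described above."""
    # Assumed layout: 1-byte header, then dictionary_size, then
    # dictionary_size + 1 offsets, then the concatenated string bytes.
    offset_size = ((metadata[0] >> 6) & 0b11) + 1  # 1..4 bytes per offset

    def read_le(pos: int) -> int:
        return int.from_bytes(metadata[pos:pos + offset_size], "little")

    dictionary_size = read_le(1)
    offsets_pos = 1 + offset_size
    # The first offset is always 0; the last is the total length of bytes.
    offsets = [read_le(offsets_pos + i * offset_size)
               for i in range(dictionary_size + 1)]
    bytes_pos = offsets_pos + (dictionary_size + 1) * offset_size
    return [metadata[bytes_pos + offsets[i]:bytes_pos + offsets[i + 1]].decode("utf-8")
            for i in range(dictionary_size)]
```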

```diff
-## Metadata encoding grammar
+### Metadata encoding grammar
```
@rdblue (Contributor, Author) commented:

I've updated these to not use more than one H1, which can cause issues with TOC. Pages should have just one H1.

Member replied:
TIL as well

We extract all homogeneous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
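As an illustration of that split, here is a hedged Python sketch of a writer routing each data item of a path into `typed_value` or `variant_value`. The helper name and the use of decoded Python values (rather than binary Variant data) are assumptions, not spec API:

```python
def split_for_shredding(items, shredded_type=int):
    """Sketch: route homogeneous items to typed_value, the rest to variant_value."""
    typed_column, variant_column = [], []
    for item in items:
        if isinstance(item, shredded_type):
            typed_column.append(item)
            variant_column.append(None)
        else:
            typed_column.append(None)
            variant_column.append(item)  # stored as binary Variant in practice
    return typed_column, variant_column

# split_for_shredding([34, "n/a", 35]) ->
#   typed_value:   [34, None, 35]
#   variant_value: [None, "n/a", None]
```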
@rdblue (Contributor, Author) commented:

This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.

@sfc-gh-aixu replied:

I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?

@rdblue (Contributor, Author) replied:

@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names, so that readers do not need to rewrite the metadata (and values) to add them back.

For example, consider an event that looks like this:

```json
{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}
```

And a shredding schema:

```
optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}
```

The top-level event_type and event_timestamp fields are shredded. But this is saying that the Variant metadata must still include those field names. That ensures the existing binary metadata can be returned to the engine without adding event_type and event_timestamp entries when those fields are merged back into the top-level Variant value for a full projection.
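Concretely, the metadata dictionary written for the event above would still contain every field name, shredded or not (the ordering below is illustrative):

```python
# Shredded names (event_type, event_timestamp) stay in the dictionary so
# the stored metadata can be reused unmodified when the full Variant is
# reconstructed.
metadata_dictionary = ["id", "event_type", "event_timestamp", "payload", "a", "b"]
```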


Similarly, the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value`, or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.

A reviewer commented:

Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.

@rdblue (Contributor, Author) replied:

The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.

In addition, every optional level in Parquet introduces another definition level. That adds up quickly with nested structures and ends up taking unnecessary space.

@@ -33,176 +33,239 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
Member commented:

Another place where I'd like to remove some of the text. My main goal is to reduce the amount of text in the spec.

@rdblue (Contributor, Author) replied:

I reduced this, but I don't think it's a problem to have a bit of context that answers the question "What is shredding and why do I care?"

The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
@rdblue (Contributor, Author) commented Oct 24, 2024:

@RussellSpitzer and @gene-db, this could use some attention.

Here, if both value and typed_value are non-null, I initially thought it made more sense to prefer value because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.

However, this conflicts with object fields, where typed_value is preferred so that data skipping is correct. If the object's value contains a field that conflicts with a sub-field's typed_value, there is no way of knowing from field stats. If we preferred the field value stored in the object's value, then data skipping could be out of sync with the returned value in the case of a conflict.
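A minimal Python sketch of the resulting reader rule (the strictness flag and function name are assumptions):

```python
def resolve_field(value, typed_value, strict=True):
    """Sketch: resolve a shredded field when both columns could be non-null."""
    if value is not None and typed_value is not None:
        if strict:
            raise ValueError("invalid: value and typed_value are both non-null")
        return typed_value  # prefer typed_value so data skipping stays consistent
    return typed_value if typed_value is not None else value
```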

@rdblue changed the title from "WIP: Current work on Variant specs" to "Simplify Variant shredding and refactor for clarity" on Oct 24, 2024
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
A Contributor commented:

For exact numerics, we should allow truncating trailing zeros. For example, int8 value 1 and decimal(5,2) value 100 can both be represented as a JSON value 1.

Also, should the example be quoted to stay consistent?

Suggested change:

```diff
-| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
+| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |
```
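As a quick check, trailing-zero truncation preserves the numeric value, which Python's decimal module confirms:

```python
from decimal import Decimal

# decimal(5,2) value 34.00 and int8 value 34 denote the same number, so a
# JSON representation may drop the trailing zeros without loss.
assert Decimal("34.00") == Decimal("34") == 34
```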

This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.
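A hedged sketch of the reader-side compatibility check those rules imply (the names are illustrative):

```python
KNOWN_FIELDS = {"metadata", "value", "typed_value"}

def check_shredding_fields(field_names):
    """Sketch: fail on extra fields that are not reserved extensions."""
    for name in field_names:
        if name in KNOWN_FIELDS or name.startswith("_"):
            continue  # reserved names such as _metadata_key_paths are skipped
        raise ValueError(f"backwards-incompatible shredding field: {name}")
```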

A reviewer commented:

At this point, isn't non-shredded just a special case of shredded with no typed_value in the top-level struct? I think it's automatically backwards compatible.

Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |

A reviewer commented:

Just to be clear, this is only allowed for object fields, right? You mention in the array section that array elements must have one of them non-null, but I think that's also true for the top-level value/typed_value, right?


Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).
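A minimal sketch of that required-value rule, with failing versus substituting Variant null left as a caller choice (the flag name is an assumption):

```python
VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)

def required_variant(value, typed_value, fail_on_missing=True):
    """Sketch: handle a required Variant whose columns are both null."""
    if value is None and typed_value is None:
        if fail_on_missing:
            raise ValueError("required Variant value is missing")
        return VARIANT_NULL  # e.g. `measurement` above returns 00
    return typed_value if typed_value is not None else value
```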

A reviewer commented:

As mentioned in my previous comment, I think it would be invalid for measurement to have both value and typed_value be null, and that should be an error. I don't understand why we recommend returning Variant null as an option.


It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `ConstructVariant` with the top-level fields, which are assumed to be null if they are not present in the schema.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.
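To sketch the shape of that recursion, here is an illustrative Python version over decoded values rather than binary Variant data; it is a simplification under those assumptions, not the spec's `ConstructVariant`:

```python
def construct_variant(value, typed_value):
    """Sketch: recursively rebuild a full value from shredded columns."""
    if typed_value is None:
        return value  # may itself be None, meaning the value is missing
    if isinstance(typed_value, dict):  # a shredded object
        result = dict(value or {})  # start from residual, non-shredded fields
        for name, group in typed_value.items():
            field = construct_variant(group.get("value"), group.get("typed_value"))
            if field is not None:
                result[name] = field
        return result
    if isinstance(typed_value, list):  # a shredded array
        return [construct_variant(e.get("value"), e.get("typed_value"))
                for e in typed_value]
    return typed_value  # a shredded primitive
```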

A reviewer commented:

I think that here and in the array case, it would be good to clarify whether typed_value can be omitted from the schema entirely. E.g. if there's no consistent type for a field, I think we'd still want to shred the field, but put all values in value, and not require that a typed_value type be specified.

A reviewer replied:

Conversely, is value always required? Would it be valid for a writer to only create a typed_value column if it knows that all values have a predictable type that can be shredded?
