Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Add Variant data type #11324

Merged
merged 5 commits into from
Nov 15, 2024
Merged

Conversation

aihuaxu
Copy link
Contributor

@aihuaxu aihuaxu commented Oct 15, 2024

Add Variant data type to API module. We only add limited required type to API for now and we will promote stable methods to API after full implementation to the interface VariantLike later.

  • Add Variant as a new data type

Fix #11178.

@github-actions github-actions bot added the API label Oct 15, 2024
@aihuaxu aihuaxu marked this pull request as ready for review October 15, 2024 18:34
@@ -534,6 +534,7 @@ private static int estimateSize(Type type) {
case FIXED:
return ((Types.FixedType) type).length();
case BINARY:
case VARIANT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for this size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have the accurate size for Variant similar to Binary. So I use the same value as Binary. I'm wondering how we come up with 80 for Binary.

@@ -562,7 +563,7 @@ private static String sanitize(Literal<?> literal, long now, int today) {
} else if (literal instanceof Literals.DoubleLiteral) {
return sanitizeNumber(((Literals.DoubleLiteral) literal).value(), "float");
} else {
// for uuid, decimal, fixed, and binary, match the string result
// for uuid, decimal, fixed, variant, and binary, match the string result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is okay but we may be missing information by not sanitizing based on the variant's type (i.e. date) and it would be nice to have some idea of the structure in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting something like

{
 foo:
   bar: 3 
   baz: 
      bozz: "flew"
 } 

to

{ 
  (hash-foo):
     (hash-bar) : (1 digit number)
     (hash-baz) :
        (hash-bozz) : (hash-xaxa)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a nice feature but probably ok for it's own issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for helping me understand the concept. I have filed a followup issue #11479 and will work on separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually to support this, we need to have defined the interface VariantLike on how to access the subfields and then recursively sanitize the values based on the subfield types.

When we promote the VariantLike to API, then we can add such implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think this needs to be done right away, but being able to have some representation that goes into the variant types would be great.

@aihuaxu aihuaxu force-pushed the variant-type-api branch 2 times, most recently from 327736b to 82f2e7b Compare November 1, 2024 17:20
@@ -1687,6 +1687,44 @@ public void testV3TimestampNanoTypeSupport() {
3);
}

@Test
public void testV3VariantTypeSupport() {
Copy link
Member

@RussellSpitzer RussellSpitzer Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is copying a lot of tests in this class but we should start future proofing a bit more imho. We also have tests around this sort of thing in TestSchema.java. I think it is probably ok to just keep all of our schema validation tests there, but it wouldn't hurt to have some redundancy here as well.

Refactoring the whole suite can come in another pr but I think we should build a templated test that's something like

@ParameterizedTest
@ValueSource(types = {Types.TimetstampNanos, Types.Variant, ....})
testTypeSupport(Type type) {
  Schema schemaWithType =  new Schema(
          Types.NestedField.required(1, "id", Types.LongType.get()),
          Types.NestedField.optional(2, type.name, type),
          Types.NestedField.optional(3, "arr", Types.ListType.ofRequired(4, type)),
          Types.NestedField.required(5, "struct", 
            Types.StructType.of(
                  Types.NestedField.optional(6, "inner_" + type.name, type),
                  Types.NestedField.required(7, "data", Types.StringType.get()))),
          Types.NestedField.optional(8, "struct_arr",
              Types.StructType.of(
                  Types.NestedField.optional(9, "ts", type))));
     
    //Psuedo code here 
    from 0 -> MIN_FORMAT_VERSION.get(type)
       fail to make metadata
       
    from MIN_FORMAT_VERSION.get(type) -> SUPPORTED_TABLE_VERSION
       succeed
}

The most important part about this is that we wouldn't have to continually update tests every time a new valid metadata version is added. It also would be much easier to test type compatibility. (I'm thinking that Geo is going to need the exact same thing soon)

For this PR I think it is enough to write a parameterized version for just Variant, then we could raise another PR to add in nanos and remove the redundant tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not this is part of my goal to remove all tests that have V3 or V2 in their title.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also todo add Variant to Schema.MIN_FORMAT_VERSION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RussellSpitzer for already creating a PR to address that. I will include that when it's merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for moving this to TestSchema. The test here should really just be validating that schemas are validated against the table version, but the actual validation tests should be in TestSchema.

Copy link
Member

@RussellSpitzer RussellSpitzer Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just drop this test then (or add a single variant field into the above test) and just change the above test to be called testV3SchemaValidation or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the actual validation for both variant and timestamp_ns since it's already covered in TestSchema. I added version check/validation coverage for TableMetadata.newTableMetadata().

@@ -38,6 +38,9 @@ class Identity<T> implements Transform<T, T> {
*/
@Deprecated
public static <I> Identity<I> get(Type type) {
Preconditions.checkArgument(
type.typeId() != Type.TypeID.VARIANT, "Unsupported type for identity: %s", type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to make sure that canTransform doesn't accept variants. We could special case it, but I'm actually wondering if it makes sense for variant to be a primitive type because it is only primitive sometimes.

Copy link
Member

@RussellSpitzer RussellSpitzer Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do only need to special case it if we make Variant a primitive. I'm not sure what primitive really means in this context though. I think anything that may or may not be scalar is probably not a primitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the test coverage in TestBucketing.java TestIdentity.java to make canTransform doesn't accept variants.

@@ -412,6 +413,29 @@ public String toString() {
}
}

public static class VariantType extends PrimitiveType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that variant should be either a primitive type or a nested type. There are at least a few places where making variant a primitive creates problems. For example, Schema checks that a field is a primitive to allow it as an identifier field and the Identity transform will bind to a type if it is a primitive. There are also places that check whether a type is nested and make decisions based on whether there are sub-fields.

Neither one of those seems appropriate to me so I'd suggest making this extend just Type. Then we can adjust the logic in those places without assuming either one and potentially introducing bugs when the assumption is bad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have isPrimitiveType() to return false. I was debating on that - if VariantType extends PrimitiveType, some internal change would be easier. But concept-wise t should be from Type directly and let me change it.

@@ -312,4 +314,15 @@ public void testAddPartitionFieldsWithAndWithoutFieldIds() {
assertThat(spec.fields().get(2).fieldId()).isEqualTo(1006);
assertThat(spec.lastAssignedFieldId()).isEqualTo(1006);
}

@Test
public void testVariantUnsupported() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other transforms? None of them are allowed, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. I have the coverage in TestIdentity.java and TestBucketing.java and I think I have checked all the transforms.


@Override
public boolean isPrimitiveType() {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that although this extends PrimitiveType, it reports that it is not one. I think it would be safer not to extend PrimitiveType. Good that this is already used to get the tests passing.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I think we could clean up TestTableMetadata more (i left a comment there) but we can always push that out as well. I mean to take a pass on cleaning some of that up later as well.

@aihuaxu aihuaxu force-pushed the variant-type-api branch 2 times, most recently from 9b77a5f to 4050ac0 Compare November 7, 2024 21:44
@aihuaxu aihuaxu requested a review from rdblue November 7, 2024 21:50
@RussellSpitzer
Copy link
Member

@rdblue I think you are the last reviewer on this, do you have any further comments for this one?

@RussellSpitzer
Copy link
Member

Sync'd up with Ryan this morning, we are good to go.

Thanks @aihuaxu for working on this and @rdblue, @gene-db and @Fokko for reviews!

@RussellSpitzer RussellSpitzer merged commit 821aec3 into apache:main Nov 15, 2024
50 checks passed
@@ -412,6 +412,41 @@ public String toString() {
}
}

public static class VariantType implements Type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aihuaxu, could you also implement the same Java serialization that we use for PrimitiveType? The existing PrimitiveHolder should work:

    Object writeReplace() throws ObjectStreamException {
      return new PrimitiveHolder(toString());
    }

@rdblue
Copy link
Contributor

rdblue commented Nov 16, 2024

Late +1 from me! Thanks @aihuaxu!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: add Variant type to iceberg
5 participants