-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Add Variant data type #11324
API: Add Variant data type #11324
Conversation
api/src/test/java/org/apache/iceberg/types/TestSerializableTypes.java
Outdated
Show resolved
Hide resolved
3a3282d
to
15d8ed6
Compare
15d8ed6
to
15b2b71
Compare
ca226ee
to
1d3056b
Compare
1d3056b
to
d4af2b3
Compare
@@ -534,6 +534,7 @@ private static int estimateSize(Type type) { | |||
case FIXED: | |||
return ((Types.FixedType) type).length(); | |||
case BINARY: | |||
case VARIANT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for this size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't have the accurate size for Variant similar to Binary. So I use the same value as Binary. I'm wondering how we come up with 80 for Binary.
@@ -562,7 +563,7 @@ private static String sanitize(Literal<?> literal, long now, int today) { | |||
} else if (literal instanceof Literals.DoubleLiteral) { | |||
return sanitizeNumber(((Literals.DoubleLiteral) literal).value(), "float"); | |||
} else { | |||
// for uuid, decimal, fixed, and binary, match the string result | |||
// for uuid, decimal, fixed, variant, and binary, match the string result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is okay but we may be missing information by not sanitizing based on the variant's type (i.e. date) and it would be nice to have some idea of the structure in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting something like
{
foo:
bar: 3
baz:
bozz: "flew"
}
to
{
(hash-foo):
(hash-bar) : (1 digit number)
(hash-baz) :
(hash-bozz) : (hash-xaxa)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a nice feature but probably ok for it's own issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for helping me understand the concept. I have filed a followup issue #11479 and will work on separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually to support this, we need to have defined the interface VariantLike
on how to access the subfields and then recursively sanitize the values based on the subfield types.
When we promote the VariantLike to API, then we can add such implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think this needs to be done right away, but being able to have some representation that goes into the variant types would be great.
d4af2b3
to
5c07698
Compare
327736b
to
82f2e7b
Compare
@@ -1687,6 +1687,44 @@ public void testV3TimestampNanoTypeSupport() { | |||
3); | |||
} | |||
|
|||
@Test | |||
public void testV3VariantTypeSupport() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is copying a lot of tests in this class but we should start future proofing a bit more imho. We also have tests around this sort of thing in TestSchema.java. I think it is probably ok to just keep all of our schema validation tests there, but it wouldn't hurt to have some redundancy here as well.
Refactoring the whole suite can come in another pr but I think we should build a templated test that's something like
@ParameterizedTest
@ValueSource(types = {Types.TimetstampNanos, Types.Variant, ....})
testTypeSupport(Type type) {
Schema schemaWithType = new Schema(
Types.NestedField.required(1, "id", Types.LongType.get()),
Types.NestedField.optional(2, type.name, type),
Types.NestedField.optional(3, "arr", Types.ListType.ofRequired(4, type)),
Types.NestedField.required(5, "struct",
Types.StructType.of(
Types.NestedField.optional(6, "inner_" + type.name, type),
Types.NestedField.required(7, "data", Types.StringType.get()))),
Types.NestedField.optional(8, "struct_arr",
Types.StructType.of(
Types.NestedField.optional(9, "ts", type))));
//Psuedo code here
from 0 -> MIN_FORMAT_VERSION.get(type)
fail to make metadata
from MIN_FORMAT_VERSION.get(type) -> SUPPORTED_TABLE_VERSION
succeed
}
The most important part about this is that we wouldn't have to continually update tests every time a new valid metadata version is added. It also would be much easier to test type compatibility. (I'm thinking that Geo is going to need the exact same thing soon)
For this PR I think it is enough to write a parameterized version for just Variant, then we could raise another PR to add in nanos and remove the redundant tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not this is part of my goal to remove all tests that have V3 or V2 in their title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also todo add Variant to Schema.MIN_FORMAT_VERSION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RussellSpitzer for already creating a PR to address that. I will include that when it's merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for moving this to TestSchema
. The test here should really just be validating that schemas are validated against the table version, but the actual validation tests should be in TestSchema
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets just drop this test then (or add a single variant field into the above test) and just change the above test to be called testV3SchemaValidation or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the actual validation for both variant and timestamp_ns since it's already covered in TestSchema. I added version check/validation coverage for TableMetadata.newTableMetadata().
14665cb
to
ca2b5de
Compare
@@ -38,6 +38,9 @@ class Identity<T> implements Transform<T, T> { | |||
*/ | |||
@Deprecated | |||
public static <I> Identity<I> get(Type type) { | |||
Preconditions.checkArgument( | |||
type.typeId() != Type.TypeID.VARIANT, "Unsupported type for identity: %s", type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to make sure that canTransform
doesn't accept variants. We could special case it, but I'm actually wondering if it makes sense for variant to be a primitive type because it is only primitive sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do only need to special case it if we make Variant a primitive. I'm not sure what primitive really means in this context though. I think anything that may or may not be scalar is probably not a primitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the test coverage in TestBucketing.java TestIdentity.java to make canTransform
doesn't accept variants.
@@ -412,6 +413,29 @@ public String toString() { | |||
} | |||
} | |||
|
|||
public static class VariantType extends PrimitiveType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that variant should be either a primitive type or a nested type. There are at least a few places where making variant a primitive creates problems. For example, Schema
checks that a field is a primitive to allow it as an identifier field and the Identity
transform will bind to a type if it is a primitive. There are also places that check whether a type is nested and make decisions based on whether there are sub-fields.
Neither one of those seems appropriate to me so I'd suggest making this extend just Type
. Then we can adjust the logic in those places without assuming either one and potentially introducing bugs when the assumption is bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have isPrimitiveType
() to return false. I was debating on that - if VariantType extends PrimitiveType, some internal change would be easier. But concept-wise t should be from Type directly and let me change it.
@@ -312,4 +314,15 @@ public void testAddPartitionFieldsWithAndWithoutFieldIds() { | |||
assertThat(spec.fields().get(2).fieldId()).isEqualTo(1006); | |||
assertThat(spec.lastAssignedFieldId()).isEqualTo(1006); | |||
} | |||
|
|||
@Test | |||
public void testVariantUnsupported() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other transforms? None of them are allowed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I have the coverage in TestIdentity.java and TestBucketing.java and I think I have checked all the transforms.
|
||
@Override | ||
public boolean isPrimitiveType() { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that although this extends PrimitiveType
, it reports that it is not one. I think it would be safer not to extend PrimitiveType
. Good that this is already used to get the tests passing.
… full implementation. Block Transforms, SortOrder
ca2b5de
to
5eebb6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I think we could clean up TestTableMetadata more (i left a comment there) but we can always push that out as well. I mean to take a pass on cleaning some of that up later as well.
9b77a5f
to
4050ac0
Compare
4050ac0
to
1e2fc66
Compare
@rdblue I think you are the last reviewer on this, do you have any further comments for this one? |
@@ -412,6 +412,41 @@ public String toString() { | |||
} | |||
} | |||
|
|||
public static class VariantType implements Type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aihuaxu, could you also implement the same Java serialization that we use for PrimitiveType
? The existing PrimitiveHolder
should work:
Object writeReplace() throws ObjectStreamException {
return new PrimitiveHolder(toString());
}
Late +1 from me! Thanks @aihuaxu! |
Add Variant data type to API module. We only add limited required type to API for now and we will promote stable methods to API after full implementation to the interface
VariantLike
later.Fix #11178.