Add support for date, time_millis, and timestamp_millis to AvroColumnDecoder #13070
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.
I have submitted the CLA to [email protected]. Please check.
cc @kokosing I think I've sent the CLA already but the test still failed. Did I miss something here?
@tisonkun The registering process takes a few days. We will ping cla-bot once it's registered.
@ebyhr two weeks have passed. Could I have an estimate of when your team will review this?
While working on apache/pulsar#16683, I noticed that after Trino added parametric timestamps, the decoding cases should be reviewed in another round. Related to:
@cla-bot check |
The cla-bot has been summoned, and re-checked this pull request! |
ping @ebyhr @Praveen2112 @hashhar
Why do we opt for custom encoding instead of using what Avro already provides out of the box?
See for example https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29
This custom encoding won't be readable by other tools which don't know how we've encoded the values.
@hashhar I don't get your point. This patch follows the Avro spec and adapts to Trino's internal representation design. Trino's
And yes, it breaks downstream adoption as in apache/pulsar@c418395. This is introduced by how we implement the parametric timestamp type 21e3ddf, and is out of the scope of this patch.
But for other Avro readers to be able to get a timestamp value without manually decoding it, the Avro schema needs to annotate the field with a logical type.
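(For reference, a minimal sketch of annotating a field with Avro's built-in logical types from Java, using the standard org.apache.avro.LogicalTypes API; this is illustrative and not code from this PR.)

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class LogicalTypeSchemaExample
{
    public static Schema recordWithTimestampField()
    {
        // Attach the timestamp-millis logical type to a plain long field,
        // so any Avro reader can recognize it as a timestamp without custom decoding.
        Schema timestampMillis = LogicalTypes.timestampMillis()
                .addToSchema(Schema.create(Schema.Type.LONG));

        return SchemaBuilder.record("event").fields()
                .name("created_at").type(timestampMillis).noDefault()
                .endRecord();
    }
}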
I'm a bit confused about this now. I probably need to re-read the existing code more thoroughly and re-review later. In the meantime I'll defer to @Praveen2112.
ping @ebyhr @hashhar @Praveen2112 |
Can we squash all the commits into one?
…Decoder
squash. handle exactly time_millis and timestamp_millis type
squash. test avro decoder
squash. test avro schema converter
squash. handle DateType when serialize record
squash. fix style and revert trino-kafka changes
squash. revert all trino-kafka changes
squash. Update lib/trino-record-decoder/src/main/java/io/trino/decoder/avro/AvroColumnDecoder.java
squash. pass logical type in tests
Signed-off-by: tison <[email protected]>
(force-pushed from a5d7dbe to 54dab35)
@Praveen2112 thanks for your reviews! Updated and squashed.
public void testIntDecodedAsTimeMillis()
{
    DecoderTestColumnHandle row = new DecoderTestColumnHandle(0, "row", TIME_MILLIS, "id", null, null, false, false, false);
    Map<DecoderColumnHandle, FieldValueProvider> decodedRow = buildAndDecodeColumn(row, "id", "{\"type\":\"int\",\"logicalType\":\"time-millis\"}", 100);
How about time-micros? I guess that is also a valid Avro logical type? Similarly for timestamp.
They can work. Originally, I prepared this patch only to cover the Pulsar Trino Plugin's use case. Since the review process has already taken so long, I'd prefer to get a good result on the millis part first, and then I may continue with the micros part. Otherwise it keeps extending the scope and I don't see an end.
BTW, Trino itself handled layout changes of TimestampType and TimeType in different PRs, so I don't think it's a requirement that we must do it all at once.
I do agree we can have different PRs for each dedicated data type. But is there a way to differentiate whether the data type is time-micros or time-millis? If a data type is annotated with the logical type time-micros, how do we restrict decoding that data type?
It's the converter's responsibility to pass the correct Trino type. AvroColumnDecoder accepts the Trino type the caller passes to it.
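(A hypothetical sketch of the caller-side mapping being described, e.g. in something like PulsarAvroRowDecoderFactory; the class and method names here are made up for illustration and are not the plugin's actual code.)

import io.trino.spi.type.DateType;
import io.trino.spi.type.TimeType;
import io.trino.spi.type.TimestampType;
import io.trino.spi.type.Type;
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

final class LogicalTypeMappingSketch
{
    private LogicalTypeMappingSketch() {}

    // Inspect the Avro logical type and return the matching Trino type, so that
    // AvroColumnDecoder is only ever constructed with a type it declares as supported.
    static Type toTrinoType(Schema schema)
    {
        LogicalType logicalType = schema.getLogicalType();
        if (logicalType instanceof LogicalTypes.Date) {
            return DateType.DATE;
        }
        if (logicalType instanceof LogicalTypes.TimeMillis) {
            return TimeType.TIME_MILLIS;
        }
        if (logicalType instanceof LogicalTypes.TimestampMillis) {
            return TimestampType.TIMESTAMP_MILLIS;
        }
        throw new IllegalArgumentException("Unsupported logical type: " + logicalType);
    }
}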
public void testLongDecodedAsTimestampMillis()
{
    DecoderTestColumnHandle row = new DecoderTestColumnHandle(0, "row", TIMESTAMP_MILLIS, "id", null, null, false, false, false);
    Map<DecoderColumnHandle, FieldValueProvider> decodedRow = buildAndDecodeColumn(row, "id", "{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}", 1658463479L);
I'm not sure this is the correct way to test it. Since we pass the value to be serialized as a long, Avro serializes/deserializes it as a long. But per the specification, it looks like for Java we would pass the input as an Instant for the timestamp logical type, register the conversion for GenericRecord, and serialize the data, so that when we deserialize we would get an Instant instead of a long (we could do the conversion manually). A similar mapping/conversion is implemented for the Java classes generated by Avro.
GenericData.get().addLogicalTypeConversion(new TimeConversions.TimestampMillisConversion());
Type conversions can be registered using snippets like the one above.
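(A minimal sketch of the kind of setup being suggested: register the conversion on a GenericData model and hand that model to the writer. The wrapper class is only for illustration.)

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.data.TimeConversions;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;

class ConversionSetupSketch
{
    static GenericDatumWriter<Object> timestampMillisWriter()
    {
        // Register the logical-type conversion once on the model...
        GenericData model = new GenericData();
        model.addLogicalTypeConversion(new TimeConversions.TimestampMillisConversion());

        // ...and hand that model to the writer, so an Instant field value is converted
        // to the underlying long (epoch millis) during serialization.
        Schema timestampMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
        return new GenericDatumWriter<>(timestampMillis, model);
    }
}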
Avro serializes/deserializes it as a long, but per the specification it looks like for Java we would pass the input as an Instant for the timestamp logical type
Yes. To support all the possible cases, there's a lot of work to do. I don't think one PR should handle all of it, but you can try it out. I'd say the patch here can resolve the specific issue of bumping the Trino version for the Pulsar Trino Plugin.
GenericData.get().addLogicalTypeConversion(new TimeConversions.TimestampMillisConversion());
What does this mean? If I understand correctly, you're suggesting one more possible representation of an Avro timestamp. Then my answer is as above.
Yeah, I agree, but I'm not sure this is the correct approach. One major thing that bugs me is that we manually convert from a long to a microsecond-precision-compatible format, but ideally that should happen as part of the Avro library, which does support it. One additional benefit is that if Avro modifies its internal representation, it shouldn't affect us.
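(For concreteness, the manual conversion under discussion is roughly of this shape; an illustrative sketch, not the PR's exact code, assuming the scaling constants from io.trino.spi.type.Timestamps.)

import static io.trino.spi.type.Timestamps.MICROSECONDS_PER_MILLISECOND;
import static io.trino.spi.type.Timestamps.PICOSECONDS_PER_MILLISECOND;

final class ManualScalingSketch
{
    private ManualScalingSketch() {}

    // timestamp-millis: Avro hands the decoder a raw epoch-millis long,
    // while a short TIMESTAMP value in Trino is stored as epoch micros.
    static long timestampMillisToTrino(long epochMillis)
    {
        return epochMillis * MICROSECONDS_PER_MILLISECOND;
    }

    // time-millis: Avro hands the decoder an int of millis-of-day,
    // while a short TIME value in Trino is stored as picos-of-day.
    static long timeMillisToTrino(int millisOfDay)
    {
        return ((long) millisOfDay) * PICOSECONDS_PER_MILLISECOND;
    }
}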
but ideally that should happen as part of the Avro library, which does support it
Can you show me how Avro "does support it"? I'd be glad to adopt that approach.
This method shows how Avro implements it:
https://github.com/apache/avro/blob/a2c111dc3581f9549b173f7c6a9073cfbf173558/lang/java/avro/src/test/java/org/apache/avro/generic/TestGenericLogicalTypes.java#L368
Thank you! I'm looking into it this week :)
@Praveen2112 after a review, I think the conversion is inevitable because Trino has its own representation, introduced by:
I took a look at TimeConversions, but it seems to be meant for on-the-fly in-memory data processing, not for use during deserialization. Pulsar Trino Plugin users have a use case where the actual payload is an int/long according to the logical type, and there is no Instant or LocalTime to be converted.
I don't have a good idea for further revising this code path. You're welcome to propose a patch.
Any update on this patch?
You can send a patch directly on top of this one with your suggestions, or at least point out which APIs provided by Avro are relevant to implementing it.
👋 @tisonkun @Praveen2112 @hashhar - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.
No longer working on this.
Improvement.
Avro part of trino-record-decoder.
Support parsing Trino's time types in the Avro column decoder.
Related issues, pull requests, and links
AvroColumnDecoder #13069.
From Avro's spec, all of these types should have either int or long in the "type" field, with a "logicalType" field to distinguish them. So to support these types on the AvroColumnDecoder side, simply adding them to the supported types list is enough; the actual parsing logic will be handled on the caller side, for example in PulsarAvroRowDecoderFactory.
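(As a rough sketch of what "add them to the supported types list" amounts to; the actual field and method names in AvroColumnDecoder may differ.)

import static io.trino.spi.type.DateType.DATE;
import static io.trino.spi.type.TimeType.TIME_MILLIS;
import static io.trino.spi.type.TimestampType.TIMESTAMP_MILLIS;

import io.trino.spi.type.Type;
import java.util.Set;

final class SupportedTypesSketch
{
    private SupportedTypesSketch() {}

    // Hypothetical shape of the change: extend the decoder's whitelist of Trino types so
    // columns mapped to DATE / TIME_MILLIS / TIMESTAMP_MILLIS pass validation, while the
    // int/long payload is still interpreted according to the type the caller passes in.
    static final Set<Type> SUPPORTED_TIME_TYPES = Set.of(DATE, TIME_MILLIS, TIMESTAMP_MILLIS);
}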
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
This is essentially an internal change.
( ) Release notes entries required with the following suggested text: