Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for date, time_millis, and timestamp_millis to AvroColumnDecoder #13070

Closed
wants to merge 1 commit into from

Conversation

tisonkun
Copy link

@tisonkun tisonkun commented Jul 3, 2022

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Avro part of trino-record-decoder.

How would you describe this change to a non-technical end user or system administrator?

Support parsing trino's time types in Avro column decoder.

Related issues, pull requests, and links

From Avro's spec, actually all of these types should have either int or long in "type" field with a "logicalType" field distinguish.

So to support these types in AvroColumnDecoder aspect, simply add them to supported types list is enough and the actual parsing logic will be handled in caller side for example PulsarAvroRowDecoderFactory.

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.

This is almost an internal change.

( ) Release notes entries required with the following suggested text:

@cla-bot
Copy link

cla-bot bot commented Jul 3, 2022

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

@tisonkun
Copy link
Author

tisonkun commented Jul 3, 2022

I have submitted CLA to [email protected]. Please check.

@tisonkun
Copy link
Author

tisonkun commented Jul 8, 2022

cc @kokosing I think I've sent the CLA already but the test still failed. Did I miss something here?

@ebyhr
Copy link
Member

ebyhr commented Jul 8, 2022

@tisonkun The registering process takes a few days. We will ping cla-bot once it's registered.

@tisonkun
Copy link
Author

@ebyhr two weeks passed. May I have an estimate about your team's reviewing?

@tisonkun
Copy link
Author

When I working on apache/pulsar#16683, I notice that after Trino supports parametric timestamp, the decode case should be review in another round.

Related to:

@ebyhr
Copy link
Member

ebyhr commented Jul 21, 2022

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jul 21, 2022
@cla-bot
Copy link

cla-bot bot commented Jul 21, 2022

The cla-bot has been summoned, and re-checked this pull request!

@tisonkun tisonkun requested a review from ebyhr July 22, 2022 04:38
@tisonkun
Copy link
Author

ping @ebyhr @Praveen2112 @hashhar

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we opt for custom encoding instead of using what Avro already provides out of the box?

See for example https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29

This custom encoding won't be readable by other tools which don't know how we've encoded the values.

@tisonkun
Copy link
Author

@hashhar I don't get your point. This patch follow the Avro spec and adapt to Trino internal representation design.

Trino's TimestampType always store the payload in epochMicros precision.

@tisonkun
Copy link
Author

And yes, it breaks downstream adoption as in apache/pulsar@c418395.

This is introduced by how we implement parametric timestamp type 21e3ddf, and out of the scope of this patch.

@hashhar
Copy link
Member

hashhar commented Jul 29, 2022

But for other Avro readers to be able to get a timestamp value without manually decoding it the Avro schema needs to annotate the field with a logical type.

@tisonkun
Copy link
Author

tisonkun commented Jul 29, 2022

@hashhar You can read this commit 52f4cf2 that I revert trino-kafka adapting Avro logical type.

Still I don't know what changes you proposed here. It follows Avro spec, schema factory can infer Trino's type from Avro type and logical type, but that's the responsibility of downstream plugins.

@hashhar
Copy link
Member

hashhar commented Jul 29, 2022

I'm a bit confused about this now. I probably need to re-read existing code more thoroughly and re-review later. In meantime I'll defer to @Praveen2112 .

@tisonkun
Copy link
Author

tisonkun commented Aug 9, 2022

ping @ebyhr @hashhar @Praveen2112

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we squash all the commits into one.

…Decoder

squash. handle exactly time_millis and timestamp_millis type
squash. test avro decoder
squash. test avro schema converter
squash. handle DateType when serialize record
squash. fix style and revert trino-kafka changes
squash. revert all trino-kafka changes
squash. Update lib/trino-record-decoder/src/main/java/io/trino/decoder/avro/AvroColumnDecoder.java
squash. pass logical type in tests

Signed-off-by: tison <[email protected]>
@tisonkun
Copy link
Author

tisonkun commented Sep 1, 2022

@Praveen2112 thanks for your reivews! Updated and squashed.

public void testIntDecodedAsTimeMillis()
{
DecoderTestColumnHandle row = new DecoderTestColumnHandle(0, "row", TIME_MILLIS, "id", null, null, false, false, false);
Map<DecoderColumnHandle, FieldValueProvider> decodedRow = buildAndDecodeColumn(row, "id", "{\"type\":\"int\",\"logicalType\":\"time-millis\"}", 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about for time-micros ? I guess it is also an valid Avro logical type ? Similarly for timestamp

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can work. For the original purpose, I prepare this patch only to adapt Pulsar Trino Plugin's use case. Since the review process goes so long now, I'd prefer after a good result on the mills part and I may continue the micros part. Otherwise, it's extending the scope and I don't see an end.

BTW, Trino itself handles layout changes of TimestampType and TimeType in different PRs, so I think it's not a requirement we must do it at once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree we can have different PRs for a dedicated datatype. But is there a way to differentiate between if the data type is of time-micros or time-mills. If a datatype is annotated with a LogicalType time-micro how do we restrict decoding that datatype ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the convertor's responsibility to pass the correct Trino type. AvroColumnDecoder accepts the Trino type the caller passes to it.

public void testLongDecodedAsTimestampMillis()
{
DecoderTestColumnHandle row = new DecoderTestColumnHandle(0, "row", TIMESTAMP_MILLIS, "id", null, null, false, false, false);
Map<DecoderColumnHandle, FieldValueProvider> decodedRow = buildAndDecodeColumn(row, "id", "{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}", 1658463479L);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is the correct way to test it. Since we pass the value to be serialized as long - Avro serializes/de-serializes it as long but as per specification it looks like for Java we would pass the input as Instant for Timestamp logical type - register the conversion for GeneircRecord and serialize the data so when we de-serialize we would get them as Instant instead of long (we could do the conversion manually). Similar mapping/conversion is implemented for Java classes generated by Avro.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenericData.get().addLogicalTypeConversion(new TimeConversions.TimestampMillisConversion());

TypeConversion can be added using similar snippets.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avro serializes/de-serializes it as long but as per specification it looks like for Java we would pass the input as Instant for Timestamp logical type

Yes. To support many and many cases, there's a lot of work to do. I don't think one PR should handle all of it or you can try it out. I'd say the patch here can resolve the specific issue of bumping trino version for Pulsar Trino Plugin.

GenericData.get().addLogicalTypeConversion(new TimeConversions.TimestampMillisConversion());

What does it mean? If I get it right you suggest one more possible case avro timestamp can be represented as. Then my answer is as above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree but I'm afraid if this is the correct approach. One major thing which bugs me is that we manually do the conversion from a long to a micro-second precision compatible format - but ideally it should happen as a part of Avro library and it does support it. And one additional benefit is that in case if Avro modifies their internal representation it shouldn't affect us.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ideally it should happen as a part of Avro library and it does support it

Can you show me how Avro "does support it"? I'm glad to adapt such a way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'm looking into it this week :)

Copy link
Author

@tisonkun tisonkun Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Praveen2112 after a review I think the conversion is inevitable because Trino has it's own representation, introduced by:

I take a look at the TimeConvertions, but it seems to be used on the fly data in-memory processing, not during deserialization. Pulsar Trino Plugin users meet a use case that the actual payload is int/long according to the logical type, and there's no Instant or LocalTime to be converted.

I don't have a good idea to make further revision on this codepath. You're welcome to show a patch.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any update on this patch?

cc @Praveen2112 @kokosing @hashhar

@tisonkun
Copy link
Author

@Praveen2112

but ideally it should happen as a part of Avro library and it does support it.

You can directly send a patch onto this one for your suggestions, or at least point out what APIs provided by Avro are related to implementing it.

@mosabua
Copy link
Member

mosabua commented Jan 12, 2024

👋 @tisonkun @Praveen2112 @hashhar - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@tisonkun
Copy link
Author

No longer work on this.

@tisonkun tisonkun closed this Feb 26, 2024
@tisonkun tisonkun deleted the issue-13069 branch February 26, 2024 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Add support for short timestamp, date and time to AvroColumnDecoder
6 participants