
Add write support for kafka connector #4230

Closed
charlesjmorgan wants to merge 1 commit from the kafka-writes branch

Conversation

@charlesjmorgan (Member) commented Jun 25, 2020

  • Add write support for the Kafka connector
  • Add an encoder to serialize messages into Avro, CSV, JSON, and raw formats (works for primitives and JSON date/time types)
  • Some changes proposed in #4183 are currently included in this PR, but once those get merged I'll rebase (done)

Closes #3980

@cla-bot cla-bot bot added the cla-signed label Jun 25, 2020
@charlesjmorgan charlesjmorgan changed the title Implement inserts for kafka connector Add write support for kafka connector Jun 25, 2020
@charlesjmorgan charlesjmorgan force-pushed the kafka-writes branch 2 times, most recently from a119b37 to 70648c3 Compare June 26, 2020 03:53
@aalbu (Member) left a comment


I reviewed part of it and it looks good! I made a few comments; I will continue reviewing tomorrow.

@charlesjmorgan (Member, Author) commented Jun 26, 2020

Made the revisions suggested by @aalbu:

  • Changed RowEncoder#encodeRow to return a ProducerRecord (see the sketch below)
  • Added serializers for the different data formats
  • Removed encoder tests that no longer work (more will be added in the future)
  • Improved the round-trip test design in TestKafkaIntegrationSmokeTest
  • Rebased and brought in the bug-fix changes for the RawRowDecoder
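
For illustration, a minimal sketch of the Kafka-specific shape described in the first bullet. Only RowEncoder#encodeRow and ProducerRecord come from the discussion; the Page/position parameters are assumptions, not the exact interface in this PR:

```java
import io.prestosql.spi.Page;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical sketch of the Kafka-specific encoder described above;
// the exact shape in the PR may differ.
public interface RowEncoder
{
    // Encode the row at the given position of the page into a record
    // that can be handed directly to a KafkaProducer.
    ProducerRecord<byte[], byte[]> encodeRow(Page page, int position);
}
```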

@findepi (Member) commented Jun 26, 2020

cc @elonazoulay

@charlesjmorgan charlesjmorgan force-pushed the kafka-writes branch 3 times, most recently from d441a7b to b4bd592 Compare June 27, 2020 15:18
@kokosing (Member) left a comment


A few random comments; I need to get back to it later.

@findepi (Member) left a comment


(skimming)

@findepi (Member) commented Jun 29, 2020

One thing that is non-obvious to me:
should we have a similar set of classes, interfaces, factories, and providers for decoders and encoders,
OR should we enhance the decoders' interface to handle encoding as well?

I assume @charlesjmorgan @aalbu you considered that, so perhaps there is an answer ready.

@charlesjmorgan (Member, Author)

@findepi
The original goal was to make an encoder that could be used by different connectors, like the current record decoder. Any interfaces that seem unnecessary are there because of that.

Recently we changed the encoder to be Kafka-specific, so instead of returning a byte[] it returns a ProducerRecord, and the serializers take care of... well, serialization. @aalbu brought up what I think is a good point about keeping encoding and decoding separate and the single-responsibility principle. That is why, while the EncoderColumnHandle might seem functionally equivalent to the DecoderColumnHandle, they are two separate classes (see the sketch below). If we were to combine the encoder and decoder in the future, I think there is a case to be made for using just one column handle.

I'm not sure which is the best option: keeping encoding and decoding separate, or combining them. I'm also not sure if it would be better to have a one-size-fits-all encoder/decoder or connector-specific ones optimized for each use case. Any thoughts you (or anyone else) have on this would be helpful.
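
For illustration, a minimal sketch of the separation described above: two functionally similar handles kept as distinct types so the encoding and decoding sides can evolve independently. The method set shown is an assumption for illustration, not the exact code in this PR:

```java
import io.prestosql.spi.connector.ColumnHandle;
import io.prestosql.spi.type.Type;

// Hypothetical sketch: the two handles carry similar information, but are
// deliberately separate types (single-responsibility), so either side can
// change without dragging the other along.
interface DecoderColumnHandle
        extends ColumnHandle
{
    String getName();
    Type getType();
    String getMapping();
}

interface EncoderColumnHandle
        extends ColumnHandle
{
    String getName();
    Type getType();
    String getMapping();
}
```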

@aalbu (Member) commented Jun 29, 2020

should we have a similar set of classes, interfaces, factories, and providers for decoders and encoders

Perhaps. I advocated for an iterative approach, where we implement the functionality for the connector at hand, and then consider more generic abstractions. I feel that trying to generalize too early can lead to constraining decisions.

OR should we enhance the decoders' interface to handle encoding as well?

So basically RowDecoder -> RowCodec? That is a possibility, if we end up with symmetric abstractions. I think that the RowDecoder could be generalized more (why do we assume that the source is a byte[]?).
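
For illustration, a sketch of what that generalization might look like, with the source type parameterized instead of fixed to byte[]. The RowCodec name comes from the comment above; everything else is assumed:

```java
import java.util.Map;
import java.util.Optional;

import io.prestosql.decoder.DecoderColumnHandle;
import io.prestosql.decoder.FieldValueProvider;
import io.prestosql.spi.Page;

// Hypothetical sketch: a symmetric codec parameterized over the wire
// type T, so connectors are not forced to go through byte[].
public interface RowCodec<T>
{
    Optional<Map<DecoderColumnHandle, FieldValueProvider>> decodeRow(T data);

    T encodeRow(Page page, int position);
}
```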

@findepi (Member) commented Jun 29, 2020

I'm also not sure if it would be better to have a one-size-fits-all encoder/decoder or connector-specific ones optimized for each use case

That's a very valid question.
I am not sure whether decoder reuse is the optimal choice, but we have it today anyway.

Since encoding seems pretty symmetric to decoding, it seems justified to bundle them in a single class.
Doing so emphasizes the symmetry and may make the implementation easier to understand.

I understand that, to realize this symmetry, we would need Kafka-independent types in the encoder's interfaces.
byte[] or Slice seems like a safe choice. We would leave submitting this byte[] key/value to the Kafka topic to Kafka-specific code.
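
For illustration, a sketch of that split under the stated assumptions: a connector-agnostic encoder that returns byte[], with only the Kafka-specific code knowing about ProducerRecord. All names here are hypothetical:

```java
import io.prestosql.spi.Page;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical: a connector-agnostic encoder returning raw bytes...
interface RowEncoder
{
    byte[] encodeKey(Page page, int position);

    byte[] encodeValue(Page page, int position);
}

// ...while only the Kafka-specific code turns those bytes into a ProducerRecord.
final class KafkaRecordSubmitter
{
    private final KafkaProducer<byte[], byte[]> producer;
    private final String topic;
    private final RowEncoder encoder;

    KafkaRecordSubmitter(KafkaProducer<byte[], byte[]> producer, String topic, RowEncoder encoder)
    {
        this.producer = producer;
        this.topic = topic;
        this.encoder = encoder;
    }

    void submit(Page page, int position)
    {
        // The encoder never sees Kafka types; only this class does.
        producer.send(new ProducerRecord<>(topic, encoder.encodeKey(page, position), encoder.encodeValue(page, position)));
    }
}
```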

I think that the RowDecoder could be generalized more (why do we assume that the source is a byte[]?).

Because that's just convenient. You need to plug in a flexible interface when the cost of copying data is large (i.e., the data is large).
The current code is probably not optimized to avoid that, and I am not sure it should be.

@losipiuk please chime in, since you have worked with this code more than I have.

@losipiuk (Member) left a comment


Partial review.

@aalbu (Member) commented Jun 30, 2020

I understand that, to realize this symmetry, we would need Kafka-independent types in the encoder's interfaces.
byte[] or Slice seems like a safe choice. We would leave submitting this byte[] key/value to the Kafka topic to Kafka-specific code.

But what's the value of that? There isn't much code reuse. It's not an intuitive way of working with target systems. We might make Kafka work with it, but do we know that it's a good fit for other systems? Redis is using the RecordDecoder - can we make Redis writes work by providing it byte arrays? Would it work when you persist a list? I don't think we have given that much thought, so I feel there is no supporting evidence that extending the interface with encoding methods producing a byte[] is appropriate.

@findepi (Member) commented Jun 30, 2020

But what's the value of that? There isn't much code reuse.

This isn't about code reuse, but more about physically bundling things that are coupled.
For example, I would want the Integer#parseInt and Integer#toString conversions to be implemented
in one class, as one is the reverse of the other (and they are not so complex as to preclude being in one class).

Redis is using the RecordDecoder - can we make Redis writes work by providing it byte arrays?

I don't know. IF decoding accepted byte[], THEN it would be reasonable for encoding to return byte[].
However, I just realized it is worse than that, because decoding takes byte[] data, Map<String, String> dataMap
(and the dataMap is specific to Redis; see the signature below).
That suggests Redis is not a good use case for byte[]-based row decoding/encoding at all.
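
For reference, the decoder signature being discussed; the parameters are quoted above, and the rest of the shape is reconstructed from the presto-record-decoder module, so treat it as approximate:

```java
import java.util.Map;
import java.util.Optional;

import io.prestosql.decoder.DecoderColumnHandle;
import io.prestosql.decoder.FieldValueProvider;

// Approximate shape of the existing interface in presto-record-decoder;
// the dataMap parameter is the Redis-specific part called out above.
public interface RowDecoder
{
    Optional<Map<DecoderColumnHandle, FieldValueProvider>> decodeRow(byte[] data, Map<String, String> dataMap);
}
```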

@losipiuk (Member)

This isn't about code reuse, but more about physically bundling things that are coupled

I agree that bundling encoders and decoders in a single class would make the implementation easier to read and understand. However, it would push us further towards reusing the same logic in very different connectors. I think the fact that Kafka and Redis share a decoder implementation does more harm than good (see the example @findepi gave above).

I think (if we find the resources for it) we should untangle the situation and do either:

a) Rework the presto-record-decoder module to make it more connector-agnostic, and keep reusing it where we currently use it. Then we can combine PrestoDecoder with PrestoEncoder in the shared module.

b) Stop using presto-record-decoder in connectors it does not play well with (Redis?). This could potentially result in dropping presto-record-decoder altogether, if it turns out that having a separate implementation for each connector is better. Then we can combine the decoder and encoder logic in the Kafka connector module.

Short term, it probably does not matter much. At least for this PR I would keep them separate so we can focus on merging it sooner. Then, as a next step (which we should not postpone), we can work on the refactoring and move towards either a) or b). WDYT?

@findepi (Member) commented Jun 30, 2020

FWIW, the presto-record-decoder module is currently used in:

  • Kafka
  • Redis
  • Kinesis

At least for this PR I would keep them separate so we can focus on merging it sooner.

@losipiuk I'm not sure what you mean here. How does the choice between separate classes and extending the existing classes matter?

FYI, I do not have a very strong opinion either way, but I expect it may take some work to switch between the approaches.

@losipiuk (Member)

@losipiuk I'm not sure what you mean here. How does the choice between separate classes and extending the existing classes matter?

It matters in terms of how many rounds of review we need. I would prefer not to bloat this PR with extra refactorings. Smaller PRs are much easier to review and work with. And the sooner we merge this beast and continue with smaller, gradual changes, the better, IMO.

@charlesjmorgan (Member, Author) commented Jun 30, 2020

Thank you @aalbu, @findepi, @losipiuk, and @kokosing for all your feedback; it has been very helpful! I hope I got everything (I triple-checked, so it should be good). I intentionally left out changes that I didn't think were in the scope of this PR; I will revisit those in the future. I am going to split this PR into five parts: the first will be the basic functionality for inserts, and the four after that will each cover a specific encoder format. Let me know if you think I should split it up differently.

Base functionality/CSV encoder PR - #4287
