
feat(serde): kafka format #3065

Merged · 19 commits · Jul 16, 2019

Conversation

big-andy-coates
Contributor

@big-andy-coates big-andy-coates commented Jul 11, 2019

Description

Note: contains PR #3066

Part of the work to introduce primitive, then structured, keys.

Keys are currently assumed to be Kafka-serialized strings. When we introduce a way to specify a KEY_FORMAT, there must be a way to declare that the key is a Kafka-serialized string if we're to maintain backwards compatibility. Also, many companies use Kafka-serialized ints or longs as keys.

This PR brings a new KAFKA format, which will use the appropriate standard Kafka serde classes to deserialize a primitive key or value, e.g.

CREATE STREAM FOO (ROWKEY BIGINT KEY, ....) WITH (KEY_FORMAT='KAFKA', ...);

will handle the case where the keys of the messages are longs that have been serialized using Kafka's LongSerializer class. (Or it will once the rest of the associated work is also complete.)

The new KAFKA format supports INT, BIGINT, DOUBLE and STRING fields only, as that's the set of Kafka serde classes that matches up to our KSQL types.
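As a rough illustration of what those serde classes put on the wire, here is a sketch based on the documented behaviour of the standard Kafka serializers (not code from this PR): IntegerSerializer and LongSerializer write big-endian two's-complement bytes, DoubleSerializer writes big-endian IEEE-754, and StringSerializer writes UTF-8.

```python
import struct

# Sketch of the byte layouts produced by the standard Kafka serializers
# (assumption: based on their documented behaviour, not this PR's code).

def serialize_int(v: int) -> bytes:
    # org.apache.kafka.common.serialization.IntegerSerializer: 4 bytes, big-endian
    return struct.pack(">i", v)

def serialize_bigint(v: int) -> bytes:
    # LongSerializer: 8 bytes, big-endian
    return struct.pack(">q", v)

def serialize_double(v: float) -> bytes:
    # DoubleSerializer: 8 bytes, big-endian IEEE-754
    return struct.pack(">d", v)

def serialize_string(v: str) -> bytes:
    # StringSerializer: UTF-8 bytes (its default encoding)
    return v.encode("utf-8")
```

A BIGINT key of 42, for example, would arrive as the eight bytes `00 00 00 00 00 00 00 2a`, which is what the KAFKA format would need to deserialize.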

The format only supports single values, i.e. a single field, being primarily intended for use as a key format.

However, users can also use it as a value format. If they do, they can't use the source in a statement with a JOIN or GROUP BY clause. Such statements generally require repartition and changelog topics; these internal topics currently use the same value format as their source, i.e. they'd use the KAFKA value format, and they also currently copy ROWTIME and ROWKEY into the value schema. That gives the internal topics multiple fields in the value schema, which the KAFKA format cannot support.

So I have explicitly disabled JOIN and GROUP BY where the VALUE_FORMAT is KAFKA so that the user gets a more useful error message.

In the future we can fix this by either/both:

  • Use a standard serialization format for all internal topics, i.e. don't use the source's formats for internal topics.
  • Don't copy the two fields into the value of the internal topics, which also saves space.
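The restrictions described above (single field, supported types only, no JOIN or GROUP BY when the value format is KAFKA) could be sketched as a validation along these lines. This is a hypothetical sketch; the names and structure are illustrative, not the PR's actual code.

```python
# Hypothetical sketch of the KAFKA-format restrictions described above;
# names are illustrative, not taken from the KSQL codebase.
SUPPORTED_KAFKA_TYPES = {"INT", "BIGINT", "DOUBLE", "STRING"}

def validate_kafka_format(fields, used_in_join_or_group_by):
    # The KAFKA format can only represent a single primitive value.
    if len(fields) != 1:
        raise ValueError("The KAFKA format supports only a single field")
    name, ksql_type = fields[0]
    if ksql_type not in SUPPORTED_KAFKA_TYPES:
        raise ValueError(f"The KAFKA format does not support type {ksql_type}")
    # JOIN / GROUP BY would need multi-field internal topics, which the
    # KAFKA format cannot serialize, so fail early with a clear message.
    if used_in_join_or_group_by:
        raise ValueError(
            "A source with KAFKA value format cannot be used in JOIN or GROUP BY")
```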

This PR also enhances the validation of C* statements that use the DELIMITED value format. Previously, a statement such as the one below would succeed:

CREATE STREAM FOO (V0 ARRAY<INT>, V1 MAP<STRING, BIGINT>, V2 STRUCT<F0 INT>) WITH(VALUE_FORMAT='DELIMITED',...);

This succeeded even though DELIMITED does not support such complex types; any C*AS statement built on FOO would then fail with a cryptic error message. I picked this up while testing such statements for the new format, so I fixed both.
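The tightened DELIMITED check amounts to rejecting complex types at CREATE time rather than letting a downstream statement fail later. A hypothetical sketch, with illustrative names:

```python
# Hypothetical sketch of the DELIMITED validation described above:
# reject complex types at CREATE time instead of failing later with a
# cryptic error in a downstream C*AS statement. Names are illustrative.
COMPLEX_TYPE_PREFIXES = ("ARRAY", "MAP", "STRUCT")

def validate_delimited_schema(fields):
    for name, ksql_type in fields:
        if ksql_type.startswith(COMPLEX_TYPE_PREFIXES):
            raise ValueError(
                f"The DELIMITED format does not support type {ksql_type} "
                f"for column {name}")
```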

Testing done

Lots of appropriate tests added.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

A new ``KAFKA`` format that supports ``INT``, ``BIGINT``, ``DOUBLE`` and ``STRING`` fields that have been serialized using the standard Kafka serializers,
  e.g. ``org.apache.kafka.common.serialization.LongSerializer``, or equivalent.

  The format only supports single values, i.e. a single field, being primarily intended for use as a key format.
@big-andy-coates big-andy-coates requested review from JimGalasyn and a team as code owners July 11, 2019 09:03
@rmoff
Member

rmoff commented Jul 12, 2019

Will this work with Kafka Connect?

Conflicting files
ksql-engine/src/main/java/io/confluent/ksql/analyzer/Analyzer.java
ksql-engine/src/main/java/io/confluent/ksql/ddl/commands/CreateSourceCommand.java
ksql-engine/src/main/java/io/confluent/ksql/serde/KsqlSerdeFactories.java
ksql-engine/src/test/java/io/confluent/ksql/analyzer/AnalyzerTest.java
ksql-serde/src/main/java/io/confluent/ksql/serde/Format.java
@big-andy-coates
Contributor Author

@rmoff

Will this work with Kafka Connect?

I don't think this will help Connect, which tends to use Avro/JSON keys, right? I'll be looking into Avro/JSON keys very soon.

The new KAFKA format just allows users to import data where the key is, for example, a long that's been serialized using Kafka's LongSerializer. How they then dump the data back out through Connect, where the key is not a string, I'm not sure yet. I need to see what Connect can handle, and also look into allowing the type of the key to change across statements, i.e. so users can switch to a string key in their output topic if needed.

Member

@JimGalasyn JimGalasyn left a comment


LGTM, with a few suggestions.

@big-andy-coates
Contributor Author

big-andy-coates commented Jul 12, 2019

Update on the Connect friendliness of the new KAFKA format, from the Connect team.

Connect requires a Converter, but AK 2.0 introduced a bunch of numeric converters, including https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/converters/LongConverter.java, as part of KIP-305 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-305%3A+Add+Connect+primitive+number+converters).

In other words, yes, you can configure Connect to serialize something like a numeric user ID as a Long / BIGINT. Saaaaaaaaweeet. Who's doing the blog post????? eh? eh? cc @rmoff
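For example, a Connect worker or connector could be pointed at such a topic with a key-converter setting along these lines (an illustrative config fragment, not part of this PR; the converter class is the one KIP-305 added):

```properties
# Illustrative Connect config fragment: use KIP-305's LongConverter
# so BIGINT keys round-trip between Connect and the KAFKA format.
key.converter=org.apache.kafka.connect.converters.LongConverter
```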

Contributor

@rodesai rodesai left a comment


Thanks @big-andy-coates, this mostly looks good. One issue inline.

@big-andy-coates big-andy-coates requested review from a team and rodesai July 14, 2019 12:12
Contributor

@agavra agavra left a comment


LGTM - I agree with Rohan's comment, so if you feel strongly I think we should justify it. Maybe we can use @vcrfxia's benchmarks?

@agavra agavra requested a review from a team July 15, 2019 23:32
Conflicting files
ksql-engine/src/test/java/io/confluent/ksql/analyzer/AnalyzerTest.java
ksql-serde/src/main/java/io/confluent/ksql/serde/delimited/KsqlDelimitedSerdeFactory.java
Conflicting files
ksql-engine/src/test/java/io/confluent/ksql/analyzer/AnalyzerTest.java
Contributor

@rodesai rodesai left a comment


LGTM

@big-andy-coates big-andy-coates merged commit 2b5c3d1 into confluentinc:master Jul 16, 2019
@big-andy-coates big-andy-coates deleted the kafka_format branch July 16, 2019 21:25