Enforce that all field schemas must be optional in KSQL #2768

Merged (10 commits), May 7, 2019

@big-andy-coates big-andy-coates commented May 2, 2019

Description

Fixes: #2769

The fields in the row schema in KSQL are supposed to always be optional (this is work @rodesai did previously). This is important because, at the moment, if a UDF throws an exception while processing a field, KSQL outputs the row with a null value in that field. So fields must be nullable.
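
To illustrate why a throwing UDF forces nullable fields, here is a minimal sketch. `NullOnFailure` and `evaluate` are hypothetical names made up for this example, not KSQL's actual code; the sketch only assumes that a failed evaluation yields null for that one field while the row is still emitted.

```java
import java.util.function.Function;

// Hypothetical helper, not KSQL's real implementation: evaluates a UDF-like
// function and substitutes null when it throws, as described above.
final class NullOnFailure {

  private NullOnFailure() {
  }

  static <T, R> R evaluate(final Function<T, R> udf, final T arg) {
    try {
      return udf.apply(arg);
    } catch (final RuntimeException e) {
      // The row is still output; only this field becomes null,
      // so the field's schema has to allow null.
      return null;
    }
  }

  public static void main(final String[] args) {
    final Function<String, Integer> parse = Integer::parseInt;
    System.out.println(evaluate(parse, "42"));           // prints 42
    System.out.println(evaluate(parse, "not-a-number")); // prints null
  }
}
```

If the result field were declared non-optional, the second call would produce a value the schema cannot represent.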

However, while introducing KsqlSchema recently I noted there are some points in the code base where this is not being adhered to: the return values of UDFs and UDAFs. This PR addresses this.

Obviously, in time we'll likely want to support NOT NULL column defs. But that requires some thought! Until that happens, we should be consistent and always have nullable fields.

The introduction of our own schema type, KsqlSchema, allows us to add a check for non-optional schemas in one place, which this PR does.
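
A rough sketch of what a single, central check could look like. `SketchField` and `SketchRowSchema` are simplified stand-ins invented for this example (KSQL itself builds on Kafka Connect schemas), and the sketch assumes the check is applied recursively to nested fields.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical schema node; not the real Connect Schema type.
final class SketchField {
  final String name;
  final boolean optional;
  final List<SketchField> children; // nested fields, empty for primitives

  SketchField(final String name, final boolean optional, final SketchField... children) {
    this.name = name;
    this.optional = optional;
    this.children = Arrays.asList(children);
  }
}

// Hypothetical row schema wrapper: the one place where optionality is enforced.
final class SketchRowSchema {
  final List<SketchField> fields;

  SketchRowSchema(final SketchField... fields) {
    for (final SketchField field : fields) {
      requireOptional(field, field.name);
    }
    this.fields = Arrays.asList(fields);
  }

  // Walks a field and everything nested in it, rejecting the first non-optional node.
  private static void requireOptional(final SketchField field, final String path) {
    if (!field.optional) {
      throw new IllegalArgumentException("Non-optional field: " + path);
    }
    for (final SketchField child : field.children) {
      requireOptional(child, path + "." + child.name);
    }
  }

  public static void main(final String[] args) {
    new SketchRowSchema(new SketchField("USERID", true)); // accepted
    try {
      new SketchRowSchema(new SketchField("STATUSES", false));
    } catch (final IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints Non-optional field: STATUSES
    }
  }
}
```

Because every schema passes through the constructor, a non-optional UDF or UDAF return type fails fast at construction time rather than surfacing later during serialization.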

Reviewing notes:

  1. Check out the ensureOptional method in SchemaUtil and associated tests.
  2. Check out the checks for non-optional schemas in KsqlSchema and associated tests.
  3. Check out the checks for non-optional return type schemas in both KsqlFunction and BaseAggregateFunction, and associated tests.
  4. Check out the changes in UdfCompiler to ensure UDAFs have optional return types, and associated tests.
  5. Check out the changes in UdfLoader to ensure UDFs have optional return types, and associated tests. Note: I've refactored UdfLoaderTest as well to make it faster. It was taking ~15 seconds, whereas it now takes < 1s.
  6. There are then a load of 5_3_0_pre expected topology files that have changed to reflect that fields are now deep-optional. See backwards compatibility section below.
  7. I've had to remove some tests from InsertValuesExecutorTest as it is no longer possible to get the test pre-conditions i.e. ksql schemas with non-optionals.
  8. There are some other tests where non-optional schemas have been fixed.
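
For reviewers, the idea behind a deep ensureOptional conversion (note 1) can be sketched as follows. `SketchSchema` is a hypothetical, simplified schema node made up for this example; the real method in SchemaUtil works on the schema types KSQL actually uses and may differ in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified, hypothetical schema node used only for illustration.
final class SketchSchema {
  final String type;
  final boolean optional;
  final List<SketchSchema> nested; // element/value/field schemas, empty for primitives

  SketchSchema(final String type, final boolean optional, final List<SketchSchema> nested) {
    this.type = type;
    this.optional = optional;
    this.nested = Collections.unmodifiableList(new ArrayList<>(nested));
  }

  // Returns a copy in which this schema and every nested schema is optional.
  static SketchSchema ensureOptional(final SketchSchema schema) {
    final List<SketchSchema> converted = new ArrayList<>();
    for (final SketchSchema child : schema.nested) {
      converted.add(ensureOptional(child));
    }
    return new SketchSchema(schema.type, true, converted);
  }

  // True only if this schema and all nested schemas are optional.
  boolean isDeepOptional() {
    if (!optional) {
      return false;
    }
    for (final SketchSchema child : nested) {
      if (!child.isDeepOptional()) {
        return false;
      }
    }
    return true;
  }

  public static void main(final String[] args) {
    final SketchSchema element =
        new SketchSchema("STRING", true, Collections.<SketchSchema>emptyList());
    // Like the 5.2 collect_list return type: a non-optional array of optional strings.
    final SketchSchema array =
        new SketchSchema("ARRAY", false, Collections.singletonList(element));
    System.out.println(array.isDeepOptional());                 // prints false
    System.out.println(ensureOptional(array).isDeepOptional()); // prints true
  }
}
```

The conversion returns a new value rather than mutating its input, so the original schema can still be compared against the converted one in tests.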

Backwards compatibility

Obviously, this PR changes the schema KSQL is using to serialize and deserialize data to/from topics, so we need to be careful this is not breaking compatibility.

For the schema-less formats, (JSON, DELIMITED) there is no issue as we're widening the range of permissible values. The serde classes will continue to work as expected.

That leaves AVRO, and luckily AVRO would see this as a backward-compatible change, i.e. applications using the new schema can read data written with the older schema. There would be issues if a user were to roll back after writing data, but this is something we've done in the past. Likewise, any downstream clients still using the old schema to read will fail. While not ideal, I think this is acceptable and I'll call it out in the release and upgrade notes.
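
The asymmetry can be shown with a toy model. This is not the Avro library: `UnionReadSketch` and `canRead` are hypothetical names, and the model reduces schema resolution to one question, namely whether the reader's type for a field includes the branch that was actually written.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of union resolution for a single field; ignores everything else
// a real Avro reader does (defaults, promotions, record matching, ...).
final class UnionReadSketch {

  private UnionReadSketch() {
  }

  // A value can be read if the reader's type for the field includes the written branch.
  static boolean canRead(final Set<String> readerBranches, final String writtenBranch) {
    return readerBranches.contains(writtenBranch);
  }

  public static void main(final String[] args) {
    // Old field type: plain array. New field type: union of null and array.
    final Set<String> oldType = new HashSet<>(Arrays.asList("array"));
    final Set<String> newType = new HashSet<>(Arrays.asList("null", "array"));

    System.out.println(canRead(newType, "array")); // new reader, old data: true
    System.out.println(canRead(oldType, "array")); // old reader, non-null new data: true
    System.out.println(canRead(oldType, "null"));  // old reader, null new data: false
  }
}
```

In this model a reader on the new schema accepts everything the old schema could write, while a reader on the old schema breaks as soon as a null is written, which matches the rollback and downstream-client caveats above.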

Testing done

Unit and functional

Manual testing:

  1. Brought up a 5.2 KSQL node and created a persistent query that would have a non-optional schema. NB: collect_list has a non-optional return value in 5.2.

    $ bin/ksql-datagen quickstart=clickstream format=avro topic=clickstream maxInterval=100 iterations=100

    ksql> CREATE STREAM clickstream with (kafka_topic='clickstream',value_format='AVRO');
    ksql> CREATE TABLE TEST AS SELECT userid, collect_list(status) as statuses FROM clickstream GROUP BY userid;

  2. Confirmed I can select from the output topic with SELECT * FROM TEST LIMIT 5; when re-running datagen.

  3. Stopped KSQL server and CLI

  4. Started KSQL on this PR's version.

  5. Re-confirmed I can select from the output topic with SELECT * FROM TEST LIMIT 5; when re-running datagen.

  6. Confirmed two Avro schema versions in SR for TEST-value

    $ curl -X GET http://localhost:8081/subjects/TEST-value/versions
    [1,2]
    $ curl -X GET http://localhost:8081/subjects/TEST-value/versions/1
    {"subject":"TEST-value","version":1,"id":6,"schema":"{\"type\":\"record\",\"name\":\"KsqlDataSourceSchema\",\"namespace\":\"io.confluent.ksql.avro_schemas\",\"fields\":[{\"name\":\"USERID\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"STATUSES\",\"type\":{\"type\":\"array\",\"items\":[\"null\",\"string\"]}}]}"}
    $ curl -X GET http://localhost:8081/subjects/TEST-value/versions/2
    {"subject":"TEST-value","version":2,"id":9,"schema":"{\"type\":\"record\",\"name\":\"KsqlDataSourceSchema\",\"namespace\":\"io.confluent.ksql.avro_schemas\",\"fields\":[{\"name\":\"USERID\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"STATUSES\",\"type\":[\"null\",{\"type\":\"array\",\"items\":[\"null\",\"string\"]}],\"default\":null}]}"}

    Note: the type of STATUSES has changed from an array type to a union of either null or array. Yay!

No errors reported & everything worked as expected.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@big-andy-coates big-andy-coates requested a review from a team as a code owner May 2, 2019 09:20

@agavra agavra left a comment

LGTM! Thanks @big-andy-coates. I have one concern with behavior regarding recursive optionality inline.

@agavra agavra requested a review from a team May 2, 2019 15:26

@JimGalasyn JimGalasyn left a comment

LGTM, with one suggestion.

@big-andy-coates big-andy-coates merged commit ec05712 into confluentinc:master May 7, 2019
@big-andy-coates big-andy-coates deleted the ksql_schema_3 branch May 7, 2019 14:05
Successfully merging this pull request may close these issues.

KSQL UDF and UDAFs can have non-optional return schemas