KSQL-1797: First draft of Transform a Stream topic #2180

Merged
merged 5 commits into from
Nov 26, 2018

JimGalasyn
Member

No description provided.

JimGalasyn requested review from a team and joel-hamill on November 21, 2018, 21:39
############################

KSQL enables *streaming transformations*, which you can use to convert
streaming data from one format to another in real time. With a streaming
Contributor

Careful here, this is not true in general/by default:

With a streaming transformation, not only is all existing data in a stream converted, but so is every record that subsequently arrives on the source stream.

This is only the case if you set auto.offset.reset to "earliest", which is not the default setting. By default, only newly arriving data in the source stream will be transformed by a launched query.
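For illustration, a minimal sketch of the setting in question, assuming it is issued in the KSQL CLI session before the transforming query is launched:

```sql
-- Make queries started in this session read the source topic from its earliest offset.
-- Without this, a newly launched query only transforms records that arrive after it starts.
SET 'auto.offset.reset' = 'earliest';
```

With this in place, the sentence under review would hold: existing records and subsequently arriving records would both be transformed.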

transformation, not only is all existing data in a stream converted, but so
is every record that subsequently arrives on the source stream.

Transform a Stream's Properties
Contributor

I think the intention is to describe that the WITH clause (in the example below) can be used for a variety of things, right?

If so, I'm not sure whether "a Stream's Properties" is very descriptive. I mean, I get it, but effectively there are five things you can do:

  • Change the data format (today: for message values; in the future: for both keys and values).
  • Change the number of partitions.
  • Change the number of replicas.
  • Change the timestamp field and/or the timestamp format.
  • Change the new stream's underlying Kafka topic name.

I am wondering whether we should call these out explicitly, rather than hiding them all under the rather non-descriptive name of "Properties"? I think most readers will not really know what "the properties of a stream" are.
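As a rough sketch of what calling these out explicitly could look like, here is one statement that touches several of the items in the list above. The stream name `pageviews_reformatted` and topic name `pageviews_avro` are hypothetical placeholders, `REPLICAS=1` assumes a single-broker setup, and the `\` continuation characters follow the multi-line style used elsewhere in this document:

```sql
-- Hypothetical names: pageviews_reformatted (new stream) and pageviews_avro (its Kafka topic).
-- The WITH clause sets the topic name, value format, partition count, replica count,
-- and the column used as the record timestamp.
CREATE STREAM pageviews_reformatted \
  WITH (KAFKA_TOPIC='pageviews_avro', \
        VALUE_FORMAT='AVRO', \
        PARTITIONS=5, \
        REPLICAS=1, \
        TIMESTAMP='viewtime') AS \
  SELECT * FROM pageviews;
```

Each property could also be shown in its own smaller example under its own heading, which may be easier for readers to scan than a single "Properties" section.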

WITH (TIMESTAMP='viewtime',
PARTITIONS=5,
VALUE_FORMAT='JSON') AS
SELECT viewtime, \
Contributor

I realize now that, in these examples, which target 5.0, we do need the \ as line continuation character. And we need it for all multi-line examples in this document (and perhaps other PRs that are in flight).

Even for upcoming 5.1 we will not yet have gotten rid of the \ for multi-line statements, because the (completed!) work on that barely missed the 5.1 code freeze.

.. code:: sql

CREATE STREAM pageviews_transformed_priority_1
WITH (TIMESTAMP='viewtime',
Contributor

IMHO we should remove the WITH clause here, and in the second stream below. They are not relevant for the example, and I find they are distracting from the key message (pun intended) we want to convey here.


.. code:: sql

CREATE STREAM pageviews_transformed_priority_1
Contributor

I wanted to suggest a more realistic, descriptive example: "priority_1" vs. "priority_2" in the stream names is a bit odd, because nothing in the query is about priority. I suppose the problem is that the example data for the running pageviews example contains no such information, right? (It would be nicer to, say, split the input stream into one stream for VIP users and another for regular users.)

But if we have to work under the constraints of pageviews, perhaps we can find somewhat better names for the streams?

Like:

  • pageviews_for_first_two_users
  • pageviews_for_other_users

or

  • pageviews_split_1
  • pageviews_split_2

?

One idea for a better example: We could split pageviews by the viewtime, like "morning pageviews" and "afternoon pageviews"?

Member Author

I went with pageviews_for_first_two_users/pageviews_for_other_users and opened ticket KSQL-1929 to track the viewtime improvement.

JimGalasyn merged commit 33acccb into confluentinc:5.0.0-post on Nov 26, 2018
JimGalasyn deleted the ksql-1797 branch on November 26, 2018, 22:45