KSQL-1797: First draft of Transform a Stream topic #2180
############################

KSQL enables *streaming transformations*, which you can use to convert
streaming data from one format to another in real time. With a streaming
Careful here, this is not true in general/by default:
> With a streaming transformation, not only is all existing data in a stream converted, but so is every record that subsequently arrives on the source stream.

This is only the case if you set `auto.offset.reset` to "earliest", which is not the default setting. By default, only newly arriving data in the source stream will be transformed by a launched query.
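For illustration, a minimal sketch of how the offset behavior could be switched in the KSQL CLI before launching the query. The stream names `pageviews` and `pageviews_transformed` and the selected columns are assumptions made for this example, not taken from the PR.

```sql
-- Read the source topic from the beginning, so that existing records
-- are transformed in addition to newly arriving ones.
SET 'auto.offset.reset' = 'earliest';

-- Hypothetical transformation; stream and column names are assumed.
-- The trailing backslash is the line continuation character for
-- multi-line statements.
CREATE STREAM pageviews_transformed AS \
  SELECT viewtime, userid, pageid \
  FROM pageviews;
```

Without the `SET` statement, the launched query only processes records that arrive after it starts, which is the default described above.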
transformation, not only is all existing data in a stream converted, but so
is every record that subsequently arrives on the source stream.

Transform a Stream's Properties
I think the intention is to describe that the WITH clause (in the example below) can be used for a variety of things, right?
If so, I'm not sure whether "a Stream's Properties" is very descriptive. I mean, I get it, but effectively there are five things you can do:
- Change the data format (today: for message values; in the future: for both keys and values).
- Change the number of partitions.
- Change the number of replicas.
- Change the timestamp field and/or the timestamp format.
- Change the new stream's underlying Kafka topic name.
I am wondering whether we should call these out explicitly, rather than hiding them all under the rather non-descriptive name of "Properties"? I think most readers will not really know what "the properties of a stream" are.
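To make the list concrete, here is a hedged sketch of a single statement that touches most of these settings. The stream names, the topic name, and the replica count are hypothetical; `TIMESTAMP_FORMAT` is left out because it only applies when the timestamp column is a string, and `viewtime` is assumed to be a numeric epoch value here.

```sql
-- Hypothetical names: pageviews (source), pageviews_json (new stream),
-- 'pageviews-json-topic' (underlying Kafka topic).
-- VALUE_FORMAT : data format of the message values
-- PARTITIONS   : number of partitions of the new topic
-- REPLICAS     : number of replicas (needs at least that many brokers)
-- TIMESTAMP    : column to use as the record timestamp
-- KAFKA_TOPIC  : name of the new stream's underlying Kafka topic
CREATE STREAM pageviews_json \
  WITH (VALUE_FORMAT='JSON', \
        PARTITIONS=5, \
        REPLICAS=3, \
        TIMESTAMP='viewtime', \
        KAFKA_TOPIC='pageviews-json-topic') AS \
  SELECT viewtime, userid, pageid \
  FROM pageviews;
```

The comments sit above the statement rather than inside it, because each continued line is joined to the next and an inline `--` comment would swallow the rest of the statement.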
      WITH (TIMESTAMP='viewtime',
            PARTITIONS=5,
            VALUE_FORMAT='JSON') AS
      SELECT viewtime, \
I realize now that, in these examples, which target 5.0, we do need the `\` as line continuation character. And we need it for all multi-line examples in this document (and perhaps other PRs that are in flight).
Even for upcoming 5.1 we will not yet have gotten rid of the `\` for multi-line statements, because the (completed!) work on that barely missed the 5.1 code freeze.
.. code:: sql

    CREATE STREAM pageviews_transformed_priority_1
      WITH (TIMESTAMP='viewtime',
IMHO we should remove the WITH clause here, and in the second stream below. They are not relevant for the example, and I find they are distracting from the key message (pun intended) we want to convey here.
.. code:: sql

    CREATE STREAM pageviews_transformed_priority_1
I wanted to suggest a slightly more realistic/descriptive example: the use of "priority_1" vs. "priority_2" in the stream names is a bit odd, because nothing in the query looks like it is about priority. But then I suppose the problem is that, for the running example of `pageviews`, there's no such information in the example data, right? (It would be nicer to, say, separate the input stream into a stream for VIP users and another stream for regular users.)
But if we have to work under the constraints of `pageviews`, perhaps we can find somewhat better names for the streams?
Like:
- pageviews_for_first_two_users
- pageviews_for_other_users

or
- pageviews_split_1
- pageviews_split_2

One idea for a better example: We could split pageviews by the viewtime, like "morning pageviews" and "afternoon pageviews"?
I went with `pageviews_for_first_two_users`/`pageviews_for_other_users` and opened ticket KSQL-1929 to track the `viewtime` improvement.
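As a rough sketch of what the renamed split could look like: this assumes that `pageviews` has a `userid` column and that the first two users are identified by the values 'User_1' and 'User_2'. Those values and the selected columns are assumptions for illustration and may differ from what the PR actually uses. The `WITH` clause is omitted, in line with the earlier suggestion.

```sql
-- Records for the first two users (assumed IDs 'User_1' and 'User_2').
CREATE STREAM pageviews_for_first_two_users AS \
  SELECT viewtime, userid, pageid \
  FROM pageviews \
  WHERE userid='User_1' OR userid='User_2';

-- All remaining records, so the two streams together cover the source.
CREATE STREAM pageviews_for_other_users AS \
  SELECT viewtime, userid, pageid \
  FROM pageviews \
  WHERE userid<>'User_1' AND userid<>'User_2';
```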