Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation #17443

Closed
wants to merge 2 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 11 additions & 11 deletions docs/structured-streaming-programming-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -713,15 +713,15 @@ old windows correctly, as illustrated below.

![Handling Late Data](img/structured-streaming-late-data.png)

However, to run this query for days, it's necessary for the system to bound the amount of
However, to run this query for days, it is necessary for the system to bound the amount of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo

intermediate in-memory state it accumulates. This means the system needs to know when an old
aggregate can be dropped from the in-memory state because the application is not going to receive
late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced
**watermarking**, which let's the engine automatically track the current event time in the data and
**watermarking**, which lets the engine automatically track the current event time in the data
and attempt to clean up old state accordingly. You can define the watermark of a query by
specifying the event time column and the threshold on how late the data is expected be in terms of
specifying the event time column and the threshold on how late the data is expected to be in terms of
event time. For a specific window starting at time `T`, the engine will maintain state and allow late
data to be update the state until `(max event time seen by the engine - late threshold > T)`.
data to update the state until `(max event time seen by the engine - late threshold > T)`.
In other words, late data within the threshold will be aggregated,
but data later than the threshold will be dropped. Let's understand this with an example. We can
easily define watermarking on the previous example using `withWatermark()` as shown below.
Expand Down Expand Up @@ -792,7 +792,7 @@ This watermark lets the engine maintain intermediate state for additional 10 min
data to be counted. For example, the data `(12:09, cat)` is out of order and late, and it falls in
windows `12:05 - 12:15` and `12:10 - 12:20`. Since, it is still ahead of the watermark `12:04` in
the trigger, the engine still maintains the intermediate counts as state and correctly updates the
counts of the related windows. However, when the watermark is updated to 12:11, the intermediate
counts of the related windows. However, when the watermark is updated to `12:11`, the intermediate
state for window `(12:00 - 12:10)` is cleared, and all subsequent data (e.g. `(12:04, donkey)`)
is considered "too late" and therefore ignored. Note that after every trigger,
the updated counts (i.e. purple rows) are written to sink as the trigger output, as dictated by
Expand Down Expand Up @@ -825,7 +825,7 @@ section for detailed explanation of the semantics of each output mode.
same column as the timestamp column used in the aggregate. For example,
`df.withWatermark("time", "1 min").groupBy("time2").count()` is invalid
in Append output mode, as watermark is defined on a different column
as the aggregation column.
from the aggregation column.

- `withWatermark` must be called before the aggregation for the watermark details to be used.
For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append
Expand Down Expand Up @@ -909,7 +909,7 @@ track of all the data received in the stream. This is therefore fundamentally ha
efficiently.

## Starting Streaming Queries
Once you have defined the final result DataFrame/Dataset, all that is left is for you start the streaming computation. To do that, you have to use the `DataStreamWriter`
Once you have defined the final result DataFrame/Dataset, all that is left is for you to start the streaming computation. To do that, you have to use the `DataStreamWriter`
([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamWriter) docs)
returned through `Dataset.writeStream()`. You will have to specify one or more of the following in this interface.

Expand All @@ -930,7 +930,7 @@ There are a few types of output modes.
new rows added to the Result Table since the last trigger will be
outputted to the sink. This is supported for only those queries where
rows added to the Result Table is never going to change. Hence, this mode
guarantees that each row will be output only once (assuming
guarantees that each row will be outputted only once (assuming
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo, I agree, but I thought it made sense to keep the same formulation as what is used in the same paragraph

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, hm. As a native speaker, I would have said "outputted" is incorrect actually. But it looks like it's an accepted alternative. Well, whatever's the smaller change to achieve consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I also thought "outputted" was strange.

I have reverted my change anyway.

fault-tolerant sink). For example, queries with only `select`,
`where`, `map`, `flatMap`, `filter`, `join`, etc. will support Append mode.

Expand Down Expand Up @@ -1396,15 +1396,15 @@ You can directly get the current status and metrics of an active query using
`lastProgress()` returns a `StreamingQueryProgress` object
in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryProgress)
and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
and an dictionary with the same fields in Python. It has all the information about
and a dictionary with the same fields in Python. It has all the information about
the progress made in the last trigger of the stream - what data was processed,
what were the processing rates, latencies, etc. There is also
`streamingQuery.recentProgress` which returns an array of last few progresses.

In addition, `streamingQuery.status()` returns `StreamingQueryStatus` object
In addition, `streamingQuery.status()` returns a `StreamingQueryStatus` object
in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryStatus)
and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
and an dictionary with the same fields in Python. It gives information about
and a dictionary with the same fields in Python. It gives information about
what the query is immediately doing - is a trigger active, is data being processed, etc.

Here are a few examples.
Expand Down