[SPARK-18834][SS] Expose event time stats through StreamingQueryProgress #16258
Conversation
```
"numInputRows" : 678,
"inputRowsPerSecond" : 10.0,
"durationMs" : {
  "total" : 0
},
"currentWatermark" : 3,
"queryTimestamps" : {
  "eventTime.avg" : "2016-12-05T20:54:20.827Z",
```
@marmbrus the dot is in the key. Does this mean that when parsing with our JSON DataFrame support, we may have to refer to the column as progress.queryTimestamps.`eventTime.max`?
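To illustrate the ambiguity with a standalone sketch (this is not Spark's actual column resolver, just a toy model of dotted-path lookup): a resolver that splits the path on dots can never reach a key that itself contains a dot, which is why backtick quoting would be needed.

```scala
// Toy resolver: splits a field path on '.', so a key that itself
// contains a dot ("eventTime.max") can never be reached this way.
def resolve(root: Map[String, Any], path: String): Option[Any] =
  path.split('.').foldLeft(Option[Any](root)) {
    case (Some(m: Map[String, Any] @unchecked), key) => m.get(key)
    case _ => None
  }

val progress: Map[String, Any] =
  Map("queryTimestamps" -> Map("eventTime.max" -> "2016-12-05T20:54:20.827Z"))

// Splitting on dots looks for queryTimestamps -> eventTime -> max: not found.
val naive = resolve(progress, "queryTimestamps.eventTime.max")

// Treating the dotted segment as one key (what backticks express) works.
val quoted = progress.get("queryTimestamps") match {
  case Some(m: Map[String, Any] @unchecked) => m.get("eventTime.max")
  case _ => None
}
```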
One option is we could use dashes instead, but honestly I'm getting a little confused by all these timestamps. In particular, I'm not sure what the difference is between `triggerTimestamp` and `processingTime` (and I think that having `processingTime` mean anything different from `System.currentTimeMillis` will be confusing to users coming from other systems).
The two things that I think you really want to be able to track from the metrics feed are:
- The actual timestamp when this progress update was produced. I think this should remain top level and be called `timestamp`.
- Stats about the event time, so that I can know roughly what data is present in this batch. This can be useful for several reasons, including finding other problems upstream. The more I think about this, I think you just want to see this as:
  `"eventTime": { "min": ..., "max": ..., "watermark": ... }`
It's actually not clear to me why you need to know the original batch start time. In 99% of cases this is the same as `timestamp`. If you are re-executing a batch due to a failure, it will be different. But why does an outside monitoring job care? I can't come up with any interesting graphs that you would make with this extra field.
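Under this proposal, a progress event might look roughly like the following (a hypothetical sketch of the shape only; the values are made up for illustration):

```json
{
  "timestamp" : "2016-12-05T20:54:20.827Z",
  "numInputRows" : 678,
  "inputRowsPerSecond" : 10.0,
  "eventTime" : {
    "min" : "2016-12-05T20:52:00.000Z",
    "max" : "2016-12-05T20:54:10.000Z",
    "watermark" : "2016-12-05T20:44:00.000Z"
  }
}
```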
My view is that the `StreamingQueryProgress` class is not just for monitoring but for debugging as well. The batchProcessingTime may be important for debugging why a batch generated certain results in that 1% of cases where the trigger time is different from the processing time, and in those cases there is no other way to find out what the batchProcessingTime of that batch was if it is not exposed through the Progress API.
That said, we could hold off on exposing batchProcessingTime now and expose only eventTime. But it may be more complex to add another new field in Progress to expose the processing time later (as it cannot be added to the map once we name it `eventTime`).
It is available; it is stored in a human-readable format in the offset log (BTW, is that a long or a timestamp?). I think in the long run we'll want to open up another API that gives you access to this log, but I think that can come later. For now, it's still pretty easy to find.
But the log gets cleaned up continuously, so it will not be available if you are trying to debug an hour or a day later.
Not any more... we keep 1000 now, right? This just really feels like we are reaching for a use case. We can always add it, but I think the way it is done currently in this PR is very confusing, and having just `timestamp` and `eventTime` is very clear.
1000 * 5-second batches is < 2 hours. Anyways, I think it will be equally or more confusing to add it later as a top-level field in Progress (e.g. `timestamp`, `eventTime`, `processingTime`?). It may be better to have the top-level `timestamp`, and all other execution-level detailed timestamps inside a single map.
Anyways, I am updating the PR, but I think this is a little short-sighted.
```
@@ -38,13 +38,18 @@ class StreamingQueryStatusAndProgressSuite extends SparkFunSuite {
     "id" : "${testProgress1.id.toString}",
     "runId" : "${testProgress1.runId.toString}",
     "name" : "myName",
     "timestamp" : "2016-12-05T20:54:20.827Z",
     "triggerTimestamp" : "2016-12-05T20:54:20.827Z",
```
@marmbrus One idea is to not have a top-level timestamp, and merge this with `queryTimestamps`. Then this would be a key `triggerStartTime` in the `queryTimestamps` map. In fact, then we can rename `queryTimestamps` to `timestamps`.
```
@@ -360,6 +360,24 @@ class StreamExecution(
    if (hasNewData) {
      // Current batch timestamp in milliseconds
      offsetSeqMetadata.batchTimestampMs = triggerClock.getTimeMillis()
      // Update the eventTime watermark if we find one in the plan.
```
I moved this such that the watermark is updated before starting a batch, rather than after finishing a batch. This keeps `batchWatermarkMs` consistent with `batchTimestampMs` (both are set before starting a batch) and reduces complexity in the ProgressReporter.
+1, this makes sense.
```
@@ -33,27 +34,6 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils

/**
 * :: Experimental ::
 * Information about updates made to stateful operators in a [[StreamingQuery]] during a trigger.
```
I moved this below to keep the StreamingQueryProgress class first in the file. More important code first.
That's actually the opposite of how most code in SQL is laid out, so I think it would be better to avoid this change. The logic here is that declarations that are used later should come first (references before declarations make the code harder to read), and stuff at the end of the file is kind of hidden.
aah, then for consistency, SourceProgress and SinkProgress should also be before StreamingQueryProgress. But that's a bigger change that should be done in a different PR.
```
@@ -44,7 +44,12 @@ class StreamingQueryStatusAndProgressSuite extends SparkFunSuite {
    "durationMs" : {
      "total" : 0
    },
    "currentWatermark" : 3,
    "eventTime" : {
```
awesome!
Test build #70042 has finished for PR 16258 at commit
```
@@ -67,7 +67,7 @@ class StateOperatorProgress private[sql](
 * Similarly, when there is no data to be processed, the batchId will not be
 * incremented.
 * @param durationMs The amount of time taken to perform various operations in milliseconds.
 * @param currentWatermark The current event time watermark in milliseconds
 * @param eventTime Statistics of event time seen in this batch
```
nit: since you are touching this file, could you also fix the comment of `@param timestamp`? It's better to document the format as well, such as: "The beginning time of the trigger in ISO 8601 format, e.g. 2016-12-05T20:54:20.827Z".
In addition, could you also add an example in the comment of `eventTime`?
Looks good overall. Just some nits.
```
  this.count += that.count
}

def avg: Long = (sum.toDouble / count).toLong
```
nit: why not use `sum / count`?
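The two forms agree for typical values, since `Long` division already truncates toward zero; routing the sum through `Double` only differs once the sum exceeds a `Double`'s 53-bit mantissa and low-order bits are lost. A standalone sketch (not the PR's code) comparing the two:

```scala
// Standalone comparison of the two averaging styles discussed above.
def avgViaDouble(sum: Long, count: Long): Long = (sum.toDouble / count).toLong
def avgViaLong(sum: Long, count: Long): Long = sum / count

// Typical values: identical results (both truncate toward zero).
val small1 = avgViaDouble(7L, 2L)
val small2 = avgViaLong(7L, 2L)

// Sums beyond 2^53 lose low-order bits when routed through Double.
val bigSum = (1L << 60) + 1L
val lossy = avgViaDouble(bigSum, 1L)  // the +1 is rounded away
val exact = avgViaLong(bigSum, 1L)    // exact integer division
```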
```
override protected def doExecute(): RDD[InternalRow] = {
  child.execute().mapPartitions { iter =>
    val getEventTime = UnsafeProjection.create(eventTime :: Nil, child.output)
    iter.map { row =>
      maxEventTime.add(getEventTime(row).getLong(0))
      eventTimeStats.add((getEventTime(row).getLong(0).toDouble / 1000).toLong)
```
nit: could be `getEventTime(row).getLong(0) / 1000`
```
/** Tracks the maximum positive long seen. */
class MaxLong(protected var currentValue: Long = 0)
  extends AccumulatorV2[Long, Long] {
/** Class for collecting event time stats with an accumulator */
```
nit: please document the time unit.
Documented it in EventTimeWatermarkExec
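As a rough, standalone sketch of the kind of stats value being accumulated here (the class name and the millisecond unit are assumptions for illustration, not Spark's exact `EventTimeStats` implementation):

```scala
// Hypothetical event-time stats holder; all times in epoch milliseconds.
case class EventTimeStatsSketch(
    var max: Long = Long.MinValue,
    var min: Long = Long.MaxValue,
    var sum: Long = 0L,
    var count: Long = 0L) {

  def add(eventTimeMs: Long): Unit = {
    max = math.max(max, eventTimeMs)
    min = math.min(min, eventTimeMs)
    sum += eventTimeMs
    count += 1
  }

  // Merge partial stats accumulated on another partition.
  def merge(that: EventTimeStatsSketch): Unit = {
    max = math.max(max, that.max)
    min = math.min(min, that.min)
    sum += that.sum
    count += that.count
  }

  def avg: Long = sum / count
}
```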
Test build #70051 has finished for PR 16258 at commit
Test build #70053 has finished for PR 16258 at commit
Test build #70052 has finished for PR 16258 at commit
Test build #70058 has finished for PR 16258 at commit
LGTM
Merging to master and 2.1
## What changes were proposed in this pull request?

- Changed `StreamingQueryProgress.watermark` to `StreamingQueryProgress.queryTimestamps`, which is a `Map[String, String]` containing the following keys: "eventTime.max", "eventTime.min", "eventTime.avg", "processingTime", "watermark". All of them are UTC-formatted strings.
- Renamed `StreamingQuery.timestamp` to `StreamingQueryProgress.triggerTimestamp` to differentiate it from `queryTimestamps`. It has the timestamp of when the trigger was started.

## How was this patch tested?

Updated tests

Author: Tathagata Das <[email protected]>

Closes #16258 from tdas/SPARK-18834.

(cherry picked from commit c68fb42)
Signed-off-by: Tathagata Das <[email protected]>
```
 * @param currentWatermark The current event time watermark in milliseconds
 * @param eventTime Statistics of event time seen in this batch. It may contain the following keys:
 * {
 *   "max" -> "2016-12-05T20:54:20.827Z" // maximum event time seen in this trigger
```
Hi all, I am just leaving a comment as a gentle reminder that we probably should replace `<` and `>` with alternatives such as `{@literal <}` and `{@literal >}` in the future. Please refer to #16013 (comment). This breaks the javadoc 8 build.
```
[error] .../java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:19: error: bad use of '>'
[error] * "max" -> "2016-12-05T20:54:20.827Z" // maximum event time seen in this trigger
[error]          ^
[error] .../java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:20: error: bad use of '>'
[error] * "min" -> "2016-12-05T20:54:20.827Z" // minimum event time seen in this trigger
[error]          ^
[error] .../java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:21: error: bad use of '>'
[error] * "avg" -> "2016-12-05T20:54:20.827Z" // average event time seen in this trigger
[error]          ^
[error] .../java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:22: error: bad use of '>'
[error] * "watermark" -> "2016-12-05T20:54:20.827Z" // watermark used in this trigger
[error]                ^
```
…in months or years

## What changes were proposed in this pull request?

Two changes:
- Fix how delays specified in months and years are translated to milliseconds
- Following up on #16258, do not show the watermark when there is no watermarking in the query

## How was this patch tested?

Updated and new unit tests

Author: Tathagata Das <[email protected]>

Closes #16304 from tdas/SPARK-18834-1.

(cherry picked from commit 607a1e6)
Signed-off-by: Shixiong Zhu <[email protected]>