[SPARK-22912] v2 data source support in MicroBatchExecution #20097
Conversation
/cc @zsxwing
Test build #85445 has finished for PR 20097 at commit
Test build #85478 has finished for PR 20097 at commit
Test build #85491 has finished for PR 20097 at commit
retest this please
Test build #85525 has finished for PR 20097 at commit
Overall looks quite good, just a few minor comments.
@@ -35,6 +35,12 @@ case class DataSourceV2Relation(
  }
}

class StreamingDataSourceV2Relation(
Add docs. What is the purpose of this class?
@@ -68,8 +76,20 @@ class MicroBatchExecution(
  // "df.logicalPlan" has already used attributes of the previous `output`.
  StreamingExecutionRelation(source, output)(sparkSession)
})
case s @ StreamingRelationV2(v2DataSource, _, _, output, v1DataSource)
    if !v2DataSource.isInstanceOf[MicroBatchReadSupport] =>
case s @ StreamingRelationV2(source: MicroBatchReadSupport, _, options, output, _) =>
Nit: can you add a comment above this section of the code to explain what it is doing? It took me a while to remember the context.
    options)
  // Generate the V1 node to catch errors thrown within generation.
  try {
    StreamingRelation(v1DataSource)
nit: This line seems to be used in all 3 cases and could be deduped.
In fact, it's confusing to have a single line like this with no variable assignment, just to enforce the side effects of StreamingRelation.apply().
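For illustration, the dedup might look something like this (the helper name is hypothetical, not from the PR; only the StreamingRelation(v1DataSource) call comes from the diff above):

import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.execution.streaming.StreamingRelation

// Hypothetical helper: build the V1 node once, purely to surface generation errors,
// and call it from each of the three match cases instead of repeating the bare line.
private def validateV1Relation(v1DataSource: DataSource): Unit = {
  // StreamingRelation.apply() is invoked only for its error-raising side effects;
  // the returned node is intentionally discarded.
  StreamingRelation(v1DataSource)
}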
    case e: UnsupportedOperationException
        if e.getMessage.contains("does not support streamed reading") =>
      // If v1 wasn't supported for this source, that's fine; just proceed onwards with v2.
  }
Shouldn't this try...catch be added to the next case as well?
  try {
    StreamingRelation(v1DataSource)
  } catch {
    case e: UnsupportedOperationException
Can you link to the exception this is supposed to catch? Do we really have to match on the message string? That seems pretty brittle for something as crucial as checking whether streaming is supported.
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
Line 266 in 9a2b65a
throw new UnsupportedOperationException(
I agree that it would be nice to change this exception, but I don't know whether we can.
On reflection, there's actually a better way to do this which does not need to use exceptions as control flow. I didn't notice before because lookupDataSource returns Class[_] for some reason.
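Roughly, the idea is to check the looked-up class instead of catching the exception. A minimal sketch, assuming lookupDataSource(name, conf) is the available signature and using StreamSourceProvider as the marker for v1 streamed reading (file-based sources would also need a FileFormat check; the variable names are illustrative):

import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.execution.streaming.StreamingRelation
import org.apache.spark.sql.sources.StreamSourceProvider

val providerClass: Class[_] =
  DataSource.lookupDataSource(sourceName, sparkSession.sessionState.conf)
if (classOf[StreamSourceProvider].isAssignableFrom(providerClass)) {
  // Safe to build the V1 node without relying on a message-matched exception.
  StreamingRelation(v1DataSource)
}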
class RateStreamV2Reader(options: DataSourceV2Options)
    extends MicroBatchReader {
  implicit val defaultFormats: DefaultFormats = DefaultFormats

  val clock = new SystemClock
  val clock = if (options.get("useManualClock").orElse("false").toBoolean) new ManualClock
nit: put this in a { ... }
Also mention that the manual clock is only used for testing, so that someone looking at the source does not mistake this for a publicly visible feature.
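For example, the suggested form might look like this (the comment wording is mine, not from the PR):

import org.apache.spark.util.{ManualClock, SystemClock}

val clock = {
  // useManualClock is only used by tests; it is not a publicly supported option.
  if (options.get("useManualClock").orElse("false").toBoolean) new ManualClock
  else new SystemClock
}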
private val numPartitions =
  options.get(RateStreamSourceV2.NUM_PARTITIONS).orElse("5").toInt
  options.get(RateStreamSourceV2.NUM_PARTITIONS).orElse("1").toInt
why this change?
@@ -111,7 +112,7 @@ class RateStreamV2Reader(options: DataSourceV2Options)

val packedRows = mutable.ListBuffer[(Long, Long)]()
var outVal = startVal + numPartitions
var outTimeMs = startTimeMs + msPerPartitionBetweenRows
var outTimeMs = startTimeMs
why this change?
The original behavior was an off-by-one error. With 1 partition and 1 row per second, for example, every row would come timestamped 1 second after it was actually generated.
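To make the off-by-one concrete, here is a self-contained sketch with made-up values (1 partition, 1 row per second, batch starting at t = 0; this is not the exact loop from the source):

import scala.collection.mutable

val startTimeMs = 0L
val msPerPartitionBetweenRows = 1000L
val packedRows = mutable.ListBuffer[(Long, Long)]()

var outTimeMs = startTimeMs  // fixed initialization
// var outTimeMs = startTimeMs + msPerPartitionBetweenRows  // old, off-by-one initialization
var outVal = 0L
while (outVal < 3) {
  packedRows += ((outTimeMs, outVal))
  outVal += 1
  outTimeMs += msPerPartitionBetweenRows
}
// With the fix: (0,0), (1000,1), (2000,2).
// With the old initialization, every timestamp is one interval late: (1000,0), (2000,1), (3000,2).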
@@ -317,6 +355,8 @@ class MicroBatchExecution(
if (prevBatchOff.isDefined) {
  prevBatchOff.get.toStreamProgress(sources).foreach {
    case (src: Source, off) => src.commit(off)
    case (reader: MicroBatchReader, off) =>
      reader.commit(reader.deserializeOffset(off.json))
Why not call reader.commit(off) directly?
Isn't this a SerializedOffset here?
I am not sure. I am just comparing with the previous case, src.commit(off).
I think until now it was the responsibility of the Source to check whether the passed Offset instance was an instance of the custom offset type (e.g. KafkaSourceOffset) and accordingly either use it directly or deserialize it. That avoids unnecessary conversions.
Quick summary:
V1 sources were silently responsible for checking every offset they receive, and deserializing it if it's a SerializedOffset.
This is awkward, so SerializedOffset isn't being migrated to the v2 API. For v2 sources, all Offset instances passed to a reader will have the right type in the first place, and the execution engine will deserialize them from JSON form with the deserializeOffset handle. In the long term, the conversion to SerializedOffset can be removed entirely.
But as long as the v1 path is around, we can't (without a lot of pointless effort) change the offset log to not return Offset instances. So we have to pull the JSON out of the SerializedOffset and then deserialize it properly.
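A rough sketch of the contrast (paraphrased; KafkaSourceOffset is just an example of a source-specific offset type, and the v1 body below is illustrative, not copied from any real source):

// v1: every Source has to recognize SerializedOffset itself.
class SomeV1Source extends Source {
  override def commit(end: Offset): Unit = {
    val typedOffset = end match {
      case so: SerializedOffset => KafkaSourceOffset(so)  // must deserialize it itself
      case ko: KafkaSourceOffset => ko                    // may already be typed
    }
    // ... commit typedOffset ...
  }
  // (other Source methods omitted)
}
// v2: the engine converts once, so readers only ever see properly typed offsets:
// reader.commit(reader.deserializeOffset(off.json))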
Gotcha. Makes sense.
@@ -236,14 +257,31 @@ class MicroBatchExecution(
val hasNewData = {
  awaitProgressLock.lock()
  try {
    val latestOffsets: Map[Source, Option[Offset]] = uniqueSources.map {
    val latestOffsets: Map[BaseStreamingSource, Option[Offset]] = uniqueSources.map {
nit: Can you add more comments in this section to explain what this does? This section is becoming more complicated.
I tried, but I'm not quite sure what to add.
    Optional.of(available.asInstanceOf[v2.streaming.reader.Offset]))
  logDebug(s"Retrieving data from $reader: $current -> $available")
  Some(reader ->
    new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
Maybe name this class MicrobatchDataSourceV2Relation to be more specific?
Continuous execution actually should be using it too, since the isStreaming bit should still be set.
Does the ContinuousExecution use it? It seems like this class was added in this PR. Are you planning to modify the ContinuousExecution to use it in the future?
@@ -226,7 +226,7 @@ trait ProgressReporter extends Logging {
// 3. For each source, we sum the metrics of the associated execution plan leaves.
//
val logicalPlanLeafToSource = newData.flatMap { case (source, df) =>
  df.logicalPlan.collectLeaves().map { leaf => leaf -> source }
  df.collectLeaves().map { leaf => leaf -> source }
rename df -> logicalPlan
Test build #85641 has finished for PR 20097 at commit
case s: Sink => s.addBatch(currentBatchId, nextBatch)
case s: MicroBatchWriteSupport =>
  // Execute the V2 writer node in the query plan.
  nextBatch.collect()
Make it clear in the comments that this collect() does not accumulate any data, only forces the execution of the writer.
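For example, the comment might read something like this (wording is mine):

case s: MicroBatchWriteSupport =>
  // The V2 writer node is already part of nextBatch's plan, so this collect() does
  // not accumulate any rows on the driver; it only forces the writer node to execute.
  nextBatch.collect()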
    trigger,
    triggerClock,
    outputMode,
    extraOptions,
    deleteCheckpointOnStop))
case _ =>
  throw new AnalysisException(
Is this the only other option? I am afraid that in some random situation that matches none of the cases, this error will be thrown, which is misleading.
For example, what happens if the sink supports only MicroBatchWrite and a ContinuousTrigger is specified? Shouldn't that be an error with a different error message?
I think this is the only other option. MicroBatchWriteSupport and Sink will have already matched with any trigger, ContinuousWriteSupport will have already matched with a continuous trigger, and there aren't any other implementations of BaseStreamingSink.
I agree it's cleaner to check explicitly.
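For illustration, a fully explicit check could look something like this (the error messages and the exact shape of the match are illustrative, not the code in this PR):

(sink, trigger) match {
  case (_: ContinuousWriteSupport, _: ContinuousTrigger) =>
    // ... continuous execution ...
  case (s, _: ContinuousTrigger) =>
    throw new AnalysisException(
      s"Sink ${s.getClass.getName} does not support continuous processing, " +
        "but a continuous trigger was specified.")
  case (_: MicroBatchWriteSupport, _) | (_: Sink, _) =>
    // ... microbatch execution ...
  case _ =>
    throw new AnalysisException(s"Unsupported sink: $sink")
}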
Test build #85643 has finished for PR 20097 at commit
@@ -32,13 +32,17 @@ import org.apache.spark.sql.sources.v2.DataSourceV2Options
import org.apache.spark.sql.sources.v2.reader._
import org.apache.spark.sql.sources.v2.streaming.reader.{MicroBatchReader, Offset}
import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}
import org.apache.spark.util.SystemClock
import org.apache.spark.util.{ManualClock, SystemClock}

class RateStreamV2Reader(options: DataSourceV2Options)
Can you rename this to MicroBatchRateStreamReader, to make it consistent with ContinuousRateStreamReader?
 * This is a temporary register as we build out v2 migration. Microbatch read support should
 * be implemented in the same register as v1.
 */
class RateSourceProviderV2 extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {
Move this to the file RateStreamSourceV2.scala
Test build #85646 has finished for PR 20097 at commit
Test build #85647 has finished for PR 20097 at commit
Just a few more minor comments.
    deleteCheckpointOnStop: Boolean)
  extends StreamExecution(
    sparkSession, name, checkpointRoot, analyzedPlan, sink,
    trigger, triggerClock, outputMode, deleteCheckpointOnStop) {

  private def toJava(
Super nit: I usually prefer moving such small, less important methods to the end of the class.
// Once v1 streaming source execution is gone, we can refactor this away.
// For now, we set the range here to get the source to infer the available end offset,
// get that offset, and then set the range again when we later execute.
s.setOffsetRange(
incorrect indentation.
val current = committedOffsets.get(reader).map(off => reader.deserializeOffset(off.json))
reader.setOffsetRange(
  toJava(current),
  Optional.of(available.asInstanceOf[v2.streaming.reader.Offset]))
v2.streaming.reader.Offset is being used in a lot of places. Please rename it to OffsetV2 in the imports and use that in all places.
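Concretely, the rename would look roughly like this when applied to the lines above:

import java.util.Optional
import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2}

val current = committedOffsets.get(reader).map(off => reader.deserializeOffset(off.json))
reader.setOffsetRange(
  toJava(current),
  Optional.of(available.asInstanceOf[OffsetV2]))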
    newAttributePlan.schema,
    outputMode,
    new DataSourceV2Options(extraOptions.asJava))
  Option(writer.orElse(null)).map(WriteToDataSourceV2(_, newAttributePlan)).getOrElse {
Can you add a comment explaining why the fallback is a LocalRelation, and when the writer can be empty?
I don't think the writer can ever be empty. Would you prefer an assert here?
The writer can be empty. If the sink decides that data does not need to be written, then the returned writer can be None. See the docs of createWriter.
So just documenting that here is fine, to avoid confusion like this.
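For example, the documented version might read (comment wording is mine):

// createMicroBatchWriter may return an empty Optional when the sink decides that no
// data needs to be written for this batch. Fall back to an empty streaming
// LocalRelation so the rest of the plan still has a valid node to execute against.
Option(writer.orElse(null)).map(WriteToDataSourceV2(_, newAttributePlan)).getOrElse {
  LocalRelation(newAttributePlan.schema.toAttributes, isStreaming = true)
}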
Option(writer.orElse(null)).map(WriteToDataSourceV2(_, newAttributePlan)).getOrElse {
  LocalRelation(newAttributePlan.schema.toAttributes, isStreaming = true)
}
case _ => throw new IllegalArgumentException("unknown sink type")
Add the sink to the string as well so that it's easy to debug.
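For example (exact wording illustrative):

case _ => throw new IllegalArgumentException(s"unknown sink type: $sink")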
    deleteCheckpointOnStop: Boolean)
  extends StreamExecution(
    sparkSession, name, checkpointRoot, analyzedPlan, sink,
    trigger, triggerClock, outputMode, deleteCheckpointOnStop) {

  private def toJava(
      scalaOption: Option[v2.streaming.reader.Offset]): Optional[v2.streaming.reader.Offset] = {
Mentioned elsewhere as well: import the new Offset as OffsetV2 instead of using the full package path.
ds match {
  case s: MicroBatchReadSupport =>
    val tempReader = s.createMicroBatchReader(
      java.util.Optional.ofNullable(userSpecifiedSchema.orNull),
Please import Optional, the full path is used in multiple places.
@@ -35,6 +35,16 @@ case class DataSourceV2Relation(
  }
}

/**
 * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwwise identical
nit: Otherwwise ?? :)
override def shortName(): String = "ratev2"
}

class MicroBatchRateStreamReader(options: DataSourceV2Options)
As with the other kafka PR, can you rename these classes to start with "RateStream"? Only if it is not too much refactoring, otherwise we can clean this up later.
case v2Sink: ContinuousWriteSupport =>
  UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
  new StreamingQueryWrapper(new ContinuousExecution(
case (_: MicroBatchWriteSupport, _) | (_: Sink, _) =>
Shouldn't we throw an error for the case of a MicroBatchWriteSupport sink (one that does not have ContinuousWriteSupport) combined with a ContinuousTrigger?
As discussed offline, we do throw that error in the MicroBatchExecution constructor. Once all the pieces are in we'll need to refactor this a bit to get all the checking in the same place.
Test build #85730 has finished for PR 20097 at commit
LGTM. Merging it to master.
Hi, @tdas.
Yes. My bad. I didn't realize the branch had already been cut.
What changes were proposed in this pull request?
Support for v2 data sources in microbatch streaming.
How was this patch tested?
A very basic new unit test on the toy v2 implementation of rate source. Once we have a v1 source fully migrated to v2, we'll need to do more detailed compatibility testing.
Author: Jose Torres <[email protected]>
Closes #20097 from jose-torres/v2-impl.
Thank you, @tdas!
case _: Sink => newAttributePlan
case s: MicroBatchWriteSupport =>
  val writer = s.createMicroBatchWriter(
    s"$runId",
Should it not be id? How does having runId handle the scenario of restarting stream jobs?
What changes were proposed in this pull request?
Support for v2 data sources in microbatch streaming.
How was this patch tested?
A very basic new unit test on the toy v2 implementation of rate source. Once we have a v1 source fully migrated to v2, we'll need to do more detailed compatibility testing.