
[SPARK-23052][SS] Migrate ConsoleSink to data source V2 api. #20243

Closed
jose-torres wants to merge 20 commits from jose-torres/console-sink

Conversation

jose-torres
Contributor

What changes were proposed in this pull request?

Migrate ConsoleSink to data source V2 api.

Note that this includes a missing piece in DataStreamWriter required to specify a data source V2 writer.

Note also that I've removed the "Rerun batch" part of the sink, because as far as I can tell this would never have actually happened. A MicroBatchExecution object will only commit each batch once for its lifetime, and a new MicroBatchExecution object would have a new ConsoleSink object which doesn't know it's retrying a batch. So I think this represents an anti-feature rather than a weakness in the V2 API.

How was this patch tested?

new unit test

@jose-torres jose-torres changed the title [SPARK-23052] Migrate ConsoleSink to data source V2 api. [SPARK-23052][SS] Migrate ConsoleSink to data source V2 api. Jan 12, 2018
@jose-torres
Contributor Author

I split off PackedRowWriterFactory with the intent to refactor MemorySinkV2 to use it later, but I just realized no refactoring is actually needed. So I've slotted it in.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86017 has finished for PR 20243 at commit 71cc6e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86018 has finished for PR 20243 at commit b52c990.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

My realization above was not in fact true. I took out the MemorySinkV2 changes and will do them in a later PR.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86051 has finished for PR 20243 at commit e3af17c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@tdas tdas left a comment


Looks good, though it needs more tests with fake data sources to verify that these cases are handled correctly.

    schema: StructType,
    mode: OutputMode,
    options: DataSourceV2Options): Optional[DataSourceV2Writer] = {
  Optional.of(new ConsoleWriter(epochId, schema, options.asMap.asScala.toMap))
}

def createRelation(
Contributor

What is createRelation used for? For batch?

Contributor Author

I assume so. I'm not familiar with it, but it's not on the streaming source codepath.


/**
 * A simple [[DataWriterFactory]] whose tasks just pack rows into the commit message for delivery
 * to the [[org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer]] on the driver.
Contributor

super nit: to a DataSourceV2Writer

case _ => throw new AnalysisException(
  s"Data source $source does not support continuous writing")
}
val ds = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
Contributor

We are checking for the same conditions here as well as in the StreamingQueryManager.createQuery. I think we need to refactor this, probably sometime in the future once we get rid of v1 completely.

Either way, we should immediately add a general test suite (say StreamingDataSourceV2Suite) that tests these cases with various fake data sources.
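
For context, one of those fake data sources might look roughly like this. This is only a sketch based on the class and short names that appear later in this PR's test output (FakeStreamingNeitherMode, fake-neither-mode), not the actual test code:

  import org.apache.spark.sql.sources.DataSourceRegister
  import org.apache.spark.sql.sources.v2.DataSourceV2

  // Registers under the short name used in the error-message assertions below.
  // It implements neither MicroBatchWriteSupport nor ContinuousWriteSupport,
  // so DataStreamWriter should reject it for streaming writes.
  class FakeStreamingNeitherMode extends DataSourceRegister with DataSourceV2 {
    override def shortName(): String = "fake-neither-mode"
  }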

class ConsoleWriter(batchId: Long, schema: StructType, options: Map[String, String])
    extends DataSourceV2Writer with Logging {
  // Number of rows to display, by default 20 rows
  private val numRowsToShow = options.get("numRows").map(_.toInt).getOrElse(20)
Contributor

I am not sure get("numRows") on this options map is case-insensitive. Why not just pass the DataSourceV2Options to this writer and use that directly? It is already case-insensitive.

In fact, the same pattern should be used for all v2 readers/writers (verify this for Kafka continuous).

Contributor

ConsoleRelation creates this map from a DataSourceV2Options, so it contains lowercased keys.
Using DataSourceV2Options or asking for "numrows" would both work, but with DataSourceV2Options,
  options.get("numRows").map(_.toInt).getOrElse(20)
could also be simplified to
  options.getInt("numRows", 20)
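
For illustration, a minimal sketch of that simplification, assuming the writer is changed to take the DataSourceV2Options directly as discussed above:

  // DataSourceV2Options lookups are case-insensitive, and getInt applies the
  // default when the key is absent, so the map/getOrElse chain goes away.
  private val numRowsToShow = options.getInt("numRows", 20)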

Contributor Author

It's not that easy for Kafka continuous, because we're feeding the maps into utility methods (and Kafka-level interfaces) which insist on a java.util.Map[String, Object]. Fortunately the parameters already appear to be case sensitive there, and I think we have tests verifying that various parameters can be specified.


private val numRowsToShow = options.get("numRows").map(_.toInt).getOrElse(20)

// Truncate the displayed data if it is too long, by default it is true
private val isTruncated = options.get("truncate").map(_.toBoolean).getOrElse(true)
Contributor

Same simplification is possible here if DataSourceV2Options is used.
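
That is, something along these lines, assuming DataSourceV2Options provides a getBoolean analogous to the getInt used above:

  // Case-insensitive lookup with a built-in default, mirroring the numRows change.
  private val isTruncated = options.getBoolean("truncate", true)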

|
|""".stripMargin)
}
}
Contributor

We could have a test to check numRows, something like this:

  test("console with numRows") {
    val input = MemoryStream[Int]

    val captured = new ByteArrayOutputStream()
    Console.withOut(captured) {
      val query = input.toDF().writeStream.format("console").option("NUMROWS", 2).start()
      try {
        input.addData(1, 2, 3)
        query.processAllAvailable()
      } finally {
        query.stop()
      }
    }

    assert(captured.toString() ==
      """-------------------------------------------
        |Batch: 0
        |-------------------------------------------
        |+-----+
        ||value|
        |+-----+
        ||    1|
        ||    2|
        |+-----+
        |only showing top 2 rows
        |
        |""".stripMargin)
  }

  test("console with truncation") {
    val input = MemoryStream[String]

    val captured = new ByteArrayOutputStream()
    Console.withOut(captured) {
      val query = input.toDF().writeStream.format("console").option("TRUNCATE", true).start()
      try {
        input.addData("123456789012345678901234567890")
        query.processAllAvailable()
      } finally {
        query.stop()
      }
    }

    assert(captured.toString() ==
      """-------------------------------------------
        |Batch: 0
        |-------------------------------------------
        |+--------------------+
        ||               value|
        |+--------------------+
        ||12345678901234567...|
        |+--------------------+
        |
        |""".stripMargin)
  }

Contributor Author

Indeed we could. Thanks for writing out the tests!

@@ -17,58 +17,36 @@

package org.apache.spark.sql.execution.streaming

import org.apache.spark.internal.Logging
Contributor

can you move this file into the sources subdirectory to make it consistent with other v2 sources?

Contributor

In fact, this file can be merged into ConsoleWriter.scala. The combined file will be named console.scala.

Contributor Author

I can do this in a followup PR. It's not as simple as just moving it; we have to add an alias so that .format("org.apache.spark.sql.execution.streaming.ConsoleSinkProvider") continues to work.

Contributor

argh. okay. later then.


assert(ex.getMessage.contains(
"Data source fake-neither-mode does not support continuous writing"))
}
Contributor

You are testing only the different types of sinks, not the different types of sources.

Contributor Author

Added tests for all 4 * 4 * 3 combinations of source/sink/trigger. Note that:

  • I had to revert the earlier change to initialize ContinuousExecution.sources to null, because it turns out this interferes with error generation on newly constructed executions.
  • Two of the cases don't throw the error until after start(). Fixing this will take a fairly disruptive set of changes; the problem is that DataStreamWriter doesn't have direct visibility into which sources were used to generate it. We'd need to crawl the tree, similarly to how we do it in the execution.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86209 has finished for PR 20243 at commit be880b1.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class ConsoleWriter(batchId: Long, schema: StructType, options: DataSourceV2Options)

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86213 has finished for PR 20243 at commit fac17a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86212 has finished for PR 20243 at commit e4c6429.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class FakeStreamingNeitherMode extends DataSourceRegister with DataSourceV2
  • class StreamingDataSourceV2Suite extends StreamTest

@@ -69,7 +69,7 @@ class ContinuousExecution(
        ContinuousExecutionRelation(source, extraReaderOptions, output)(sparkSession)
      })
    case StreamingRelationV2(_, sourceName, _, _, _) =>
-     throw new AnalysisException(
+     throw new UnsupportedOperationException(
Contributor

Why this change? An incorrect data source is not an operation.

Contributor Author

I think there's an argument that it is - you're asking the data source (which is correct in the sense that it's a real, existing source) to do a type of read/write it doesn't support.

The primary motivation is that the existing code has already made the choice to throw an UnsupportedOperationException when you try to stream from a source that only outputs in batch mode.

@@ -54,7 +54,7 @@ class ContinuousExecution(
    sparkSession, name, checkpointRoot, analyzedPlan, sink,
    trigger, triggerClock, outputMode, deleteCheckpointOnStop) {

-  @volatile protected var continuousSources: Seq[ContinuousReader] = _
+  @volatile protected var continuousSources: Seq[ContinuousReader] = Seq()
Contributor

Why this change? Is it related to this PR?

Contributor Author

Yes. As mentioned in an earlier comment, initializing to null means the StreamingQueryException can't be constructed if the failure happens before the sources are set.

import org.apache.spark.sql.sources.v2.writer.{DataSourceV2Writer, DataWriterFactory, WriterCommitMessage}
import org.apache.spark.sql.types.StructType

class ConsoleWriter(batchId: Long, schema: StructType, options: DataSourceV2Options)
Contributor

Add docs and link it to ConsoleSinkProvider, since it's in a different file.

override def commit(messages: Array[WriterCommitMessage]): Unit = synchronized {
  val batch = messages.collect {
    case PackedRowCommitMessage(rows) => rows
  }.fold(Array())(_ ++ _)
Contributor

Why this complicated fold? Just array.collect { ... } returns an Array, doesn't it?

Contributor Author

It returns an array of arrays of rows, which isn't what we need.

Contributor

You can use flatten instead of fold. Much cleaner.
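
A sketch of the flattened version, with the display logic elided behind a hypothetical printRows helper:

  override def commit(messages: Array[WriterCommitMessage]): Unit = synchronized {
    // collect produces an Array[Array[Row]]; flatten concatenates it into one Array[Row].
    val batch = messages.collect {
      case PackedRowCommitMessage(rows) => rows
    }.flatten
    printRows(batch)  // hypothetical stand-in for the existing console-printing code
  }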


case class PackedRowCommitMessage(rows: Array[Row]) extends WriterCommitMessage

class PackedRowDataWriter() extends DataWriter[Row] with Logging {
Contributor

add docs.

}
}

case class PackedRowCommitMessage(rows: Array[Row]) extends WriterCommitMessage
Contributor

add docs.

override def write(row: Row): Unit = data.append(row)

override def commit(): PackedRowCommitMessage = {
  val msg = PackedRowCommitMessage(data.clone().toArray)
Contributor
@tdas tdas Jan 17, 2018

Why are you cloning and then calling toArray? Just data.toArray will create an immutable copy.
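
In other words, something like this (a sketch assuming data is a mutable ArrayBuffer[Row], as the surrounding code suggests):

  override def commit(): PackedRowCommitMessage = {
    // toArray already copies the buffer into a fresh Array, so the clone() is redundant.
    PackedRowCommitMessage(data.toArray)
  }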

val ds = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
val sink = (ds.newInstance(), trigger) match {
  case (w: ContinuousWriteSupport, _: ContinuousTrigger) => w
  case (_, _: ContinuousTrigger) => throw new UnsupportedOperationException(
Contributor

AnalysisException. An incorrect trigger or incompatible data source is not an operation.

Contributor Author

as above

case (w: ContinuousWriteSupport, _: ContinuousTrigger) => w
case (_, _: ContinuousTrigger) => throw new UnsupportedOperationException(
  s"Data source $source does not support continuous writing")
case (w: MicroBatchWriteSupport, _) => w
Contributor

Isn't there a case where it does not have MicroBatchWriteSupport but the trigger is ProcessingTime/OneTime? That should have a different error message.

Contributor Author

In that case, we have to just fall back to the V1 path, because V1 sinks don't have MicroBatchWriteSupport.
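
Putting the two comments together, the dispatch being described looks roughly like this. It is a sketch assembled from the snippets above; createV1Sink() is a hypothetical stand-in for the existing V1 code path:

  val ds = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
  val sink = (ds.newInstance(), trigger) match {
    // A continuous trigger requires a continuous-capable V2 writer.
    case (w: ContinuousWriteSupport, _: ContinuousTrigger) => w
    case (_, _: ContinuousTrigger) =>
      throw new UnsupportedOperationException(
        s"Data source $source does not support continuous writing")
    // Microbatch triggers use the V2 microbatch writer when available...
    case (w: MicroBatchWriteSupport, _) => w
    // ...and otherwise fall back to the V1 Sink path, since V1 sinks never
    // implement MicroBatchWriteSupport.
    case _ => createV1Sink()
  }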

query.stop()
}

private def testUnsupportedOperationCase(
Contributor

Just rename to testNegativeCase

.trigger(trigger)
.start()

eventually(timeout(streamingTimeout)) {
Contributor
@tdas tdas Jan 17, 2018

This is the case we should avoid: start() should fail if there is a source/sink mismatch, but here start() is not failing. We must validate the source and sink earlier to prevent such a query from being started.

Contributor Author

For the reasons I discussed earlier, this is hard to do and won't fit in this PR.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86232 has finished for PR 20243 at commit c0ec93f.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86235 has finished for PR 20243 at commit 516fd4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86284 has finished for PR 20243 at commit 2916010.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Jan 18, 2018

LGTM, assuming tests pass.

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86296 has finished for PR 20243 at commit 278eeb4.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86298 has finished for PR 20243 at commit f3c170e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 18, 2018

Author: Jose Torres <[email protected]>

Closes #20243 from jose-torres/console-sink.

(cherry picked from commit 1c76a91)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in 1c76a91 Jan 18, 2018