
[SPARK-22908] Add kafka source and sink for continuous processing. #20096

Closed
jose-torres wants to merge 55 commits from jose-torres/continuous-kafka

Conversation

jose-torres
Contributor

What changes were proposed in this pull request?

Add Kafka source and sink for continuous processing (a usage sketch follows the list below). This involves two small changes to the execution engine:

  • Bring data reader close() into the normal data reader thread to avoid thread safety issues.
  • Fix up the semantics of the RECONFIGURING StreamExecution state. State updates are now atomic, and we don't have to deal with swallowing an exception.
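For context, here is a minimal usage sketch of the kind of query this enables — not taken from the PR's diff; the broker address, topic names, and checkpoint path are placeholders.

```scala
// Hypothetical end-to-end sketch: read from Kafka, write back to Kafka, with a
// continuous trigger. Broker address, topic names, and checkpoint path are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-kafka-sketch").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "in-topic")
  .load()

val query = input
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/continuous-kafka-checkpoint")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```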

How was this patch tested?

new unit tests

@jose-torres
Contributor Author

jose-torres commented Dec 27, 2017

/cc @zsxwing

@zsxwing
Member

zsxwing commented Dec 27, 2017

add to whitelist

@SparkQA

SparkQA commented Dec 27, 2017

Test build #85444 has finished for PR 20096 at commit 7596e34.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2017

Test build #85448 has finished for PR 20096 at commit 607b902.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ContinuousKafkaSuite extends KafkaSourceTest with SharedSQLContext
  • class ContinuousKafkaStressSuite extends KafkaSourceTest with SharedSQLContext

@SparkQA

SparkQA commented Dec 28, 2017

Test build #85455 has finished for PR 20096 at commit bcaa694.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2017

Test build #85479 has finished for PR 20096 at commit db2dc93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ContinuousKafkaWriter(

@SparkQA

SparkQA commented Dec 30, 2017

Test build #85536 has finished for PR 20096 at commit df194c6.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@tdas left a comment (Contributor)

Round 1 of review. Some changes required.

import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.types.UTF8String

class ContinuousKafkaReader(
Contributor

Add docs explaining params.

Contributor

Can we rename this and other classes such that all Kafka classes start with "Kafka"?

Contributor

Also, don't forget to rename the file accordingly.

/** Companion object of the [[KafkaSourceOffset]] */
private[kafka010] object KafkaSourceOffset {

def getPartitionOffsets(offset: Offset): Map[TopicPartition, Long] = {
def getPartitionOffsets(offset: LegacyOffset): Map[TopicPartition, Long] = {
Contributor

nit: can we use OffsetV1 or something like that to make the difference more obvious

import org.apache.spark.unsafe.types.UTF8String

class ContinuousKafkaReader(
kafkaReader: KafkaOffsetReader,
Contributor

Please rename this to offsetReader or maybe offsetFetcher to distinguish this from all the Reader classes in DataSourceV2

offsets.partitionToOffsets
}

private def fetchAndVerify(specificOffsets: Map[TopicPartition, Long]) = {
Contributor

This function seems to be a duplicate of the one in KafkaSource. Can you dedup? Maybe move this into the KafkaOffsetReader?

* called in StreamExecutionThread. Otherwise, interrupting a thread while running
* `KafkaConsumer.poll` may hang forever (KAFKA-1894).
*/
private lazy val initialPartitionOffsets = {
Contributor

is this used anywhere??

Contributor Author

oops, no

if (failedWrite != null) return

val projectedRow = projection(row)
val topic = projectedRow.getUTF8String(0)
Contributor

This `topic` variable shadows the constructor param `topic`. I know this pattern was present in the KafkaWriterTask, but let's not repeat the mistakes of the past; we can fix KafkaWriterTask once we migrate it to v2.
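(For illustration only, not the PR's code: a hypothetical writer class showing the shadowing problem and the trivial rename that avoids it.)

```scala
// Hypothetical sketch of the naming issue above: a distinct local name keeps the
// constructor parameter `topic` reachable instead of being shadowed inside the method.
class KafkaRowWriterSketch(topic: Option[String]) {
  def topicFor(rowTopic: String): String = {
    // Before the rename, a local `val topic = ...` here would shadow the ctor param.
    topic.getOrElse(rowTopic)
  }
}
```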

options = extraOptions.toMap,
partitionColumns = normalizedParCols.getOrElse(Nil))
val sink = trigger match {
case _: ContinuousTrigger =>
Contributor

Note for the future: All the checks for compatibility of sources and sinks wrt the specified trigger should be consolidated in one location. Right now the checks are spread between this file and StreamingQueryManager.

assertAwaitThread()
def notDone = {
val localCommittedOffsets = committedOffsets
!localCommittedOffsets.contains(source) || localCommittedOffsets(source) != newOffset
if (sources.length <= sourceIndex) {
false
Contributor

Why false? Shouldn't this throw an exception?

Contributor Author

Sources is a var which might not be populated yet. (This race condition showed up in AddKafkaData in my tests.)

Contributor

The race condition is present because `sources` is initialized to Seq.empty and then assigned to the actual sources. You can actually initialize `sources` to null, and then return notDone = false when sources is null. Any other mismatch should throw an error. I don't like this current code which hides erroneous situations.
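(A minimal sketch of the suggested pattern, with illustrative names rather than the actual StreamExecution members: a null `sources` means "not populated yet", while any index mismatch after population fails loudly.)

```scala
// Sketch only: `sources` starts as null so "not initialized yet" can be told apart from a
// genuine bug; once populated, an out-of-range index throws instead of being swallowed.
object AwaitOffsetSketch {
  @volatile private var sources: Seq[AnyRef] = null

  def notDone(sourceIndex: Int): Boolean = {
    if (sources == null) {
      false // query initialization has not assigned the sources yet
    } else if (sourceIndex >= sources.length) {
      throw new IllegalStateException(
        s"Source index $sourceIndex out of range for ${sources.length} registered sources")
    } else {
      true // placeholder for the real committedOffsets comparison
    }
  }
}
```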

@@ -255,17 +255,21 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
}
}

case _ => throw new AnalysisException(s"$cls does not support data writing.")
case _ => saveToV1Source()
Contributor

This section got more complicated with this change. Add more comments above this section explaining the fallback policy.

}

Dataset.ofRows(sparkSession, DataSourceV2Relation(reader))
if (reader == null) {
Contributor

This section got more complicated with this change. Add more comments above this section explaining the fallback policy.

@SparkQA

SparkQA commented Jan 5, 2018

Test build #85699 has finished for PR 20096 at commit dae3a09.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85876 has finished for PR 20096 at commit 9101ea6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85887 has finished for PR 20096 at commit a3aaf27.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85875 has finished for PR 20096 at commit f825155.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas left a comment (Contributor)

The tests look great now. A few more missing tests here and there and then we will be good to go.

"continuous-stream-test-sql-context",
sparkConf.set("spark.sql.testkey", "true")))

override protected def setTopicPartitions(
Contributor

Add a comment on what this method does. It is asserting something, so it does not look like it only "sets" something.

import org.apache.spark.sql.streaming.{StreamTest, Trigger}
import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession}

trait KafkaContinuousTest extends KafkaSourceTest {
Contributor

Add docs to explain what this class is for.

Contributor

Also, since this is used not just by the source but also by the sink, it would be better to define this in a different file.

}
}

class KafkaContinuousSourceSuite extends KafkaSourceSuiteBase with KafkaContinuousTest {
Contributor

Add docs.

Contributor

The { } may not be needed.

}

protected def makeSureGetOffsetCalled = AssertOnQuery { q =>
// Because KafkaSource's initialPartitionOffsets is set lazily, we need to make sure
// its "getOffset" is called before pushing any data. Otherwise, because of the race contion,
// its "getOffset" is called before pushing any data. Otherwise, because of the race contOOion,
Contributor

spelling mistake?

Contributor Author

I remember wondering this morning why my command-O key sequence wasn't working... I guess this is where it went.


import testImplicits._

test("(de)serialization of initial offsets") {
Contributor

Is this needed in the common KafkaSourceSuiteBase?

)
)
StartStream(),
StopStream)
}

test("cannot stop Kafka stream") {
Contributor

Is this needed in the KafkaSourceSuiteBase?

Contributor Author

I think it makes sense to have a common test verifying the basic "start a stream and then stop it" flow, to provide a clear failure in case it's just completely broken by some change.
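(Sketched below: what such a start-then-stop sanity check might look like against Spark's StreamTest harness. This is not the suite's actual code; the rate source stands in for the real Kafka setup, which needs a running broker.)

```scala
// A minimal start/stop sanity check, sketched against the StreamTest DSL already visible
// in this diff. The rate source is a stand-in; the real test reads from a Kafka topic.
import org.apache.spark.sql.streaming.StreamTest
import org.apache.spark.sql.test.SharedSQLContext

class StartStopSanitySuite extends StreamTest with SharedSQLContext {
  test("basic start and stop") {
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()
    testStream(df)(
      StartStream(),
      StopStream
    )
  }
}
```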

.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
val query = kafka.map(kv => kv._2.toInt).writeStream.foreach(new ForeachWriter[Int] {
protected def startStream(ds: Dataset[Int]) = {
Contributor

I think this factoring is not needed. startStream() is not used anywhere other than in this test, so I don't see a point in refactoring it to define it outside the test.

Contributor Author

startStream is overridden in the continuous version of this test.

@@ -0,0 +1,135 @@
/*
Contributor

Rename this file to KafkaContinuousSourceSuite

}
}

test("streaming - write data with bad schema") {
Contributor

Missing tests for "w/o topic field, with topic option" and "topic field and topic option", and also a test for the case when the topic field is null.
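(For illustration only, a hedged batch analogue of the first missing case — rows with no `topic` column routed via the `topic` option; the suite's streaming tests would go through its own Kafka test utilities. The broker address and topic name are placeholders.)

```scala
// Illustration only (not the suite's test): writing rows that carry no `topic` column,
// relying on the `topic` option to route them. Broker and topic are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-sink-topic-option-sketch").getOrCreate()
import spark.implicits._

Seq(("k1", "v1"), ("k2", "v2")).toDF("key", "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("topic", "sink-topic")
  .save()
```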

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85888 has finished for PR 20096 at commit 9158af2.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85892 has finished for PR 20096 at commit f94b53e.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85904 has finished for PR 20096 at commit f94b53e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -39,6 +39,15 @@ private[continuous] sealed trait EpochCoordinatorMessage extends Serializable
*/
private[sql] case object IncrementAndGetEpoch extends EpochCoordinatorMessage

/**
Contributor

@zsxwing Can you take a look at these changes in this file?

Member

looks good to me

@tdas left a comment (Contributor)

Looks good as long as tests pass

@jose-torres
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 10, 2018

Test build #85926 has finished for PR 20096 at commit f94b53e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Jan 10, 2018

retest this please

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85934 has finished for PR 20096 at commit f94b53e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 11, 2018
## What changes were proposed in this pull request?

Add kafka source and sink for continuous processing. This involves two small changes to the execution engine:

* Bring data reader close() into the normal data reader thread to avoid thread safety issues.
* Fix up the semantics of the RECONFIGURING StreamExecution state. State updates are now atomic, and we don't have to deal with swallowing an exception.

## How was this patch tested?

new unit tests

Author: Jose Torres <[email protected]>

Closes #20096 from jose-torres/continuous-kafka.

(cherry picked from commit 6f7aaed)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit closed this in 6f7aaed Jan 11, 2018
@sameeragarwal
Member

@jose-torres @zsxwing @tdas as discussed, this is causing a number of build timeouts. I'm going to revert this for now to de-flake the builds and we can add it back once it's fixed.
