[SPARK-18120][SQL] Call QueryExecutionListener callback methods for … #16664
Conversation
…DataFrameWriter methods

QueryExecutionListener has two methods, onSuccess() and onFailure(), that take a QueryExecution object as a parameter and get called when a query is executed. They are called for several of the Dataset methods like take, head, first, and collect, but are not called for any of the DataFrameWriter methods like saveAsTable and save. This commit fixes that and calls these two methods from the DataFrameWriter output methods. It also adds a new property, "spark.sql.queryExecutionListeners", that can be used to specify instances of QueryExecutionListener that should be attached to the SparkSession when the Spark application starts up. Testing was done using unit tests.
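For readers unfamiliar with the listener API, here is a minimal sketch of implementing and attaching one. The QueryExecutionListener trait and its two callback signatures are Spark's existing API; the class name MetricsListener and the println bodies are illustrative only, and the config key is the one this PR proposes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Illustrative listener: logs how long each query took, or why it failed.
class MetricsListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName succeeded in ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
}

val spark = SparkSession.builder()
  .master("local[*]")
  // The property proposed by this PR: listeners listed here are attached at startup.
  .config("spark.sql.queryExecutionListeners", classOf[MetricsListener].getName)
  .getOrCreate()

// Listeners can also be registered programmatically on an existing session.
spark.listenerManager.register(new MetricsListener)
```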
ok to test
Test build #71741 has finished for PR 16664 at commit
   * called.
   */
  private def executeAndCallQEListener(
      funcName: String,
Formatting is wrong here.
   *
   * @param funcName A identifier for the method executing the query
   * @param qe the @see [[QueryExecution]] object associated with the
   *           query
Fits in the previous line.
/cc @liancheng
Test build #72177 has finished for PR 16664 at commit
Test build #72178 has finished for PR 16664 at commit
@yhuai @marmbrus @liancheng Can someone review my PR, please? Thanks.
@yhuai @marmbrus @liancheng If none of you are going to take a look, I'll give the code another pass and not wait for your feedback before pushing.
I think @sameeragarwal plans to review. I glanced and it looks fine.
Only minor things.
   * specified by using the @see [[org.apache.spark.sql.DataFrameWriter#option]] method
   * @param writeParams will contain any extra information that the write method wants to provide
   */
  case class OutputParams(
Add @DeveloperApi.
      Seq(1 -> 100).toDF("x", "y").write.saveAsTable("bar")
    }
    assert(onWriteSuccessCalled)
    spark.listenerManager.clear()
This needs to be in a finally block, no?
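A minimal sketch of what the reviewer is asking for, reusing the names from the snippet above; clear() then runs even if the write or the assertion throws:

```scala
spark.listenerManager.register(testQueryExecutionListener)
try {
  Seq(1 -> 100).toDF("x", "y").write.saveAsTable("bar")
  assert(testQueryExecutionListener.onWriteSuccessCalled)
} finally {
  // Always detach the listener so it cannot leak into later tests.
  spark.listenerManager.clear()
}
```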
      callSaveFunction(Seq(1 -> 100).toDF("x", "y"), path.getAbsolutePath)
    }
    assert(testQueryExecutionListener.onWriteSuccessCalled)
    spark.listenerManager.clear()
Same here. Feels like it should be in SharedSQLContext.afterEach.
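A sketch of that alternative, assuming SharedSQLContext follows the usual ScalaTest BeforeAndAfterEach pattern:

```scala
protected override def afterEach(): Unit = {
  try {
    // Detach any listeners a test registered, once, for every test.
    spark.listenerManager.clear()
  } finally {
    super.afterEach()
  }
}
```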
...
docs/sql-programming-guide.md (outdated)
@@ -1302,8 +1302,9 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp

 ## Other Configuration Options

-The following options can also be used to tune the performance of query execution. It is possible
-that these options will be deprecated in future release as more optimizations are performed automatically.
+The following options can also be used to tune the performance of query execution and attaching
I don't think this new option belongs in this section. It has nothing to do with performance and this description now sounds weird. A separate section for it would be better, even if it's the only option there.
  val SESSION_LOCAL_TIMEZONE =
    SQLConfigBuilder("spark.sql.session.timeZone")
      .doc("""The ID of session local timezone, e.g. "GMT", "America/Los_Angeles", etc.""")
      .stringConf
      .createWithDefault(TimeZone.getDefault().getID())

Nit: Please remove this empty line
   * methods.
   *
   * @param funcName A identifier for the method executing the query
   * @param qe the @see [[QueryExecution]] object associated with the query
Could you please fix the doc by following what #16013 did?
@marmbrus
@@ -190,6 +192,32 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
   }

+  /**
+   * Executes the query and calls the {@link org.apache.spark.sql.util.QueryExecutionListener}
+   * methods.
How about changing it to:

Wrap a DataFrameWriter action to track the QueryExecution and time cost, then report to the user-registered callback functions.
   * @param action the function that executes the query after which the listener methods gets
   *               called.
   */
  private def executeAndCallQEListener(
How about renaming it withAction? It is more consistent.
I believe you are saying rename the method executeAndCallQEListener to withAction?
Yes.
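For context, a sketch of the renamed helper under discussion. The timing-plus-callback shape mirrors Dataset's existing withAction; the extra OutputParams argument and the extended listener-manager hooks are this PR's proposal, so the exact signatures here are assumptions:

```scala
private def withAction(
    funcName: String,
    qe: QueryExecution,
    outputParams: OutputParams)(action: => Unit): Unit = {
  try {
    val start = System.nanoTime()
    action
    val end = System.nanoTime()
    // Extended onSuccess carrying the write metadata (per this PR's API change).
    df.sparkSession.listenerManager.onSuccess(funcName, qe, end - start, Some(outputParams))
  } catch {
    case e: Exception =>
      df.sparkSession.listenerManager.onFailure(funcName, qe, e, Some(outputParams))
      throw e
  }
}
```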
  val QUERY_EXECUTION_LISTENERS =
    ConfigBuilder("spark.sql.queryExecutionListeners")
      .doc("QueryExecutionListeners to be attached to the SparkSession")
Can you improve this line? Add what you wrote in the sql-programming-guide.md?
In this case I updated the doc to read "A comma-separated list of classes that implement QueryExecutionListener that will be attached to the SparkSession". I could attach the whole line I put in sql-programming-guide.md but it will make it look out of place compared to the docs for other properties in the same class.
We do not have a separate document for the Spark SQL configuration. We expect users to do it using the command set -v. This command will output the contents of doc.
@@ -514,6 +576,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
   * shorten names(none, `snappy`, `gzip`, and `lzo`). This will override
   * `spark.sql.parquet.compression.codec`.</li>
   * </ul>
+  * Calls the callback methods in @see[[QueryExecutionListener]] methods after query execution with
+  * @see[[OutputParams]] having datasourceType set as string constant "parquet" and
+  * destination set as the path to which the data is written
I think we do not need to add these comments to all the functions.
      df.queryExecution,
      OutputParams(source, destination, extraOptions.toMap)) {
      dataSource.write(mode, df)
    }
Nit: the style issue.
withAction("save", df.queryExecution, OutputParams(source, destination, extraOptions.toMap)) {
dataSource.write(mode, df)
}
      qe,
      new OutputParams(source, Some(tableIdent.unquotedString), extraOptions.toMap)) {
      qe.toRdd
    }
Nit: also the style issue.
val outputParms = OutputParams(source, Some(tableIdent.unquotedString), extraOptions.toMap)
withAction("insertInto", qe, outputParms)(qe.toRdd)
"saveAsTable", | ||
qe, | ||
new OutputParams(source, Some(tableIdent.unquotedString), extraOptions.toMap)) { | ||
qe.toRdd |
No need to call new here. Please follow the above example. Thanks!
@@ -660,12 +660,21 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val QUERY_EXECUTION_LISTENERS =
I think we can put it into StaticSQLConf.
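A sketch of what that move might look like, following the builder style StaticSQLConf uses for session-static configs (buildStaticConf and the exact doc text are assumptions here):

```scala
val QUERY_EXECUTION_LISTENERS = buildStaticConf("spark.sql.queryExecutionListeners")
  .doc("A comma-separated list of classes that implement QueryExecutionListener " +
    "that will be attached to the SparkSession.")
  .stringConf
  .toSequence
  .createOptional
```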
    executeAndCallQEListener(
      "saveAsTable",
      qe,
      new OutputParams(source, Some(tableIdent.unquotedString), extraOptions.toMap)) {
source? Why not use a qualified table name?
source reflects the data source type to which the data is written, so for the parquet() and csv() methods it will be "parquet" and "csv". For saveAsTable(), should it be "hive" or "db", since a qualified table name is not actually a data source type?
I got your point. source looks ok to me.
I just quickly went over the code. It looks ok to me, but I will review it again when the comments are resolved. Thanks!
    executeAndCallQEListener(
      "save",
      df.queryExecution,
      OutputParams(source, destination, extraOptions.toMap)) {
When the source is JDBC, you will also pass credentials. Be careful with this.
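One way to act on this caution, sketched below: redact credential-like keys before the options reach any listener. The key names and the helper are illustrative, not part of the PR:

```scala
// Keys whose values should never be surfaced to listeners (illustrative list).
private val sensitiveKeys = Set("user", "password")

private def redactOptions(options: Map[String, String]): Map[String, String] =
  options.map { case (k, v) =>
    if (sensitiveKeys.contains(k.toLowerCase)) k -> "*********" else k -> v
  }

// e.g. OutputParams(source, destination, redactOptions(extraOptions.toMap))
```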
Test build #72329 has finished for PR 16664 at commit
.config("spark.sql.queryExecutionListeners", classOf[NoZeroArgConstructorListener].getName) | ||
.getOrCreate() | ||
} | ||
assert(!SparkSession.getDefaultSession.isDefined) |
assert(SparkSession.getDefaultSession.isEmpty)
.config("spark.sql.queryExecutionListeners", "non.existent.QueryExecutionListener") | ||
.getOrCreate() | ||
} | ||
assert(!SparkSession.getDefaultSession.isDefined) |
The same here: isEmpty.
      qe: QueryExecution,
      durationNs: Long,
      options: Option[OutputParams]
      ): Unit = {}
Nit: -> options: Option[OutputParams]): Unit = {}
Could you update the PR title to
Just finished this round of reviews. Thanks! This PR enables the QueryExecutionListener when users use the DataFrameWriter methods. However, it still misses the other code paths, especially the DDL statements. For example, CTAS when using the
I think it's ok to enable the listener for
Why do we need the new config
      qe: QueryExecution,
      outputParams: OutputParams)(action: => Unit) = {
    try {
      val start = System.nanoTime()
Dataset.withAction will reset metrics of physical plans; shall we do it here? And can we create a general function for both Dataset and DataFrameWriter?
   * @param writeParams will contain any extra information that the write method wants to provide
   */
  @DeveloperApi
  case class OutputParams(
It looks reasonable to provide more information to the listeners for write operations. However, this will be public, I think we should think about it more carefully to get a better design, can we do it later?
Sorry, the arguments to this class seem to have been picked pretty randomly. Can you explain why these parameters were picked?
Sorry, I'm really confused, probably because I haven't kept track of this PR. But the diff doesn't match the PR description. Are we fixing a bug here or introducing a bunch of new APIs? Actually, we are not only introducing new APIs, we are also breaking old APIs in this patch. Please separate the bug fix part from the API-changing part.
@@ -1300,10 +1300,28 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp

 </table>

+## QueryExecutionListener Options
This seems like a change completely unrelated to the bug fix.
I actually disagree that this particular change should be a separate PR. Part of exposing these new queries to the listener is providing information about what these queries are doing, and the current (developer) API does not have a way to expose that. We can discuss ways of exposing this information that don't break the existing API (I thought about a couple, but I didn't like any of them, so my preference was to just modify the existing developer API). But I strongly feel the bug fix is not complete without this information being exposed in some way.
That's probably because you are not familiar with the SQL component. The existing API already has references to the QueryExecution object, which actually includes all of the information your compatibility-breaking API is currently exposing.
The QueryExecution object doesn't have details related to the output metadata. For example, if I call df.write.parquet("/my/path"), the path to which the DataFrame is written, i.e. "/my/path", is not available in the QueryExecution object.
That's fair but not what I was told; if that's the case then great, but I'll let Salil comment since he's looked at this code way more than I have.
It does. It contains the entire plan.
@rxin, knowing the entire plan is not enough; it would be better if we also have these write options (provider, partitioning, extraOptions, etc.).
When I was working on this PR the output path wasn't there, but if you are confident that it is there then it might have been added recently. I can check and get back to you.
I think that's a separate "bug" we should fix, i.e. DataFrameWriter should use InsertIntoDataSourceCommand so we can consolidate the two paths.
Basically, I see no reason to add some specific parameter to a listener API that is meant to be generic and already contains a reference to QueryExecution. What are you going to do the next time you want to find some other information with a take or collect query? Do you go in and add another interface-breaking change for that? If the goal is to expose information for writing data out properly, then just make it work with the existing interface and fix the issue that using DataFrameWriter doesn't call the callback (and doesn't have the correct information set in QueryExecution).
Actually @cloud-fan, are you sure it is a problem right now? DataSource.write itself creates the commands, and if the information is propagated correctly, the QueryExecution object should have an InsertIntoHadoopFsRelationCommand.
Yea we should fix that.
Does that mean the information would show up in the plan? That would be great.
@vanzin yes, InsertXXX command will carry all the write options.
@cloud-fan From what I understand, we need to modify the InsertXXX commands to carry all the write options instead of the change suggested in this PR. Right now the QueryExecution object doesn't carry any of the output options. Am I correct?
@salilsurendran Yes, and we can send another PR to fix the InsertXXX command problem.
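Once the InsertXXX commands carry the write options, a listener could recover them from the plan itself rather than from a new OutputParams API. A hedged sketch against the Spark 2.x internals; treat the class and field names as assumptions of that era:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.execution.command.ExecutedCommandExec
import org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand
import org.apache.spark.sql.util.QueryExecutionListener

// Illustrative listener that reads the output path out of the physical plan.
class OutputPathListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    qe.sparkPlan.collectFirst {
      case ExecutedCommandExec(cmd: InsertIntoHadoopFsRelationCommand) =>
        println(s"$funcName wrote to ${cmd.outputPath}")
    }
    ()
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}
```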
…methods for DataFrameWriter methods

We only notify `QueryExecutionListener` for several `Dataset` operations, e.g. collect, take, etc. We should also do the notification for `DataFrameWriter` operations.

New regression test.

Close apache#16664

Author: Wenchen Fan <[email protected]>

Closes apache#16962 from cloud-fan/insert.
…DataFrameWriter methods
What changes were proposed in this pull request?
QueryExecutionListener has two methods, onSuccess() and onFailure(), that take a QueryExecution object as a parameter and get called when a query is executed. They are called for several of the Dataset methods like take, head, first, and collect, but are not called for any of the DataFrameWriter methods like saveAsTable and save. This commit fixes that and calls these two methods from the DataFrameWriter output methods.

Also, a new property, "spark.sql.queryExecutionListeners", was added that can be used to specify instances of QueryExecutionListener that should be attached to the SparkSession when the Spark application starts up.
How was this patch tested?
Testing was done using unit tests contained in two suites. The unit tests can be executed by:
test-only *SparkSQLQueryExecutionListenerSuite
test-only *DataFrameCallbackSuite