[SPARK-24250][SQL] support accessing SQLConf inside tasks #21299
Conversation
…sed only on the driver" This reverts commit a4206d5.
}.getOrElse(CreateJacksonParser.internalRow(_: JsonFactory, _: InternalRow))

-      JsonInferSchema.infer(rdd, parsedOptions, rowParser)
+      JsonInferSchema.infer(sampled, parsedOptions, CreateJacksonParser.string)
@HyukjinKwon @MaxGekk can you take a look at the json changes? Thanks!
The String-based reader of JacksonParser can be slower than the specialized UTF8StreamJsonParser, which can be used when you pass an array of bytes (even though it spends some time on encoding detection). You can check that with JsonBenchmarks.
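For illustration, here is a hedged sketch of the two parser-construction paths this thread contrasts; the Jackson class names are real, but the wiring around them is simplified:

import java.io.{ByteArrayInputStream, InputStreamReader}
import com.fasterxml.jackson.core.JsonFactory

val factory = new JsonFactory()
val bytes = """{"a": 1}""".getBytes("UTF-8")
// Byte-array input: Jackson can detect the encoding and pick the
// specialized UTF8StreamJsonParser internally.
val byteParser = factory.createParser(bytes)
// Reader input: forces the generic reader-based parser, which the
// benchmarks in this thread found to be slower.
val readerParser = factory.createParser(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"))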
I reran the benchmarks on your branch. There is a difference in the schema-inference benchmarks:
JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
No encoding 46348 / 46679 2.2 463.5 1.0X
UTF-8 is set 45651 / 45731 2.2 456.5 1.0X
Before your changes:
JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------
No encoding 38902 / 39282 2.6 389.0 1.0X
UTF-8 is set 56959 / 57261 1.8 569.6 0.7X
As I wrote above, the "array-based" parser is faster than the "string-based" one, but the "string-based" parser is faster than the "reader-based" one.
import org.apache.spark.{TaskContext, TaskContextImpl}
import org.apache.spark.internal.config.{ConfigEntry, ConfigProvider, ConfigReader}

class ReadOnlySQLConf(context: TaskContext) extends SQLConf {
This is nice :)
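For context, one plausible way a read-only conf can resolve values on executors is through a ConfigProvider backed by the task's local properties; this is a hedged sketch based on the imports above, not necessarily the exact wiring in this PR:

// Hypothetical provider: looks up config values in the task's local
// properties, which the driver populated before submitting the job.
class TaskContextConfigProvider(context: TaskContext) extends ConfigProvider {
  override def get(key: String): Option[String] = Option(context.getLocalProperty(key))
}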
  override def clear(): Unit = {
    throw new UnsupportedOperationException("Cannot mutate ReadOnlySQLConf.")
  }
}
Do we need to allow clone? clone will create a mutable SQLConf.
I don't think we need to clone or copy SQLConf in tasks; let's ban it.
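A minimal sketch of what banning it could look like, assuming clone and copy are overridable on SQLConf like the mutators above:

override def clone(): SQLConf = {
  throw new UnsupportedOperationException("Cannot clone/copy ReadOnlySQLConf.")
}

override def copy(entries: (ConfigEntry[_], Any)*): SQLConf = {
  throw new UnsupportedOperationException("Cannot clone/copy ReadOnlySQLConf.")
}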
Test build #90511 has finished for PR 21299 at commit
Test build #90512 has finished for PR 21299 at commit
Test build #90508 has finished for PR 21299 at commit
Test build #90519 has finished for PR 21299 at commit
Test build #90520 has finished for PR 21299 at commit
Retest this please.
val allConfigs = sparkSession.sessionState.conf.getAllConfs
allConfigs.foreach {
  // Excludes external configs defined by users.
  case (key, value) if key.startsWith("spark") => sc.setLocalProperty(key, value)
This causes a scala.MatchError. We need to cover the other case, too.
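A sketch of the suggested fix: add a catch-all case so that keys not starting with "spark" are simply skipped instead of throwing:

allConfigs.foreach {
  // Excludes external configs defined by users.
  case (key, value) if key.startsWith("spark") => sc.setLocalProperty(key, value)
  // Without this catch-all, any other key throws scala.MatchError.
  case _ =>
}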
Test build #90542 has finished for PR 21299 at commit
-    val sd = getStreamDecoder(enc, binary, binary.length)
-    jsonFactory.createParser(sd)
  }
Why were these two lines removed? It looks like no SQLConf is involved here.
val allConfigs = sparkSession.sessionState.conf.getAllConfs
allConfigs.foreach {
  // Excludes external configs defined by users.
  case (key, value) if key.startsWith("spark") => sc.setLocalProperty(key, value)
Should we only propagate config values that have been set to something other than the default value?
Oh, I see. getAllConfs only returns configs that have been explicitly set.
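A quick way to see that behavior from the public API (the config key is just for illustration):

// spark.conf.getAll delegates to getAllConfs: only explicitly set
// entries show up; unset configs silently fall back to their defaults.
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.getAll.foreach { case (k, v) => println(s"$k = $v") }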
@@ -898,7 +898,6 @@ object SparkSession extends Logging {
    * @since 2.0.0
    */
   def getOrCreate(): SparkSession = synchronized {
-    assertOnDriver()
Does this mean we can now create a SparkSession outside the driver?
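For reference, the removed check presumably looked something like this sketch: creation fails whenever a TaskContext is present, i.e. inside a task:

private def assertOnDriver(): Unit = {
  if (TaskContext.get != null) {
    // A non-null TaskContext means we are running inside a task on an executor.
    throw new IllegalStateException("SparkSession should only be created and accessed on the driver.")
  }
}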
Test build #90549 has finished for PR 21299 at commit
retest this please
Test build #90550 has finished for PR 21299 at commit
The idea looks good to me; it should be very useful to have this feature.
    body
  } finally {
    sc.setLocalProperty(SQLExecution.EXECUTION_ID_KEY, oldExecutionId)
    allConfigs.foreach {
      case (key, _) => sparkSession.sparkContext.setLocalProperty(key, null)
Shouldn't this be set back to the original value?
Good point, although it's very unlikely that users set SQL configs as local properties. Let me change it.
Test build #90568 has finished for PR 21299 at commit
Retest this please.
Test build #90570 has finished for PR 21299 at commit
retest this please
Test build #90587 has finished for PR 21299 at commit
Test build #90590 has finished for PR 21299 at commit
also cc @squito
…sed only on the driver" This reverts commit a4206d5. This is from apache#21299 and to ease the review of it. Author: Wenchen Fan <[email protected]> Closes apache#21341 from cloud-fan/revert.
def withSQLConfPropagated[T](sparkSession: SparkSession)(body: => T): T = {
  // Set all the specified SQL configs to local properties, so that they can be available at
  // the executor side.
Properties are serialized per task. How unusual would it be for there to be a large list of properties? If that would be reasonable, then it might make more sense to use a Broadcast.
(Separately, task serialization should probably avoid re-serializing the properties every time, but this could make that existing issue much worse.)
Technically a broadcast is faster than local properties if there are a lot of properties, but one problem is that you need to carry the broadcast handle everywhere, which I don't think is applicable to SQLConf.get.
BTW, we currently have hundreds of SQL configs; even if a user set all of them for a job, the overhead is low. I tried

sc.makeRDD(Seq(1, 2, 3)).collect
1.to(100).foreach(i => sc.setLocalProperty(i.toString * 10, i.toString * 10))
sc.makeRDD(Seq(1, 2, 3)).collect

and didn't observe a performance difference.
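To put rough numbers on such an experiment, a hedged sketch of a timing harness around the same calls (the time helper is ad hoc, not a Spark API):

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time("baseline collect") { sc.makeRDD(Seq(1, 2, 3)).collect() }
1.to(100).foreach(i => sc.setLocalProperty(i.toString * 10, i.toString * 10))
time("collect with 100 large local properties") { sc.makeRDD(Seq(1, 2, 3)).collect() }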
We can avoid serializing local properties for each task, but that's a general optimization for local properties; we can do it in another PR.
Test build #90681 has finished for PR 21299 at commit
Test build #90683 has finished for PR 21299 at commit
  }
}

def withSQLConfPropagated[T](sparkSession: SparkSession)(body: => T): T = {
Maybe it's cleaner in the following way:
def withSQLConfPropagated[T](sparkSession: SparkSession)(body: => T): T = {
val sc = sparkSession.sparkContext
// Set all the specified SQL configs to local properties, so that they can be available at
// the executor side.
val allConfigs = sparkSession.sessionState.conf.getAllConfs
val originalLocalProps = allConfigs.collect {
case (key, value) if key.startsWith("spark") =>
val originalValue = sc.getLocalProperty(key)
sc.setLocalProperty(key, value)
(key, originalValue)
}
try {
body
} finally {
originalLocalProps.foreach {
case (key, value) => sc.setLocalProperty(key, value)
}
}
}
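For what it's worth, a hedged usage sketch of the helper: wrap an action so that executors see the session's SQL configs as local properties for the duration of the body (the DataFrame df and the action are illustrative):

withSQLConfPropagated(sparkSession) {
  // Inside the body, tasks can rebuild a read-only SQLConf from the
  // propagated local properties, so SQLConf.get works on executors.
  df.rdd.map(row => row.length).count()
}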
LGTM
Test build #90786 has finished for PR 21299 at commit
thanks, merging to master!
Hi, @cloud-fan. cc @gatorsmile
I've reverted it, will re-submit it soon.
What changes were proposed in this pull request?
Previously in #20136 we decided to forbid tasks from accessing SQLConf, because it doesn't work and always gives you the default conf value. In #21190 we fixed the check and all the places that violated it.
Currently the pattern for accessing configs at the executor side is: read the configs at the driver side, then access the variables holding the config values in the RDD closure, so that they will be serialized to the executor side. Something like the sketch below.
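(A hedged reconstruction of the pattern; the config entry and the process function are illustrative.)

// Driver side: read the config value into a local variable.
val threshold = sparkSession.sessionState.conf.getConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD)
rdd.map { row =>
  // Executor side: threshold was captured in the closure and serialized
  // with the task, so no SQLConf access is needed here.
  process(row, threshold)
}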
However, this pattern is hard to apply if the config needs to be propagated through a long call stack. An example is DataType.sameType; see how many changes were made in #21190. When it comes to code generation, it's even worse: I tried it locally, and we would need to change a ton of files to propagate configs to the code generators.
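To make the pain concrete, a hedged sketch of the two styles with simplified signatures (not the actual DataType code):

// Without task-side SQLConf: the flag must be threaded through every
// layer that eventually calls sameType.
def sameType(a: DataType, b: DataType, caseSensitive: Boolean): Boolean = ???

// With task-side SQLConf: read the flag exactly where it is needed.
def sameType(a: DataType, b: DataType): Boolean = {
  val caseSensitive = SQLConf.get.caseSensitiveAnalysis
  ???
}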
This PR proposes to allow tasks to access SQLConf. The idea is, we can save all the SQL configs to job properties when a SQL execution is triggered. At the executor side we rebuild the SQLConf from the job properties.
How was this patch tested?
A new test suite.
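For flavor, the kind of check such a suite could contain; this is a hedged sketch, and the config key and exact execution path are illustrative:

spark.conf.set("spark.sql.shuffle.partitions", "33")
// Dataset actions run inside a SQL execution, so the configs are
// propagated as local properties and visible via SQLConf.get in tasks.
spark.range(1).foreach { _ =>
  assert(SQLConf.get.getConfString("spark.sql.shuffle.partitions") == "33")
}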