[SPARK-22387][SQL] Propagate session configs to data source read/write options #19861

Closed · wants to merge 14 commits from jiangxb1987/datasource-configs into master

Conversation

@jiangxb1987 (Contributor) commented Dec 1, 2017

What changes were proposed in this pull request?

Introduce a new interface, SessionConfigSupport, for DataSourceV2. It propagates session configs with the specified key prefix to all data source operations in this session.
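
For illustration, a minimal sketch of what opting in looks like for a data source (the class and prefix names here are hypothetical; the interface shape follows the final form agreed on below):

    // A data source that opts in to session config propagation. With
    // keyPrefix "myds", a session config such as
    //   spark.datasource.myds.endpoint -> host:port
    // is stripped to endpoint -> host:port and merged into the options
    // passed to every read/write against this source in the session.
    class MyDataSource extends DataSourceV2 with SessionConfigSupport {
      override def keyPrefix(): String = "myds"
    }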

How was this patch tested?

Added a new test suite, DataSourceV2UtilsSuite.

@SparkQA commented Dec 1, 2017

Test build #84377 has finished for PR 19861 at commit eaa6cae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author):

cc @cloud-fan

* @return an immutable map that contains all the session configs that should be propagated to
* the data source.
*/
def withSessionConfig(

Member:

These helper functions need to be moved to the org.apache.spark.sql.execution.datasources.v2 package. This will be called by the SQL API code path.

Another, more straightforward, option is to provide it via ConfigSupport. WDYT? cc @cloud-fan

val options = dataSource match {
  case cs: ConfigSupport =>
    val confs = withSessionConfig(cs, sparkSession.sessionState.conf)
    new DataSourceV2Options((confs ++ extraOptions).asJava)

Member:

What happens if they have duplicate names?

@jiangxb1987 (Contributor, Author):

Good catch! Should the configs in extraOptions have a higher priority? WDYT @cloud-fan?

Contributor:

Yea, extraOptions needs higher priority.
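
To make the priority rule concrete, a standalone Scala sketch (the keys and values are made up):

    // Map ++ Map gives the right-hand operand priority on duplicate keys,
    // so appending extraOptions last lets per-query options win over
    // propagated session configs.
    val sessionConfs = Map("path" -> "/from/session/conf")
    val extraOptions = Map("path" -> "/from/query/option")
    val merged = sessionConfs ++ extraOptions
    assert(merged("path") == "/from/query/option")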

* Create a list of key-prefixes, all session configs that match at least one of the prefixes
* will be propagated to the data source options.
*/
List<String> getConfigPrefixes();

Contributor:

we need to think about the current use cases and validate this API. E.g. the CSV and JSON data sources both accept an option columnNameOfCorruptRecord, or the session config spark.sql.columnNameOfCorruptRecord. From that we get the following information:

  1. Mostly, a session config maps to an existing option.
  2. Session configs are always prefixed with spark.sql; we should not ask the data source to always specify it.
  3. Do we really need to support more than one prefix?
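
To make the columnNameOfCorruptRecord example concrete (a hypothetical snippet illustrating today's built-in behavior, not code from this PR; the path is made up):

    // Per-query: set the option on one read.
    spark.read.option("columnNameOfCorruptRecord", "_bad").json("/tmp/data.json")

    // Per-session: the built-in CSV/JSON sources fall back to this config
    // when the option is absent.
    spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_bad")
    spark.read.json("/tmp/data.json")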

@SparkQA commented Dec 4, 2017

Test build #84433 has finished for PR 19861 at commit ec5723c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 6, 2017

Test build #84563 has finished for PR 19861 at commit 0dd7f2e.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 6, 2017

Test build #84565 has finished for PR 19861 at commit 8329a6b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 7, 2017

Test build #84603 has finished for PR 19861 at commit 6b4fcab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author):

retest this please

@SparkQA commented Dec 7, 2017

Test build #84607 has finished for PR 19861 at commit 6b4fcab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    source: String,
    conf: SQLConf): immutable.Map[String, String] = {
  val prefixes = cs.getConfigPrefixes
  require(prefixes != null, "The config key-prefixes cann't be null.")

Contributor:

double n: "cann't" should be "can't"

require(prefixes != null, "The config key-prefixes cann't be null.")
val mapping = cs.getConfigMapping.asScala
val validOptions = cs.getValidOptions
require(validOptions != null, "The valid options list cann't be null.")

Contributor:

double n: "cann't" should be "can't"

@SparkQA commented Dec 11, 2017

Test build #84712 has finished for PR 19861 at commit ec9a717.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author):

retest this please

@SparkQA commented Dec 12, 2017

Test build #84745 has finished for PR 19861 at commit ec9a717.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author):

retest this please

@SparkQA commented Dec 12, 2017

Test build #84776 has finished for PR 19861 at commit ec9a717.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* propagate session configs with chosen key-prefixes to the particular data source.
*/
@InterfaceStability.Evolving
public interface ConfigSupport {

Contributor:

nit: SessionConfigSupport


/**
* A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
* propagate session configs with chosen key-prefixes to the particular data source.

@cloud-fan (Contributor) commented Dec 13, 2017:

propagate session configs with the specified key-prefix to all data source operations in this session

* `spark.datasource.$name`, turn `spark.datasource.$name.xxx -> yyy` into
* `xxx -> yyy`, and propagate them to all data source operations in this session.
*/
String name();

Contributor:

how about keyPrefix?

@SparkQA commented Dec 15, 2017

Test build #84941 has finished for PR 19861 at commit f7d5a4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataSourceV2UtilsSuite extends SparkFunSuite

val keyPrefix = cs.keyPrefix()
require(keyPrefix != null, "The data source config key prefix can't be null.")

val pattern = Pattern.compile(s"^spark\\.datasource\\.$keyPrefix\\.(.*)")

Contributor:

nit: `(.*)` -> `(.+)`, just to forbid corner cases like `spark.datasource.$keyPrefix.`.
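
A quick standalone check of that corner case (a sketch; the prefix is hypothetical):

    import java.util.regex.Pattern

    val keyPrefix = "myds"
    val pattern = Pattern.compile(s"^spark\\.datasource\\.$keyPrefix\\.(.+)")

    // (.+) requires at least one character after the prefix's trailing dot:
    assert(pattern.matcher("spark.datasource.myds.foo.bar").matches()) // group(1) would be "foo.bar"
    assert(!pattern.matcher("spark.datasource.myds.").matches())      // rejected, unlike with (.*)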

@SparkQA commented Dec 15, 2017

Test build #84952 has finished for PR 19861 at commit 5292329.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Dec 15, 2017

Test build #84961 has finished for PR 19861 at commit 5292329.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

@cloud-fan (Contributor):

The SparkR tests have been pretty flaky recently...

@SparkQA commented Dec 15, 2017

Test build #84965 has finished for PR 19861 at commit 5292329.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author):

retest this please

@SparkQA commented Dec 16, 2017

Test build #84988 has finished for PR 19861 at commit 5292329.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Dec 16, 2017

Test build #84995 has finished for PR 19861 at commit 5292329.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

@SparkQA commented Dec 18, 2017

Test build #85049 has finished for PR 19861 at commit 5292329.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Dec 18, 2017

Test build #85060 has finished for PR 19861 at commit 5292329.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@asfgit closed this in 9c289a5 on Dec 21, 2017
@jiangxb1987 deleted the datasource-configs branch on Dec 21, 2017

@rdblue (Contributor) commented Jan 23, 2018:

@jiangxb1987, @cloud-fan, what was the use case you needed to add this for?

@jiangxb1987 (Contributor, Author):

With SessionConfigSupport, you can use data source session configs more easily:

  1. All configs with the name prefix spark.datasource.$keyPrefix will be imported into the data source options, so you can access them directly from the data source options;
  2. You can use the stripped name for the options, which saves some typing, e.g. foo.bar instead of spark.datasource.$keyPrefix.foo.bar.
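
A sketch of both points, with a hypothetical prefix and option name:

    // Suppose the source's keyPrefix() returns "myds".
    spark.conf.set("spark.datasource.myds.foo.bar", "baz")

    // Any later read/write against that source in this session then sees
    // the stripped key foo.bar -> baz in its options, without repeating
    // .option("foo.bar", "baz") on every call.
    spark.read.format("myds").load()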

@rdblue (Contributor) commented Jan 23, 2018:

@jiangxb1987, I understand what this does. I just wanted an example use case where it was necessary. What was the motivating use case?

@cloud-fan (Contributor):

@rdblue This is also what we already have for built-in data sources, e.g. spark.sql.parquet.compression.codec, spark.sql.parquet.filterPushdown, etc.

@cloud-fan (Contributor):

Basically we want per-query options, which can be specified via DataFrameReader/Writer.option, and we also want per-session options, which save users from typing the same options again and again within one session.
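
For contrast, a sketch using a real built-in parquet option (the paths are made up):

    // Per-query: applies to this read only.
    spark.read.option("mergeSchema", "true").parquet("/data/events")

    // Per-session: applies to every parquet read in this session.
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    spark.read.parquet("/data/events")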

@rdblue (Contributor) commented Jan 23, 2018:

Thanks for the example, @cloud-fan.

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
[SPARK-22387][SQL] Propagate session configs to data source read/write options

Introduce a new interface `SessionConfigSupport` for `DataSourceV2`. It propagates session configs with the specified key-prefix to all data source operations in this session.

Add new test suite `DataSourceV2UtilsSuite`.

Author: Xingbo Jiang <[email protected]>

Closes apache#19861 from jiangxb1987/datasource-configs.

7 participants