
[SPARK-18817] [SPARKR] [SQL] Set default warehouse dir to tempdir #16290

Closed · wants to merge 9 commits
R/pkg/R/sparkR.R: 7 additions & 0 deletions
@@ -363,6 +363,13 @@ sparkR.session <- function(
...) {

sparkConfigMap <- convertNamedListToEnv(sparkConfig)

# NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
# See SPARK-18817 for more details
if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
assign("spark.sql.default.warehouse.dir", tempdir(), envir = sparkConfigMap)
Member:

I think we could just use sparkConfigMap[["spark.sql.warehouse.default.dir"]] <- tempdir()

@felixcheung (Member), Dec 15, 2016:

I think we should move this after L383, "overrideEnvs(sparkConfigMap, paramMap)", in case spark.sql.warehouse.default.dir is passed in as a named param and explicitly set to something other than the tmp dir.

Contributor Author:

Moved this below L383. We still need the exists check to make sure we don't overwrite any user-provided value, right?

}

namedParams <- list(...)
if (length(namedParams) > 0) {
paramMap <- convertNamedListToEnv(namedParams)
R/pkg/inst/tests/testthat/test_sparkSQL.R: 8 additions & 0 deletions
@@ -2165,6 +2165,14 @@ test_that("SQL error message is returned from JVM", {
expect_equal(grepl("blah", retError), TRUE)
})

test_that("Default warehouse dir should be set to tempdir", {
# nothing should be written outside tempdir() without explicit user permission
initial_working_directory_files <- list.files()
Member:

If the warehouse dir ("spark-warehouse") is already there before running this test, then the list of files won't change, right?

Comment:

Does Jenkins start with a new workspace every time it runs a test?

Member:

I'm referring to other tests in this file, test_sparkSQL, that call APIs that might already initialize the warehouse dir.

sparkR.session() is called at the top. Does this createOrReplaceTempView cause the warehouse dir to be created?

https://github.com/shivaram/spark-1/blob/25834109588e8e545deafb1da162958766a057e2/R/pkg/inst/tests/testthat/test_sparkSQL.R#L570

@felixcheung (Member), Dec 15, 2016:

From my test, the spark-warehouse directory is created when I run a <- createDataFrame(iris), so I think by the time this test runs, the directory will already be there.

Contributor Author:

Yeah, I think @felixcheung's point is right: the dir should be created early on. Also, we sometimes configure hive.metastore.dir in our tests, so I don't see it come up when I run the test cases. I'll try to see if we can design a different test case.

Contributor Author:

I refactored this test to recursively list all the files and check whether spark-warehouse is among them. We could also check whether the specific table is in the listing.

result <- sql("CREATE TABLE warehouse")
expect_equal(initial_working_directory_files, list.files())
result <- sql("DROP TABLE warehouse")
})

irisDF <- suppressWarnings(createDataFrame(iris))

test_that("Method as.data.frame as a synonym for collect()", {
@@ -819,7 +819,7 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {

def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)

- def warehousePath: String = new Path(getConf(StaticSQLConf.WAREHOUSE_PATH)).toString
+ def warehousePath: String = new Path(getConf(StaticSQLConf.DEFAULT_WAREHOUSE_PATH)).toString

def ignoreCorruptFiles: Boolean = getConf(IGNORE_CORRUPT_FILES)

@@ -964,11 +964,17 @@ object StaticSQLConf {
}
}

- val WAREHOUSE_PATH = buildConf("spark.sql.warehouse.dir")
-   .doc("The default location for managed databases and tables.")
+ val DEFAULT_WAREHOUSE_PATH = buildConf("spark.sql.default.warehouse.dir")
Member:

Should we make it internal?

Contributor Author:

I am not familiar with this part of the code base. What are the consequences of making it internal? Is it just a matter of what shows up in documentation, or does it affect how users can use it?

Member:

An internal configuration will not be printed out. For example, you can try something like:

spark.sql("SET -v").show(numRows = 200, truncate = false)

.doc("The default location for managed databases and tables. " +
"Used if spark.sql.warehouse.dir is not set")
.stringConf
.createWithDefault(Utils.resolveURI("spark-warehouse").toString)
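
For reference, a minimal sketch of the internal variant suggested above, reusing the same builder pattern that CATALOG_IMPLEMENTATION uses further down; this is an illustration, not code from the PR:

  // Hypothetical: the same declaration marked internal(), which hides it
  // from "SET -v" output while keeping identical behavior.
  val DEFAULT_WAREHOUSE_PATH = buildConf("spark.sql.default.warehouse.dir")
    .internal()
    .doc("The default location for managed databases and tables. " +
      "Used if spark.sql.warehouse.dir is not set")
    .stringConf
    .createWithDefault(Utils.resolveURI("spark-warehouse").toString)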

+ val WAREHOUSE_PATH = buildConf("spark.sql.warehouse.dir")
+   .doc("The location for managed databases and tables.")
Member:

The description is not right: spark.sql.warehouse.dir is still the default location when we create a database/table without providing a location value.

Contributor Author:

That's a good point. I misunderstood the meaning of "default" there. Will fix this now.

+   .stringConf
+   .createOptional
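
Because WAREHOUSE_PATH is now built with createOptional, reading it yields an Option[String] rather than a String, which is why the SharedState change below calls .isDefined and .get. A small sketch of the resulting read patterns, with types inferred from those call sites:

  // WAREHOUSE_PATH only has a value if the user set it explicitly.
  val explicitDir: Option[String] = sparkContext.conf.get(StaticSQLConf.WAREHOUSE_PATH)
  // DEFAULT_WAREHOUSE_PATH always resolves, falling back to its built-in default.
  val defaultDir: String = sparkContext.conf.get(StaticSQLConf.DEFAULT_WAREHOUSE_PATH)
  // Prefer the explicit setting when present.
  val warehouseDir: String = explicitDir.getOrElse(defaultDir)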

val CATALOG_IMPLEMENTATION = buildConf("spark.sql.catalogImplementation")
.internal()
.stringConf
@@ -55,14 +55,19 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {
s"is set. Setting ${WAREHOUSE_PATH.key} to the value of " +
s"hive.metastore.warehouse.dir ('$hiveWarehouseDir').")
hiveWarehouseDir
- } else {
+ } else if (sparkContext.conf.contains(WAREHOUSE_PATH.key) &&
+     sparkContext.conf.get(WAREHOUSE_PATH).isDefined) {
Member:

Nit: the indentation is not right.

Contributor Author:

Indented 4 spaces now

// If spark.sql.warehouse.dir is set, we will override hive.metastore.warehouse.dir using
// the value of spark.sql.warehouse.dir.
- // When neither spark.sql.warehouse.dir nor hive.metastore.warehouse.dir is set,
- // we will set hive.metastore.warehouse.dir to the default value of spark.sql.warehouse.dir.
- val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH)
+ val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH).get
sparkContext.conf.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
sparkWarehouseDir
+ } else {
+   // When neither spark.sql.warehouse.dir nor hive.metastore.warehouse.dir is set,
+   // we will set hive.metastore.warehouse.dir to the value of spark.sql.default.warehouse.dir.
+   val sparkDefaultWarehouseDir = sparkContext.conf.get(DEFAULT_WAREHOUSE_PATH)
+   sparkContext.conf.set("hive.metastore.warehouse.dir", sparkDefaultWarehouseDir)
+   sparkDefaultWarehouseDir
}

}
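
Taken together, the branches above resolve the warehouse location with this precedence: an explicitly set spark.sql.warehouse.dir wins, then hive.metastore.warehouse.dir, then the spark.sql.default.warehouse.dir fallback. A standalone sketch of that ordering, with the SparkConf plumbing reduced to plain Options (illustrative, not PR code):

  // Resolution order implemented in SharedState above, simplified.
  def resolveWarehouseDir(
      sparkWarehouseDir: Option[String],  // spark.sql.warehouse.dir (now optional)
      hiveWarehouseDir: Option[String],   // hive.metastore.warehouse.dir
      defaultWarehouseDir: String         // spark.sql.default.warehouse.dir
  ): String = {
    sparkWarehouseDir
      .orElse(hiveWarehouseDir)
      .getOrElse(defaultWarehouseDir)
  }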
@@ -221,6 +221,19 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
.sessionState.conf.warehousePath.stripSuffix("/"))
}

test("changing default value of warehouse path") {
Member:

Currently, this test case covers only one of the four cases: spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is not set. We also need to check the other three cases:

  • spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is not set
  • spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is set
  • spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is set

Contributor Author:

Good point. Added tests for all 4 cases now

try {
val newWarehouseDefault = "spark-warehouse2"
val newWarehouseDefaultPath = new Path(Utils.resolveURI(newWarehouseDefault)).toString
sparkContext.conf.set("spark.sql.default.warehouse.dir", newWarehouseDefaultPath)
val spark = new SparkSession(sparkContext)
assert(newWarehouseDefaultPath.stripSuffix("/") === spark
.sessionState.conf.warehousePath.stripSuffix("/"))
Member:

We also need a check on spark.sharedState.warehousePath, because that is where we made the logic changes.

Contributor Author:

Done

} finally {
sparkContext.conf.remove("spark.sql.default.warehouse.dir")
}
}
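
The commit that adds coverage for the four combinations discussed above is not shown in this hunk. A hedged sketch of what such a matrix could look like, reusing this test's setup; the paths are illustrative, and the assertion targets spark.sharedState.warehousePath since that is where the resolution logic lives:

  // Illustrative sketch, not the PR's actual test code. Assumes
  // hive.metastore.warehouse.dir is not set in the test environment.
  val defaultDir = new Path(Utils.resolveURI("spark-warehouse2")).toString
  val explicitDir = new Path(Utils.resolveURI("spark-warehouse3")).toString
  val builtIn = new Path(Utils.resolveURI("spark-warehouse")).toString

  Seq(
    (None, None, builtIn),                              // neither is set
    (Some(defaultDir), None, defaultDir),               // only the default dir is set
    (None, Some(explicitDir), explicitDir),             // only the warehouse dir is set
    (Some(defaultDir), Some(explicitDir), explicitDir)  // both set: the explicit dir wins
  ).foreach { case (default, explicit, expected) =>
    try {
      default.foreach(sparkContext.conf.set("spark.sql.default.warehouse.dir", _))
      explicit.foreach(sparkContext.conf.set("spark.sql.warehouse.dir", _))
      val spark = new SparkSession(sparkContext)
      assert(expected.stripSuffix("/") ===
        spark.sharedState.warehousePath.stripSuffix("/"))
    } finally {
      sparkContext.conf.remove("spark.sql.default.warehouse.dir")
      sparkContext.conf.remove("spark.sql.warehouse.dir")
    }
  }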

test("MAX_CASES_BRANCHES") {
withTable("tab1") {
spark.range(10).write.saveAsTable("tab1")