[SPARK-18817][SPARKR][SQL] change derby log output to temp dir #16330
Conversation
```
l <- max(length(filesBefore), length(filesAfter))
length(filesBefore) <- l
length(filesAfter) <- l
expect_equal(sort(filesBefore, na.last = TRUE), sort(filesAfter, na.last = TRUE))
```
this will fail until merging with #16290 to move the spark-warehouse dir
System.getProperty("derby.system.home") == null) { | ||
// This must be set before SparkContext is instantiated. | ||
System.setProperty("derby.system.home", | ||
sparkEnvirMap.get("spark.sql.default.derby.dir").toString) |
Is there a better, more general (than R) place for this, in SparkConf or sql?
It sounds like only R needs it. Not sure whether we should treat it as an internal SQLConf. If we do not want to put it into SQLConf, we need to define it in an Object, instead of hard-coding the string in all the places.
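For illustration, a minimal sketch of that suggestion - the object name and placement here are hypothetical, not from this PR:

```
// Hypothetical holder object: define the key once so core, SQL, and the
// R bridge all reference this constant instead of repeating the raw string.
object RDerbyConf {
  val DEFAULT_DERBY_DIR_KEY: String = "spark.sql.default.derby.dir"
}
```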
R/pkg/R/sparkR.R (outdated)

```
@@ -381,6 +381,10 @@ sparkR.session <- function(
    deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
  }

  if (!exists("spark.sql.default.derby.dir", envir = sparkConfigMap)) {
```
What is spark.sql.default.derby.dir?
The metastore might not always be derby, right?
it might not be, but if you refer to the JIRA, our goal is only to change the derby output path since it is the default metastore
I see. This PR is just to fix the default behavior for releasing SparkR on CRAN.
We need to add code comments to explain what it is for the future code maintainer/reader.
Test build #70324 has finished for PR 16330 at commit

Jenkins, retest this please
Test build #70332 has finished for PR 16330 at commit

that's weird, I'm seeing a lot of seemingly unrelated flaky test failures lately?

jenkins, retest this please
Test build #70337 has finished for PR 16330 at commit
```
@@ -104,6 +104,12 @@ class SparkHadoopUtil extends Logging {
    }
    val bufferSize = conf.get("spark.buffer.size", "65536")
    hadoopConf.set("io.file.buffer.size", bufferSize)

    if (conf.contains("spark.sql.default.derby.dir")) {
```
Why do we need to introduce this flag?
@yhuai Spark by default has derby for the metastore. Generally metastore_db and derby.log get created by default in the current directory. This creates a problem for more restrictive environments, such as when running as an R package, where the guideline is not to have anything written to the user's space (unless under tempdir).

Just checking now, it also seems to be the case when running the pyspark shell.

It looks like this is the new behavior since 2.0.0. Would it make sense if we always default derby/metastore to tempdir, unless it is running in an application directory that would be cleaned out when the job is done (e.g. YARN cluster)?
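For context, a minimal sketch of the behavior being described - derby.system.home and java.io.tmpdir are standard Derby/JVM system properties, not new Spark config; when derby.system.home is unset, Derby places metastore_db and derby.log in the current working directory:

```
// Must run before the first SparkContext / metastore connection is created,
// because Derby reads this property once at boot.
if (System.getProperty("derby.system.home") == null) {
  // Redirect Derby output to the JVM temp dir instead of the user's cwd.
  System.setProperty("derby.system.home", System.getProperty("java.io.tmpdir"))
}
```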
Test build #70435 has finished for PR 16330 at commit

@yhuai Could you take another look at this?

Hi @gatorsmile @cloud-fan @yhuai could you give us some feedback on this? And #16290 please
The extra flag is needed only when using the Hive metastore. How about renaming the flag to spark.r.sql.default.derby.dir? So far, maybe we can set it as an internal configuration? cc @cloud-fan @yhuai
In the Hive metastore execution client, we face the same issue. See the code in HiveUtils.scala. These files will be dropped after the JVM is stopped. I am not sure how strict CRAN is.

thank you thank you @gatorsmile I'll update this.
Seems it's more reasonable to always use a temp location for derby and warehouse, not only Spark R. By doing this, no extra configs are needed and we don't need to touch Spark R. What do you think? @yhuai @gatorsmile

@cloud-fan That would be awesome. One thing to check though, I'm wondering if the reason for using a predictable location (current directory) is so that when starting another job it would re-use the metastore created before and so have access to the tables defined in a previous run? If that is the case, is it ok to have it in tempdir - which obviously might disappear? For instance, each R session has a unique directory that gets cleaned out when the session is terminated (although that tempdir is not currently passed to the JVM).

@cloud-fan I think what @felixcheung wants is to create both metadata and data files (derby and warehouse) in tempdir. However, if we do it by default, the directories in tempdir might be removed. That means, all the metadata and data files could be gone without notice.
hmm, this is actually an excellent point - this is more than metadata, it could be data files for the warehouse and so on. @gatorsmile @cloud-fan @shivaram Here's the actual policy wording:
yea that's a good point, if we use temp dir by default, then Spark may lose data without notice. So I'm not sure if we really want to do this in Spark R, maybe we can ask users to allow Spark R to write to the user's home filespace during installation?

It's a bit tricky to ask users for permission during installation (actually I'm not sure how we can create such an option?) -- I think a viable option could be to add a warning. @felixcheung I think it's worth a shot to try the CRAN submission process with such a warning and then revisit this if we still have a problem?
Right, I was thinking of adding a note into the API doc, and we could add a log too - to say what gets created when you call sparkR.session(). I think we still need to set the derby.log location to tempdir(). What do you think - @gatorsmile @cloud-fan @shivaram?

yea log file should be fine to put in temp dir.
Test build #74388 has finished for PR 16330 at commit
Test build #74391 has finished for PR 16330 at commit
Thanks @felixcheung - Some comments inline
```
@@ -127,6 +127,13 @@ private[r] object RRDD {
      sparkConf.setExecutorEnv(name.toString, value.toString)
    }

    if (sparkEnvirMap.containsKey("spark.r.sql.default.derby.dir") &&
```
It's a little awkward that this is set in RRDD. Is there a more general place we can set this across languages / runtimes (i.e. for Python / Scala as well)?
@cloud-fan @gatorsmile Any thoughts on this?
well, in revisiting this I thought it would be easier to minimize the impact by making this R only.
it would be much easier if we made the derby log go to tmpdir always, for all language bindings
Introduce "spark.sql.derby.system.home" in SQLConf as an internal config? Set the default to System.getProperty("derby.system.home")
? Then, in the R, we can set "derby.system.home" to tempdir()
?
Does it sound ok? @cloud-fan @rxin
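A sketch of what such an internal entry might look like - the config builder helper in SQLConf has changed names across Spark versions, so treat this as illustrative rather than the exact code:

```
// Inside org.apache.spark.sql.internal.SQLConf:
val DERBY_SYSTEM_HOME = buildConf("spark.sql.derby.system.home")
  .internal()
  .doc("Directory Derby uses as its system home (metastore_db, derby.log). " +
    "Defaults to the JVM-wide derby.system.home property, if set.")
  .stringConf
  .createWithDefault(System.getProperty("derby.system.home", ""))
```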
in further testing, I found that derby.system.home affects metastore_db too, so I'm changing to set derby.stream.error.file instead
setting derby.stream.error.file is all we need to move derby.log - I'd like to proceed with this change to make the cut for 2.1.1 release unless we have serious concern?
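For reference, a minimal sketch of that approach - derby.stream.error.file is a standard Derby property that relocates only derby.log and leaves metastore_db untouched; the actual change wires the R session's tempdir() through to the JVM, so java.io.tmpdir below is just a stand-in:

```
import java.io.File

// Must be set before the first metastore (and hence Derby) connection opens.
if (System.getProperty("derby.stream.error.file") == null) {
  val derbyLog = new File(System.getProperty("java.io.tmpdir"), "derby.log")
  System.setProperty("derby.stream.error.file", derbyLog.getAbsolutePath)
}
```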
```
compare_list <- function(list1, list2) {
  # get testthat to show the diff by first making the 2 lists equal in length
  expect_equal(length(list1), length(list2))
  l <- max(length(list1), length(list2))
```
The lengths should be equal if we get to this line? Or am I missing something?
the idea is to show enough information in the log without having to rerun the check manually to find out what is different.
the first check will show the numeric values but it wouldn't say how exactly they are different.
the next check (or moved to compare_list() here) will get testthat to dump the delta too, but first it must set the 2 lists to the same size etc.
in fact, all of these are well tested in the "Check masked functions" test in test_context.R, just duplicated here.
here's what it looks like:

```
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 22 - 23 == -1

2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
3/23 mismatches
x[21]: "unit-tests.out"
y[21]: "spark-warehouse"
x[22]: "WINDOWS.md"
y[22]: "unit-tests.out"
x[23]: NA
y[23]: "WINDOWS.md"
```
Got it - that sounds good
```
filesAfter <- list.files(path = file.path(Sys.getenv("SPARK_HOME"), "R"), all.files = TRUE)

expect_true(length(sparkHomeFileBefore) > 0)
compare_list(sparkHomeFileBefore, filesBefore)
```
I'm not sure what we are checking by having both sparkHomeFilesBefore and filesBefore -- Wouldn't just one of them do the job, and if not can we add a comment here?
I'm trying to catch a few things with this - will add some comments on it.
for instance:
- what's created by calling sparkR.session(enableHiveSupport = F) (every test except test_sparkSQL.R)
- what's created by calling sparkR.session(enableHiveSupport = T) (test_sparkSQL.R)

this unfortunately doesn't quite work as expected - it should have failed actually instead of passing - because we are running Scala tests before and they have caused spark-warehouse and metastore_db to be created already, before any R code is run.
reworking that now.
updated.
Test build #74409 has finished for PR 16330 at commit
Test build #74411 has finished for PR 16330 at commit
Sounds good to me.
The code changes are now very specific to R. Let me know if you still need me. : )

rebased and force pushed to retest

Great. I'll take a final look and wait for Jenkins
```
expect_true(length(sparkRFilesBefore) > 0)
# first, ensure derby.log is not there
expect_false("derby.log" %in% filesAfter)
# second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
```
I'm a little confused how these two setdiff commands map to with or without hive support. Can we make this a bit easier to understand?
agreed.
updated, hope it's better now.
Had a minor comment on the test case. LGTM otherwise and waiting for Jenkins
Test build #74788 has finished for PR 16330 at commit
Test build #74794 has finished for PR 16330 at commit
## What changes were proposed in this pull request?

Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log

## How was this patch tested?

Manually, unit tests

With this, these are relocated to under /tmp

```
# ls /tmp/RtmpG2M0cB/
derby.log
```

And they are removed automatically when the R session is ended.

Author: Felix Cheung <[email protected]>

Closes #16330 from felixcheung/rderby.

(cherry picked from commit 422aa67)
Signed-off-by: Felix Cheung <[email protected]>

merged to master and branch-2.1