
[SPARK-18817][SPARKR][SQL] change derby log output to temp dir #16330

Closed
wants to merge 7 commits

Conversation

@felixcheung (Member) commented Dec 18, 2016

What changes were proposed in this pull request?

Passes R tempdir() (this is the R session temp dir, shared with other temp files/dirs) to the JVM and sets a System property for the derby home dir, to move derby.log.

How was this patch tested?

Manually, unit tests

With this, these are relocated to under /tmp

# ls /tmp/RtmpG2M0cB/
derby.log

And they are removed automatically when the R session is ended.
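For illustration, here is a minimal Scala sketch of the JVM-side hook this change amounts to - not the merged code; the config key mirrors the ones used in the diffs below, and (per the discussion further down in this thread) derby.stream.error.file is the property that moves only derby.log:

```scala
import java.io.File
import java.util.{Map => JMap}

object DerbyLogRelocation {
  // Redirect derby.log into the R session temp dir passed over from R.
  // This must run before the SparkContext (and thus the metastore) is
  // instantiated, because derby reads the property once at startup.
  def relocateDerbyLog(sparkEnvirMap: JMap[Object, Object]): Unit = {
    val key = "spark.r.sql.default.derby.dir" // key name as in the RRDD diff below
    if (sparkEnvirMap.containsKey(key) &&
        System.getProperty("derby.stream.error.file") == null) {
      val log = new File(sparkEnvirMap.get(key).toString, "derby.log")
      System.setProperty("derby.stream.error.file", log.getAbsolutePath)
    }
  }
}
```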

# pad both lists to the same length so testthat prints an element-wise diff
l <- max(length(filesBefore), length(filesAfter))
length(filesBefore) <- l
length(filesAfter) <- l
# NA-padded entries sort last, so corresponding elements line up in the diff
expect_equal(sort(filesBefore, na.last = TRUE), sort(filesAfter, na.last = TRUE))
@felixcheung (Member Author) Dec 18, 2016:

This will fail until merged with #16290, which moves the spark-warehouse dir.

if (System.getProperty("derby.system.home") == null) {
  // This must be set before SparkContext is instantiated.
  System.setProperty("derby.system.home",
    sparkEnvirMap.get("spark.sql.default.derby.dir").toString)
}
@felixcheung (Member Author):

Is there a better, more general (less R-specific) place for this, in SparkConf or SQL?

(Member):

It sounds like only R needs it. I'm not sure whether we should treat it as an internal SQLConf.

If we do not want to put it into SQLConf, we need to define the key in an object instead of hard-coding the string everywhere.

@felixcheung (Member Author) commented Dec 18, 2016

@shivaram @yhuai @gatorsmile

R/pkg/R/sparkR.R Outdated
@@ -381,6 +381,10 @@ sparkR.session <- function(
deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
}

if (!exists("spark.sql.default.derby.dir", envir = sparkConfigMap)) {
(Member):

What is spark.sql.default.derby.dir?

(Member):

The metastore might not always be derby, right?

@felixcheung (Member Author):

It might not be, but per the JIRA, our goal is only to change the derby output path, since derby is the default metastore.

(Member):

I see. This PR is just to fix the default behavior for releasing SparkR on CRAN.

We need to add code comments to explain its purpose for future code maintainers/readers.

@SparkQA commented Dec 19, 2016

Test build #70324 has finished for PR 16330 at commit 338b3c4.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member Author):

Jenkins, retest this please

@SparkQA commented Dec 19, 2016

Test build #70332 has finished for PR 16330 at commit 338b3c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member Author):

That's weird - I'm seeing a lot of seemingly unrelated flaky test failures lately:

org.apache.spark.util.collection.ExternalSorterSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(ExternalSorterSuite.scala:32)
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)

@felixcheung (Member Author):

jenkins, retest this please

@SparkQA commented Dec 19, 2016

Test build #70337 has finished for PR 16330 at commit 338b3c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -104,6 +104,12 @@ class SparkHadoopUtil extends Logging {
}
val bufferSize = conf.get("spark.buffer.size", "65536")
hadoopConf.set("io.file.buffer.size", bufferSize)

if (conf.contains("spark.sql.default.derby.dir")) {
(Contributor):

Why do we need to introduce this flag?

@felixcheung (Member Author):

@yhuai
Spark uses derby for the metastore by default, and metastore_db and derby.log generally get created in the current directory. This is a problem in more restrictive environments, such as when running as an R package, where the guideline is that nothing should be written to the user's space (except under tempdir()).

Just checking now - this also seems to be the case when running the pyspark shell.

It looks like this is new behavior since 2.0.0. Would it make sense to always default derby/metastore output to tempdir, unless running in an application directory that is cleaned up when the job is done (e.g. a YARN cluster)?
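To make the default behavior concrete, here is an illustrative Scala snippet (not Spark code) showing how embedded derby resolves its output directory - derby.system.home if set, otherwise the JVM's working directory, which is why the files appear wherever the shell was launched:

```scala
// Where embedded derby writes when nothing is overridden: derby.system.home
// if it is set, otherwise the current working directory (user.dir).
val derbyHome = Option(System.getProperty("derby.system.home"))
  .getOrElse(System.getProperty("user.dir"))
println(s"metastore_db/ and derby.log would be created under: $derbyHome")
```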

@SparkQA commented Dec 21, 2016

Test build #70435 has finished for PR 16330 at commit 0ce9905.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor):

@yhuai Could you take another look at this ?

@felixcheung (Member Author):

Hi @gatorsmile @cloud-fan @yhuai, could you give us some feedback on this?
This is blocking the release of Spark to R/CRAN. We don't need a very detailed review - just your thoughts on whether this is heading in the right direction. Thanks.

And #16290, please.

@gatorsmile (Member):

The extra flag is needed only when using the Hive metastore. How about renaming the flag to spark.sql.hive.metastore.default.derby.dir? The value cannot be changed at runtime, so its best home would be the StaticSQLConf object.

For now, maybe we can set it as an internal configuration?

cc @cloud-fan @yhuai
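For reference, a sketch of what such an internal entry could look like, following the pattern of existing StaticSQLConf entries (the builder helper's name and exact chain vary across Spark versions, so treat this as an illustration of the suggestion, not merged code):

```scala
// Hypothetical internal static config, in the style of StaticSQLConf entries:
val HIVE_METASTORE_DERBY_DIR =
  buildStaticConf("spark.sql.hive.metastore.default.derby.dir")
    .internal()
    .doc("Directory used as derby.system.home for the default derby " +
      "metastore. Static: it cannot be changed at runtime.")
    .stringConf
    .createOptional
```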

@gatorsmile (Member):

We face the same issue in the Hive metastore execution client; see the code in HiveUtils.scala. Those files are dropped after the JVM is stopped.

I am not sure how strict CRAN is.

@felixcheung (Member Author):

Thank you @gatorsmile, I'll update this.
As for the metastore client: since by default it goes to tempdir (here), that's ok by CRAN. (IMO tempdir feels more right as a default than the current dir used in other places.)
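For context, a rough sketch of the HiveUtils-style behavior mentioned above (assumed semantics: a scratch dir under the system temp dir, deleted when the JVM stops; names here are illustrative):

```scala
import java.nio.file.Files

// Create a per-JVM scratch dir under java.io.tmpdir and register a shutdown
// hook that deletes it recursively when the JVM exits.
val scratchDir = Files.createTempDirectory("spark-metastore-").toFile
sys.addShutdownHook {
  def delete(f: java.io.File): Unit = {
    Option(f.listFiles).foreach(_.foreach(delete))
    f.delete()
  }
  delete(scratchDir)
}
```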

@cloud-fan (Contributor) commented Feb 23, 2017

It seems more reasonable to always use a temp location for derby and the warehouse, not only in SparkR. That way, no extra configs are needed and we don't need to touch SparkR at all. What do you think? @yhuai @gatorsmile

@felixcheung (Member Author) commented Feb 23, 2017

@cloud-fan That would be awesome. One thing to check, though: I wonder if the reason for using a predictable location (the current directory) is so that a later job re-uses the metastore created before, and so has access to the tables defined in a previous run? If that is the case, is it ok to have it in tempdir, which obviously might disappear?

For instance, each R session has a unique directory that gets cleaned out when the session is terminated (although that tempdir is not currently passed to the JVM).

@gatorsmile (Member) commented Feb 23, 2017

> when running as a R package when the guideline is not to have anything written to user's space (unless under tempdir)

@cloud-fan I think what @felixcheung wants is to create both metadata and data files (derby and warehouse) in tempdir. However, if we do that by default, the directories in tempdir might be removed - meaning all the metadata and data files could be gone without notice.

@felixcheung (Member Author) commented Feb 23, 2017

Hmm, this is actually an excellent point - this covers more than metadata; it could be data files for the warehouse and so on. @gatorsmile @cloud-fan @shivaram
I think in this case all of these are more akin to loading a library and calling a method to save your own data file into your user space. I'm not 100% sure how to interpret this.

Here's the actual policy wording:

- Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed.
Limited exceptions may be allowed in interactive sessions if the package obtains confirmation from the user.

@cloud-fan (Contributor):

Yea, that's a good point: if we use the temp dir by default, then Spark may lose data without notice. So I'm not sure we really want to do this for SparkR - maybe we can ask users to allow SparkR to write to the user's home filespace during installation?

@shivaram (Contributor):

It's a bit tricky to ask users for permission during installation (actually, I'm not sure how we could create such an option?). I think a viable option would be to add a logWarning that shows where SparkSQL data is going to be stored, with a pointer to how the location can be changed.

@felixcheung I think it's worth a shot to go through the CRAN submission process with such a warning, and revisit this if we still have a problem?

@felixcheung (Member Author):

Right - I was thinking of adding a note to the API doc, and we could add a log too, saying that calling sparkR.session() will create a metastore and a warehouse location.

I think we still need to set System.setProperty("derby.system.home") because that's where derby.log goes, and it is not user data but a log file.

What do you think - @gatorsmile @cloud-fan @shivaram ?

@cloud-fan (Contributor):

Yea, a log file should be fine to put in the temp dir.

@felixcheung felixcheung changed the title [SPARK-18817][SPARKR][SQL] change derby log output and metastore to temp dir [SPARK-18817][SPARKR][SQL] change derby log output to temp dir Mar 11, 2017
@SparkQA commented Mar 11, 2017

Test build #74388 has finished for PR 16330 at commit 8062ee1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 11, 2017

Test build #74391 has finished for PR 16330 at commit e5b69ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) left a comment:

Thanks @felixcheung - Some comments inline

@@ -127,6 +127,13 @@ private[r] object RRDD {
sparkConf.setExecutorEnv(name.toString, value.toString)
}

if (sparkEnvirMap.containsKey("spark.r.sql.default.derby.dir") &&
(Contributor):

It's a little awkward that this is set in RRDD. Is there a more general place where we can set this across languages / runtimes (i.e. for Python / Scala as well)?

@cloud-fan @gatorsmile Any thoughts on this ?

@felixcheung (Member Author) Mar 12, 2017:

Well, in revisiting this I thought it would be easier to minimize the impact by making this R-only.
It would be much simpler if we made the derby log always go to the temp dir for all language bindings.

(Member):

Introduce "spark.sql.derby.system.home" in SQLConf as an internal config? Set the default to System.getProperty("derby.system.home")? Then, in the R, we can set "derby.system.home" to tempdir()?

Does it sound ok? @cloud-fan @rxin

@felixcheung (Member Author) Mar 12, 2017:

In further testing, I found that derby.system.home affects metastore_db too, so I'm changing this to set derby.stream.error.file instead.

@felixcheung (Member Author):

Setting derby.stream.error.file is all we need to move derby.log. I'd like to proceed with this change to make the cut for the 2.1.1 release, unless there are serious concerns?

compare_list <- function(list1, list2) {
  # get testthat to show the diff by first making the 2 lists equal in length
  expect_equal(length(list1), length(list2))
  l <- max(length(list1), length(list2))
  length(list1) <- l
  length(list2) <- l
  expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE))
}
(Contributor):

The lengths should be equal if we get to this line ? Or am I missing something ?

@felixcheung (Member Author) Mar 12, 2017:

The idea is to show enough information in the log that we don't have to rerun the check manually to find out what is different.

The first check shows the lengths, but it wouldn't say how exactly the lists differ.
The next check (moved into compare_list() here) gets testthat to dump the delta too, but first the two lists must be made the same size.

In fact, all of this is well exercised by the "Check masked functions" test in test_context.R; it's just duplicated here.

@felixcheung (Member Author):

Here's what it looks like:

1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 22 - 23 == -1


2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
3/23 mismatches
x[21]: "unit-tests.out"
y[21]: "spark-warehouse"

x[22]: "WINDOWS.md"
y[22]: "unit-tests.out"

x[23]: NA
y[23]: "WINDOWS.md"

(Contributor):

Got it - that sounds good

filesAfter <- list.files(path = file.path(Sys.getenv("SPARK_HOME"), "R"), all.files = TRUE)

expect_true(length(sparkHomeFileBefore) > 0)
compare_list(sparkHomeFileBefore, filesBefore)
(Contributor):

I'm not sure what we are checking by having both sparkHomeFilesBefore and filesBefore - wouldn't just one of them do the job? And if not, can we add a comment here?

@felixcheung (Member Author):

I'm trying to catch a few things with this - will add some comments. For instance:

  1. what's created by calling sparkR.session(enableHiveSupport = F) (every test except test_sparkSQL.R)
  2. what's created by calling sparkR.session(enableHiveSupport = T) (test_sparkSQL.R)

Unfortunately this doesn't quite work as expected - it should actually have failed instead of passing - because the Scala tests run beforehand and have already caused spark-warehouse and metastore_db to be created, before any R code runs.

Reworking that now.

@felixcheung (Member Author):

Updated.

@SparkQA commented Mar 12, 2017

Test build #74409 has finished for PR 16330 at commit 9f3bb76.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 12, 2017

Test build #74411 has finished for PR 16330 at commit 7e8ffe7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Mar 16, 2017 via email

@gatorsmile (Member):

The code changes are now very specific to R. Let me know if you still need me. : )

@felixcheung (Member Author):

Rebased and force-pushed to retest.

@shivaram (Contributor):

Great. I'll take a final look and wait for Jenkins

expect_true(length(sparkRFilesBefore) > 0)
# first, ensure derby.log is not there
expect_false("derby.log" %in% filesAfter)
# second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
(Contributor):

I'm a little confused about how these two setdiff commands map to with or without Hive support. Can we make this a bit easier to understand?

@felixcheung (Member Author):

Agreed.
Updated - hope it's better now.

@shivaram (Contributor):

Had a minor comment on the test case. LGTM otherwise and waiting for Jenkins

@SparkQA commented Mar 18, 2017

Test build #74788 has finished for PR 16330 at commit 2eb75f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 19, 2017

Test build #74794 has finished for PR 16330 at commit ac9fbfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 19, 2017
Author: Felix Cheung <[email protected]>

Closes #16330 from felixcheung/rderby.

(cherry picked from commit 422aa67)
Signed-off-by: Felix Cheung <[email protected]>
@felixcheung (Member Author):

Merged to master and branch-2.1.
Thanks @shivaram @gatorsmile @cloud-fan

@asfgit asfgit closed this in 422aa67 Mar 19, 2017