[SPARK-18817][SPARKR][SQL] change derby log output to temp dir #16330
Conversation
```
l <- max(length(filesBefore), length(filesAfter))
length(filesBefore) <- l
length(filesAfter) <- l
expect_equal(sort(filesBefore, na.last = TRUE), sort(filesAfter, na.last = TRUE))
```
this will fail until merging with #16290 to move the spark-warehouse dir
System.getProperty("derby.system.home") == null) { | ||
// This must be set before SparkContext is instantiated. | ||
System.setProperty("derby.system.home", | ||
sparkEnvirMap.get("spark.sql.default.derby.dir").toString) |
Is there a better, more general (than R) place for this, in SparkConf or sql?
It sounds like only R needs it. Not sure whether we should treat it as an internal SQLConf. If we do not want to put it into SQLConf, we need to define it in an Object, instead of hard-coding the string in all the places.
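For illustration, a minimal sketch of that suggestion - the object name and placement here are hypothetical, not from this PR:

```
// Hypothetical holder object: define the key once so core, SQL, and the
// R bridge all reference this constant instead of repeating the raw string.
object RDerbyConf {
  val DEFAULT_DERBY_DIR_KEY: String = "spark.sql.default.derby.dir"
}
```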
R/pkg/R/sparkR.R (outdated)

```
@@ -381,6 +381,10 @@ sparkR.session <- function(
    deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
  }

  if (!exists("spark.sql.default.derby.dir", envir = sparkConfigMap)) {
```
What is spark.sql.default.derby.dir?
The metastore might not always be derby, right?
it might not be, but if you refer to the JIRA, our goal is only to change the derby output path since it is the default metastore
I see. This PR is just to fix the default behavior for releasing SparkR on CRAN.
We need to add code comments to explain what it is for the future code maintainer/reader.
Test build #70324 has finished for PR 16330 at commit

Jenkins, retest this please
Test build #70332 has finished for PR 16330 at commit

that's weird, I'm seeing a lot of seemingly unrelated flaky test failures lately?

jenkins, retest this please
Test build #70337 has finished for PR 16330 at commit
```
@@ -104,6 +104,12 @@ class SparkHadoopUtil extends Logging {
    }
    val bufferSize = conf.get("spark.buffer.size", "65536")
    hadoopConf.set("io.file.buffer.size", bufferSize)

    if (conf.contains("spark.sql.default.derby.dir")) {
```
Why do we need to introduce this flag?
@yhuai Spark by default has derby for the metastore. Generally metastore_db and derby.log get created by default in the current directory. This creates a problem for more restrictive environments, such as when running as an R package, where the guideline is not to have anything written to the user's space (unless under tempdir).

Just checking now, it also seems to be the case when running the pyspark shell.

It looks like this is the new behavior since 2.0.0. Would it make sense if we always default derby/metastore to tempdir, unless it is running in an application directory that would be cleaned out when the job is done (e.g. YARN cluster)?
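For context, a minimal sketch of the behavior being described - derby.system.home and java.io.tmpdir are standard Derby/JVM system properties, not new Spark config; when derby.system.home is unset, Derby places metastore_db and derby.log in the current working directory:

```
// Must run before the first SparkContext / metastore connection is created,
// because Derby reads this property once at boot.
if (System.getProperty("derby.system.home") == null) {
  // Redirect Derby output to the JVM temp dir instead of the user's cwd.
  System.setProperty("derby.system.home", System.getProperty("java.io.tmpdir"))
}
```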
Test build #70435 has finished for PR 16330 at commit

@yhuai Could you take another look at this?

Hi @gatorsmile @cloud-fan @yhuai could you give us some feedback on this? And #16290 please
The extra flag is needed only when using the Hive metastore. How about renaming the flag to spark.r.sql.default.derby.dir? So far, maybe we can set it as an internal configuration? cc @cloud-fan @yhuai
In the Hive metastore execution client, we face the same issue. See the code in HiveUtils.scala. These files will be dropped after the JVM is stopped. I am not sure how strict CRAN is.

thank you thank you @gatorsmile I'll update this.
Seems it's more reasonable to always use a temp location for derby and warehouse, not only Spark R. By doing this, no extra configs are needed and we don't need to touch Spark R. What do you think? @yhuai @gatorsmile

@cloud-fan That would be awesome. One thing to check though, I'm wondering if the reason for using a predictable location (current directory) is so that when starting another job it would re-use the metastore created before and so have access to the tables defined in a previous run? If that is the case, is it ok to have it in tempdir - which obviously might disappear? For instance, each R session has a unique directory that gets cleaned out when the session is terminated (although that tempdir is not currently passed to the JVM).

@cloud-fan I think what @felixcheung wants is to create both metadata and data files (derby and warehouse) in tempdir. However, if we do it by default, the directories in tempdir might be removed. That means, all the metadata and data files could be gone without notice.
hmm, this is actually an excellent point - this is more than metadata, it could be data files for the warehouse and so on. @gatorsmile @cloud-fan @shivaram Here's the actual policy wording:
yea that's a good point, if we use temp dir by default, then Spark may lose data without notice. So I'm not sure if we really want to do this in Spark R, maybe we can ask users to allow Spark R to write to the user's home filespace during installation?

It's a bit tricky to ask users for permission during installation (actually I'm not sure how we can create such an option?) -- I think a viable option could be to add a warning. @felixcheung I think it's worth a shot to try the CRAN submission process with such a warning and then revisit this if we still have a problem?
Right, I was thinking of adding a note into the API doc, and we could add a log too - to say what gets created when you call sparkR.session(). I think we still need to set the derby.log location to tempdir(). What do you think - @gatorsmile @cloud-fan @shivaram?

yea log file should be fine to put in temp dir.
Test build #74388 has finished for PR 16330 at commit
Test build #74391 has finished for PR 16330 at commit
Thanks @felixcheung - Some comments inline
```
@@ -127,6 +127,13 @@ private[r] object RRDD {
      sparkConf.setExecutorEnv(name.toString, value.toString)
    }

    if (sparkEnvirMap.containsKey("spark.r.sql.default.derby.dir") &&
```
It's a little awkward that this is set in RRDD. Is there a more general place we can set this across languages / runtimes (i.e. for Python / Scala as well)?
@cloud-fan @gatorsmile Any thoughts on this?
well, in revisiting this I thought it would be easier to minimize the impact by making this R only.
it would be much easier if we made the derby log go to tmpdir always, for all language bindings
Introduce "spark.sql.derby.system.home" in SQLConf as an internal config? Set the default to System.getProperty("derby.system.home")
? Then, in the R, we can set "derby.system.home" to tempdir()
?
Does it sound ok? @cloud-fan @rxin
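A sketch of what such an internal entry might look like - the config builder helper in SQLConf has changed names across Spark versions, so treat this as illustrative rather than the exact code:

```
// Inside org.apache.spark.sql.internal.SQLConf:
val DERBY_SYSTEM_HOME = buildConf("spark.sql.derby.system.home")
  .internal()
  .doc("Directory Derby uses as its system home (metastore_db, derby.log). " +
    "Defaults to the JVM-wide derby.system.home property, if set.")
  .stringConf
  .createWithDefault(System.getProperty("derby.system.home", ""))
```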
in further testing, I found that derby.system.home affects metastore_db too, so I'm changing to set derby.stream.error.file instead
setting derby.stream.error.file is all we need to move derby.log - I'd like to proceed with this change to make the cut for 2.1.1 release unless we have serious concern?
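For reference, a minimal sketch of that approach - derby.stream.error.file is a standard Derby property that relocates only derby.log and leaves metastore_db untouched; the actual change wires the R session's tempdir() through to the JVM, so java.io.tmpdir below is just a stand-in:

```
import java.io.File

// Must be set before the first metastore (and hence Derby) connection opens.
if (System.getProperty("derby.stream.error.file") == null) {
  val derbyLog = new File(System.getProperty("java.io.tmpdir"), "derby.log")
  System.setProperty("derby.stream.error.file", derbyLog.getAbsolutePath)
}
```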
```
compare_list <- function(list1, list2) {
  # get testthat to show the diff by first making the 2 lists equal in length
  expect_equal(length(list1), length(list2))
  l <- max(length(list1), length(list2))
```
The lengths should be equal if we get to this line? Or am I missing something?
the idea is to show enough information in the log without having to rerun the check manually to find out what is different.
the first check will show the numeric values but it wouldn't say how exactly they are different.
the next check (or moved to compare_list() here) will get testthat to dump the delta too, but first it must set the 2 lists to the same size etc.
in fact, all of these are well tested in the "Check masked functions" test in test_context.R, just duplicated here.
here's what it looks like:

```
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 22 - 23 == -1

2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
3/23 mismatches
x[21]: "unit-tests.out"
y[21]: "spark-warehouse"
x[22]: "WINDOWS.md"
y[22]: "unit-tests.out"
x[23]: NA
y[23]: "WINDOWS.md"
```
Got it - that sounds good
```
filesAfter <- list.files(path = file.path(Sys.getenv("SPARK_HOME"), "R"), all.files = TRUE)

expect_true(length(sparkHomeFileBefore) > 0)
compare_list(sparkHomeFileBefore, filesBefore)
```
I'm not sure what we are checking by having both sparkHomeFilesBefore and filesBefore -- Wouldn't just one of them do the job, and if not can we add a comment here?
I'm trying to catch a few things with this - will add some comments on it.
for instance:
- what's created by calling sparkR.session(enableHiveSupport = F) (every test except test_sparkSQL.R)
- what's created by calling sparkR.session(enableHiveSupport = T) (test_sparkSQL.R)

this unfortunately doesn't quite work as expected - it should have failed actually instead of passing - because we are running Scala tests before and they have caused spark-warehouse and metastore_db to be created already, before any R code is run.
reworking that now.
updated.
Test build #74409 has finished for PR 16330 at commit
Test build #74411 has finished for PR 16330 at commit
Sounds good to me.
The code changes are now very specific to R. Let me know if you still need me. : )

rebased and force pushed to retest

Great. I'll take a final look and wait for Jenkins
```
expect_true(length(sparkRFilesBefore) > 0)
# first, ensure derby.log is not there
expect_false("derby.log" %in% filesAfter)
# second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
```
I'm a little confused how these two setdiff commands map to with or without hive support. Can we make this a bit easier to understand?
agreed.
updated, hope it's better now.
Had a minor comment on the test case. LGTM otherwise and waiting for Jenkins
Test build #74788 has finished for PR 16330 at commit
Test build #74794 has finished for PR 16330 at commit
## What changes were proposed in this pull request?

Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log

## How was this patch tested?

Manually, unit tests

With this, these are relocated to under /tmp

```
# ls /tmp/RtmpG2M0cB/
derby.log
```

And they are removed automatically when the R session is ended.

Author: Felix Cheung <[email protected]>

Closes #16330 from felixcheung/rderby.

(cherry picked from commit 422aa67)
Signed-off-by: Felix Cheung <[email protected]>

merged to master and branch-2.1