[SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR. #10201
Test build #47335 has finished for PR 10201 at commit
@@ -1324,12 +1312,16 @@ setMethod("selectExpr",
#' path <- "path/to/file.json"
#' df <- jsonFile(sqlContext, path)
#' newDF <- withColumn(df, "newCol", df$col1 * 5)
#' # Replace an existing column
#' newDF2 <- withColumn(newDF, "newCol", newDF$col1)
I'm not 100% sure about the replace-existing-column behavior. I thought it was intentional that we supported multiple columns with the same name before?
I don't know the reason. The original commit can be found at amplab-extras/SparkR-pkg#204.
I don't think it is related to supporting multiple columns with the same name. Spark Core itself allows multiple columns with the same name:
scala> val df=sqlContext.createDataFrame(Seq((1,2,3))).toDF("a","a","c")
df: org.apache.spark.sql.DataFrame = [a: int, a: int, c: int]
scala> df.show
+---+---+---+
|  a|  a|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
scala> df.withColumn("a", df("c")).show
+---+---+---+
|  a|  a|  c|
+---+---+---+
|  3|  3|  3|
+---+---+---+
As the example above shows, all columns with the same name are replaced.
I know the reason. When withColumn was implemented in SparkR, withColumn() in Scala supported only adding columns, not replacing existing ones. Later, withColumn() in Scala was enhanced to support replacing existing columns (see #5541). However, withColumn in SparkR was not synced with Scala until this PR :)
Test build #47571 has finished for PR 10201 at commit
function(x, col) {
  stopifnot(class(col) == "character" || class(col) == "Column")

  if (class(col) == "character") {
I'd flip this check, since @jc should only be called on Column, but it's a minor point since it's checked in line 2245.
@felixcheung, yes, this may cause a backward-compatibility issue. But it is not SparkR-specific, as it's a change in Spark SQL core. Where is the appropriate place to document it?
Test build #47580 has finished for PR 10201 at commit
SQL and MLlib have a "Migration guide" section, perhaps something like that? http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide
@felixcheung Was there a migration guide entry for
@shivaram I checked the release notes and the programming guide/migration guide, and I don't see any reference to withColumn for Spark 1.4.0 or 1.4.1. Perhaps the behavior change happened before the 1.4.0 release?
According to https://issues.apache.org/jira/browse/SPARK-6635 and https://issues.apache.org/jira/browse/SPARK-10073, the feature landed for Scala in Spark 1.4.0 and for Python in 1.5.0. But it seems both just had the API updated, without any migration guide entry for the compatibility break. Do we need to do it specifically for SparkR?
@felixcheung, @shivaram, the documentation for withColumn has been changed. Please take a look.
Test build #47643 has finished for PR 10201 at commit
Test build #47647 has finished for PR 10201 at commit
@felixcheung, I refined the wording: Prior to 1.4, DataFrame.withColumn() supported adding a column only. The column was always added as a new column with its specified name in the result DataFrame, even if existing columns had the same name. Since 1.4, DataFrame.withColumn() supports adding a column whose name differs from all existing column names, or replacing existing columns of the same name. Any comments?
that's good, thanks
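For illustration, the two behaviors described in the wording above can be modeled with a toy sketch. This is plain Python, not Spark or SparkR code: the list-of-pairs frame representation and the function names `with_column_append` and `with_column_replace` are hypothetical and exist only to make the difference concrete.

```python
# Toy model only, NOT Spark code: a "frame" is an ordered list of
# (column_name, value) pairs, so duplicate column names are allowed.

def with_column_append(frame, name, value):
    """Behavior described for releases prior to 1.4: always append a new column."""
    return frame + [(name, value)]

def with_column_replace(frame, name, value):
    """Behavior described since 1.4: replace every column of that name, else append."""
    if any(n == name for n, _ in frame):
        return [(n, value if n == name else v) for n, v in frame]
    return frame + [(name, value)]

# Same shape as the Scala example earlier in the thread: columns a, a, c.
frame = [("a", 1), ("a", 2), ("c", 3)]
appended = with_column_append(frame, "a", 3)   # a third "a" column is added
replaced = with_column_replace(frame, "a", 3)  # both "a" columns are overwritten
```

Under this model, code that relied on the old behavior to create a second column with an existing name would silently overwrite the existing column instead, which is the compatibility break the migration note is meant to flag.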
Test build #47718 has finished for PR 10201 at commit
#' sc <- sparkR.init()
#' sqlCtx <- sparkRSQL.init(sc)
#' path <- "path/to/file.json"
#' df <- jsonFile(sqlCtx, path)
update this to read.json?
good catch. thanks
looks good - only a minor code doc comment.
Test build #47803 has finished for PR 10201 at commit
Any other comments? @shivaram, could you merge it?
@sun-rui Sorry for the delay in looking at this. Could you bring this up to date with master? It looks good to me.
@@ -2073,6 +2073,8 @@ options.
--conf spark.sql.hive.thriftServer.singleSession=true \
...
{% endhighlight %}
- Since 1.6.1, withColumn method in sparkR supports adding a new column to or replacing existing columns
which version is appropriate here? 1.6.1 or 2.0?
Maybe we want to put this in the R migration guide section instead of SQL? Or both?
rebased to master
Test build #49758 has finished for PR 10201 at commit
Test build #49777 has finished for PR 10201 at commit
Test build #49846 has finished for PR 10201 at commit
LGTM
Merging this to master