-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374
Conversation
ok to test. |
@apo1 LGTM but could you make the description? |
@sarutak Do you mean the description of this pull-request? I added the link to the Jira. Ok? |
OK, Thanks! Also, it's better if you add a test for change. |
@@ -201,7 +201,7 @@ class GenericRow(protected[sql] val values: Array[Any]) extends Row { | |||
|
|||
override def toSeq: Seq[Any] = values.toSeq | |||
|
|||
override def copy(): Row = this | |||
override def copy(): Row = new GenericRow(values.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenericRow
is supposed to be immutable. So there we really shouldn't need to copy its values. Could you explain why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is immutable, but if you want to extract the values of a GenericRows, change them and create a new row, then it's better to not change the original values.
So you would first want to receive a copy of the original row.
Also the method name copy() implies that it returns a real copy and not the reference on the original object.
Other rows like GenericInternalRow implement a correct copy method:
override def copy(): InternalRow = new GenericInternalRow(values.clone())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot a put value back in the GenericRow
when you have taken it out, unless it is a mutable object. Could you provide an example for this? Which is also a good basis for a unit test.
copy
is merely a contract all rows need to adhere to. If a row is inmutable, why copy it? It also avoids a lot object allocations. I think GenericInternalRow
should also return reference to it self on copy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to provide a simple example and I could not try the code below right now. Imagine you want to query a row, copy it, change a value and then build a new one. Then you want to build a DF out of the new and the old row. I know that this is a strange example and there are a lot of better method to implement it, but you could implement it this way:
import org.apache.spark.sql.Row;
val df = sc.parallelize(1,2,3,5,6,7,8,9).toDF.sort
val firstRow= df.first
val firstRowCopied = firstRow.copy() // <-- HERE the row will not be copied.
val arr = firstRowCopied.toSeq.toArray
arr(0) = arr(0) * 10
val newRow = Row.fromSeq(arr)
sqlContext.createDataframe(List(firstRow, newRow), df.schema) // Both row will contain the value 10 and share the same value Sequence
If you don't want that someone uses the copy method, because it does not do what it implies, then i think we should provide another trait/contract which does not include this method. It's just confusing if the copy method does not copy the object.
row.copy()
and just row
are identical so far!
I don't see the point, why we should'nt provide this copy method for the users who want to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been doing some digging. Using an adapted version of your code (only made it work):
import org.apache.spark.sql.Row;
val df = sqlContext.range(1, 9).toDF
val firstRow = df.first
val firstRowCopied = firstRow.copy() // <-- HERE the row will not be copied.
val arr = firstRowCopied.toSeq.toArray
arr(0) = arr(0).asInstanceOf[Long] * 10
val newRow = Row.fromSeq(arr)
val newDf = sqlContext.createDataFrame(sc.parallelize(Seq(firstRow, newRow)), df.schema)
newDF.show
yields the following result:
+---+
| id|
+---+
| 10|
| 10|
+---+
Which is wrong. What happens is that the firstRowCopied.toSeq
wraps the value array in a ArrayOps
object, this object returns wrapped backing array (instead of a copy) when the you invoke toArray
. This really shouldn't happen, because now you mutable gain access to a structure which is supposed to be immutable. I think we should change the toSeq
method instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's the best solution. Maybe we need some help from more experienced committers here. I think we are agreed that we need a change here, but we differ how the change should look like.
Anyway, it's not urgent, because the wrong behave should only affect just a few developers/users.
Test build #47998 has finished for PR 10374 at commit
|
@apo1 Could you add a unit test for this? |
@hvanhovell Ok, I try. But I'm not sure where's the right place for the unit test. |
Test build #48190 has finished for PR 10374 at commit
|
Test build #48206 has finished for PR 10374 at commit
|
The comment in GenericRow explicitly says not to do this. "A row implementation that uses an array of objects as the underlying storage. Note that, while If we change the behavior, we should update that too. |
@nongli I think the comment that comment is mainly meant for developers passing an array into the Row constructor; they should not reuse the array. We currently leak a reference to the backing array by accident, which can be a problem. |
agree with @hvanhovell about the meaning of that comment. However, I don't think we need to change the |
I would prefer to change the implementation of toSeq. |
It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem. This PR was inspired by #10374 by apo1. cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation). Author: Herman van Hovell <[email protected]> Closes #10553 from hvanhovell/SPARK-12421.
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. |
This fix is for the following issue SPARK-12421