Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

Closed
wants to merge 3 commits into from
Closed

[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

wants to merge 3 commits into from

Conversation

bdoepf
Copy link

@bdoepf bdoepf commented Dec 18, 2015

This fix is for the following issue SPARK-12421

@sarutak
Copy link
Member

sarutak commented Dec 18, 2015

ok to test.

@sarutak
Copy link
Member

sarutak commented Dec 18, 2015

@apo1 LGTM but could you make the description?

@bdoepf
Copy link
Author

bdoepf commented Dec 18, 2015

@sarutak Do you mean the description of this pull-request? I added the link to the Jira. Ok?

@sarutak
Copy link
Member

sarutak commented Dec 18, 2015

OK, Thanks! Also, it's better if you add a test for change.

@@ -201,7 +201,7 @@ class GenericRow(protected[sql] val values: Array[Any]) extends Row {

override def toSeq: Seq[Any] = values.toSeq

override def copy(): Row = this
override def copy(): Row = new GenericRow(values.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenericRow is supposed to be immutable. So there we really shouldn't need to copy its values. Could you explain why this is needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is immutable, but if you want to extract the values of a GenericRows, change them and create a new row, then it's better to not change the original values.
So you would first want to receive a copy of the original row.

Also the method name copy() implies that it returns a real copy and not the reference on the original object.

Other rows like GenericInternalRow implement a correct copy method:
override def copy(): InternalRow = new GenericInternalRow(values.clone())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot a put value back in the GenericRow when you have taken it out, unless it is a mutable object. Could you provide an example for this? Which is also a good basis for a unit test.

copy is merely a contract all rows need to adhere to. If a row is inmutable, why copy it? It also avoids a lot object allocations. I think GenericInternalRow should also return reference to it self on copy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to provide a simple example and I could not try the code below right now. Imagine you want to query a row, copy it, change a value and then build a new one. Then you want to build a DF out of the new and the old row. I know that this is a strange example and there are a lot of better method to implement it, but you could implement it this way:

import org.apache.spark.sql.Row;
val df = sc.parallelize(1,2,3,5,6,7,8,9).toDF.sort
val firstRow= df.first
val firstRowCopied = firstRow.copy()  // <-- HERE the row will not be copied. 
val arr = firstRowCopied.toSeq.toArray 
arr(0) =  arr(0) * 10
val newRow = Row.fromSeq(arr)
sqlContext.createDataframe(List(firstRow, newRow), df.schema)  // Both row will contain the value 10 and share the same value Sequence

If you don't want that someone uses the copy method, because it does not do what it implies, then i think we should provide another trait/contract which does not include this method. It's just confusing if the copy method does not copy the object.
row.copy() and just row are identical so far!

I don't see the point, why we should'nt provide this copy method for the users who want to use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been doing some digging. Using an adapted version of your code (only made it work):

import org.apache.spark.sql.Row;
val df = sqlContext.range(1, 9).toDF
val firstRow = df.first
val firstRowCopied = firstRow.copy()  // <-- HERE the row will not be copied.
val arr = firstRowCopied.toSeq.toArray
arr(0) =  arr(0).asInstanceOf[Long] * 10
val newRow = Row.fromSeq(arr)
val newDf = sqlContext.createDataFrame(sc.parallelize(Seq(firstRow, newRow)), df.schema)

newDF.show yields the following result:

+---+
| id|
+---+
| 10|
| 10|
+---+

Which is wrong. What happens is that the firstRowCopied.toSeq wraps the value array in a ArrayOps object, this object returns wrapped backing array (instead of a copy) when the you invoke toArray. This really shouldn't happen, because now you mutable gain access to a structure which is supposed to be immutable. I think we should change the toSeq method instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's the best solution. Maybe we need some help from more experienced committers here. I think we are agreed that we need a change here, but we differ how the change should look like.

Anyway, it's not urgent, because the wrong behave should only affect just a few developers/users.

@SparkQA
Copy link

SparkQA commented Dec 18, 2015

Test build #47998 has finished for PR 10374 at commit e09e898.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Copy link
Member

sarutak commented Dec 22, 2015

CC: @rxin @marmbrus

@hvanhovell
Copy link
Contributor

@apo1 Could you add a unit test for this?

@bdoepf
Copy link
Author

bdoepf commented Dec 22, 2015

@hvanhovell Ok, I try. But I'm not sure where's the right place for the unit test. sql/catalyst/src/test/scala/org/apache/spark/sql/Rowtest seems to initialize a GenericRow to test some stuff, so I add the test below there, ok?

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #48190 has finished for PR 10374 at commit 4179fd9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #48206 has finished for PR 10374 at commit ef4a636.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Copy link
Contributor

nongli commented Dec 22, 2015

The comment in GenericRow explicitly says not to do this.

"A row implementation that uses an array of objects as the underlying storage. Note that, while
the array is not copied, and thus could technically be mutated after creation, this is not
allowed."

If we change the behavior, we should update that too.

@hvanhovell
Copy link
Contributor

@nongli I think the comment that comment is mainly meant for developers passing an array into the Row constructor; they should not reuse the array. We currently leak a reference to the backing array by accident, which can be a problem.

@cloud-fan
Copy link
Contributor

agree with @hvanhovell about the meaning of that comment. However, I don't think we need to change the copy method as Row is meant to be immutable. row.toSeq.toArray seems a hack to get the underlying array and change value of the row. We should either add doc to say not do this, or fix the toSeq to not leak the backing array.

@marmbrus
Copy link
Contributor

I would prefer to change the implementation of toSeq.

asfgit pushed a commit that referenced this pull request Jan 4, 2016
It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem.

This PR was inspired by #10374 by apo1.

cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).

Author: Herman van Hovell <[email protected]>

Closes #10553 from hvanhovell/SPARK-12421.
@rxin
Copy link
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants