[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

bdoepf · 2015-12-18T09:11:12Z

This fix is for the following issue SPARK-12421

sarutak · 2015-12-18T09:37:50Z

ok to test.

sarutak · 2015-12-18T09:39:20Z

@apo1 LGTM but could you make the description?

bdoepf · 2015-12-18T10:30:37Z

@sarutak Do you mean the description of this pull-request? I added the link to the Jira. Ok?

sarutak · 2015-12-18T10:41:15Z

OK, Thanks! Also, it's better if you add a test for change.

hvanhovell · 2015-12-18T11:14:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala

@@ -201,7 +201,7 @@ class GenericRow(protected[sql] val values: Array[Any]) extends Row {

  override def toSeq: Seq[Any] = values.toSeq

-  override def copy(): Row = this
+  override def copy(): Row = new GenericRow(values.clone())


GenericRow is supposed to be immutable. So there we really shouldn't need to copy its values. Could you explain why this is needed?

Yes it is immutable, but if you want to extract the values of a GenericRows, change them and create a new row, then it's better to not change the original values.
So you would first want to receive a copy of the original row.

Also the method name copy() implies that it returns a real copy and not the reference on the original object.

Other rows like GenericInternalRow implement a correct copy method:
override def copy(): InternalRow = new GenericInternalRow(values.clone())

You cannot a put value back in the GenericRow when you have taken it out, unless it is a mutable object. Could you provide an example for this? Which is also a good basis for a unit test.

copy is merely a contract all rows need to adhere to. If a row is inmutable, why copy it? It also avoids a lot object allocations. I think GenericInternalRow should also return reference to it self on copy.

I try to provide a simple example and I could not try the code below right now. Imagine you want to query a row, copy it, change a value and then build a new one. Then you want to build a DF out of the new and the old row. I know that this is a strange example and there are a lot of better method to implement it, but you could implement it this way:

import org.apache.spark.sql.Row; val df = sc.parallelize(1,2,3,5,6,7,8,9).toDF.sort val firstRow= df.first val firstRowCopied = firstRow.copy() // <-- HERE the row will not be copied. val arr = firstRowCopied.toSeq.toArray arr(0) = arr(0) * 10 val newRow = Row.fromSeq(arr) sqlContext.createDataframe(List(firstRow, newRow), df.schema) // Both row will contain the value 10 and share the same value Sequence

If you don't want that someone uses the copy method, because it does not do what it implies, then i think we should provide another trait/contract which does not include this method. It's just confusing if the copy method does not copy the object.
row.copy() and just row are identical so far!

I don't see the point, why we should'nt provide this copy method for the users who want to use it.

I have been doing some digging. Using an adapted version of your code (only made it work):

import org.apache.spark.sql.Row; val df = sqlContext.range(1, 9).toDF val firstRow = df.first val firstRowCopied = firstRow.copy() // <-- HERE the row will not be copied. val arr = firstRowCopied.toSeq.toArray arr(0) = arr(0).asInstanceOf[Long] * 10 val newRow = Row.fromSeq(arr) val newDf = sqlContext.createDataFrame(sc.parallelize(Seq(firstRow, newRow)), df.schema)

newDF.show yields the following result:

+---+ | id| +---+ | 10| | 10| +---+

Which is wrong. What happens is that the firstRowCopied.toSeq wraps the value array in a ArrayOps object, this object returns wrapped backing array (instead of a copy) when the you invoke toArray. This really shouldn't happen, because now you mutable gain access to a structure which is supposed to be immutable. I think we should change the toSeq method instead.

I'm not sure what's the best solution. Maybe we need some help from more experienced committers here. I think we are agreed that we need a change here, but we differ how the change should look like.

Anyway, it's not urgent, because the wrong behave should only affect just a few developers/users.

SparkQA · 2015-12-18T11:29:07Z

Test build #47998 has finished for PR 10374 at commit e09e898.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sarutak · 2015-12-22T08:48:55Z

CC: @rxin @marmbrus

hvanhovell · 2015-12-22T08:49:23Z

@apo1 Could you add a unit test for this?

bdoepf · 2015-12-22T09:21:00Z

@hvanhovell Ok, I try. But I'm not sure where's the right place for the unit test. sql/catalyst/src/test/scala/org/apache/spark/sql/Rowtest seems to initialize a GenericRow to test some stuff, so I add the test below there, ok?

SparkQA · 2015-12-22T12:03:53Z

Test build #48190 has finished for PR 10374 at commit 4179fd9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-22T18:42:26Z

Test build #48206 has finished for PR 10374 at commit ef4a636.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2015-12-22T21:16:35Z

The comment in GenericRow explicitly says not to do this.

"A row implementation that uses an array of objects as the underlying storage. Note that, while
the array is not copied, and thus could technically be mutated after creation, this is not
allowed."

If we change the behavior, we should update that too.

hvanhovell · 2015-12-22T21:44:27Z

@nongli I think the comment that comment is mainly meant for developers passing an array into the Row constructor; they should not reuse the array. We currently leak a reference to the backing array by accident, which can be a problem.

cloud-fan · 2015-12-23T07:18:45Z

agree with @hvanhovell about the meaning of that comment. However, I don't think we need to change the copy method as Row is meant to be immutable. row.toSeq.toArray seems a hack to get the underlying array and change value of the row. We should either add doc to say not do this, or fix the toSeq to not leak the backing array.

marmbrus · 2015-12-23T17:27:01Z

I would prefer to change the implementation of toSeq.

It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem. This PR was inspired by #10374 by apo1. cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation). Author: Herman van Hovell <[email protected]> Closes #10553 from hvanhovell/SPARK-12421.

rxin · 2016-06-15T22:18:21Z

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

Fixed copy() method of GenericRow.

e09e898

hvanhovell reviewed Dec 18, 2015
View reviewed changes

Added unit test for spark-12421

4179fd9

Merge branch 'master' into myBranch

ef4a636

hvanhovell mentioned this pull request Jan 2, 2016

[SPARK-12421][SQL] Prevent Internal/External row from exposing state. #10553

Closed

asfgit closed this in 1a33f2e Jun 15, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

bdoepf commented Dec 18, 2015

sarutak commented Dec 18, 2015

sarutak commented Dec 18, 2015

bdoepf commented Dec 18, 2015

sarutak commented Dec 18, 2015

hvanhovell Dec 18, 2015

bdoepf Dec 18, 2015

hvanhovell Dec 18, 2015

bdoepf Dec 18, 2015

hvanhovell Dec 18, 2015

bdoepf Dec 22, 2015

SparkQA commented Dec 18, 2015

sarutak commented Dec 22, 2015

hvanhovell commented Dec 22, 2015

bdoepf commented Dec 22, 2015

SparkQA commented Dec 22, 2015

SparkQA commented Dec 22, 2015

nongli commented Dec 22, 2015

hvanhovell commented Dec 22, 2015

cloud-fan commented Dec 23, 2015

marmbrus commented Dec 23, 2015

rxin commented Jun 15, 2016

[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

[SPARK-12421][SQL] Fixed copy() method of GenericRow. #10374

Conversation

bdoepf commented Dec 18, 2015

sarutak commented Dec 18, 2015

sarutak commented Dec 18, 2015

bdoepf commented Dec 18, 2015

sarutak commented Dec 18, 2015

hvanhovell Dec 18, 2015

Choose a reason for hiding this comment

bdoepf Dec 18, 2015

Choose a reason for hiding this comment

hvanhovell Dec 18, 2015

Choose a reason for hiding this comment

bdoepf Dec 18, 2015

Choose a reason for hiding this comment

hvanhovell Dec 18, 2015

Choose a reason for hiding this comment

bdoepf Dec 22, 2015

Choose a reason for hiding this comment

SparkQA commented Dec 18, 2015

sarutak commented Dec 22, 2015

hvanhovell commented Dec 22, 2015

bdoepf commented Dec 22, 2015

SparkQA commented Dec 22, 2015

SparkQA commented Dec 22, 2015

nongli commented Dec 22, 2015

hvanhovell commented Dec 22, 2015

cloud-fan commented Dec 23, 2015

marmbrus commented Dec 23, 2015

rxin commented Jun 15, 2016