[SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison between ScalaUDF #16553

HyukjinKwon · 2017-01-11T18:21:42Z

What changes were proposed in this pull request?

Currently, running the following code in Java

spark.udf().register("inc", new UDF1<Long, Long>() {
  @Override
  public Long call(Long i) {
    return i + 1;
  }
}, DataTypes.LongType);

spark.range(10).toDF("x").createOrReplaceTempView("tmp");
Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
Assert.assertEquals(7, result.getLong(0));

fails as below:

org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
+- SubqueryAlias tmp, `tmp`
   +- Project [id#16L AS x#19L]
      +- Range (0, 10, step=1, splits=Some(8))

	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)

The root cause is that a new function object was created every time the expression was built, and two such objects never compare equal, as shown below:

scala> def inc(i: Int) = i + 1
inc: (i: Int)Int

scala> (inc(_: Int)).hashCode
res15: Int = 1231799381

scala> (inc(_: Int)).hashCode
res16: Int = 2109839984

scala> (inc(_: Int)) == (inc(_: Int))
res17: Boolean = false

This seems to lead to comparison failures between ScalaUDFs created from the Java UDF API, for example, in Expression.semanticEquals.

The Scala API seems to be fine already.
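For reviewers who prefer Java, the same identity problem can be sketched without Spark. This is only an illustration under stated assumptions: `Udf1` and `FunctionIdentityDemo` are hypothetical names, not Spark classes, and the anonymous `Function` stands in for the function object wrapped around the Java UDF.

```java
import java.util.function.Function;

// Hypothetical stand-in for the Java UDF1 interface; not Spark's actual class.
interface Udf1<T, R> {
    R call(T t);
}

final class FunctionIdentityDemo {
    // Wraps the UDF in a brand-new function object on every call, similar in
    // spirit to writing `f.call(_: Any)` inline each time an expression is built.
    static Function<Long, Long> wrap(Udf1<Long, Long> udf) {
        return new Function<Long, Long>() {
            @Override
            public Long apply(Long x) {
                return udf.call(x);
            }
        };
    }

    public static void main(String[] args) {
        Udf1<Long, Long> inc = i -> i + 1;

        Function<Long, Long> first = wrap(inc);
        Function<Long, Long> second = wrap(inc);

        // Same behavior, but two distinct objects: equals() falls back to identity.
        System.out.println(first.apply(1L).equals(second.apply(1L))); // true
        System.out.println(first.equals(second));                     // false

        // Reusing a single wrapper makes the comparison succeed.
        Function<Long, Long> reused = first;
        System.out.println(first.equals(reused));                     // true
    }
}
```

Any expression that compares the wrapped function by equality will therefore treat two independently built wrappers as different, even though they compute the same result.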

Both can be tested easily as below if any reviewer is more comfortable with Scala:

val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
val javaUDF = new UDF1[Int, Int]  {
  override def call(i: Int): Int = i + 1
}
// spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
// spark.udf.register("inc", (i: Int) => i + 1)    // Uncomment this for Scala API
df.createOrReplaceTempView("tmp")
spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()

How was this patch tested?

Unit test in JavaUDFSuite.java and ./dev/lint-java.


HyukjinKwon · 2017-01-11T18:32:00Z

cc @marmbrus, I just saw you in the JIRA. Could you please take a look?

marmbrus · 2017-01-11T19:45:04Z

sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala

@@ -488,219 +488,241 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
   * @since 1.3.0
   */
  def register(name: String, f: UDF1[_, _], returnType: DataType): Unit = {
+    val func = f.asInstanceOf[UDF1[Any, Any]].call(_: Any)


There is commented-out code above that's used to generate these functions. We should update it or delete it.

Ah, sure. Thanks!

SparkQA · 2017-01-11T20:45:57Z

Test build #71224 has finished for PR 16553 at commit 30ed14f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-11T20:46:35Z

Test build #71225 has finished for PR 16553 at commit 3dea44f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-01-12T00:40:08Z

sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala

         |  functionRegistry.registerFunction(
         |    name,
-         |    (e: Seq[Expression]) => ScalaUDF(f$anyCast.call($anyParams), returnType, e))
+         |    (e: Seq[Expression]) => ScalaUDF(func, returnType, e))


I verified this by copying and pasting the generated output over the current changes and checking that there was no diff.

I can confirm they are the same.
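The shape of the change (build the function once at registration and let the builder reuse it) can be sketched in plain Java as follows. This is a toy model, not Spark code: `UdfExpr`, `BuilderDemo`, and both builder methods are hypothetical names, and `UdfExpr.equals` only assumes that the expression compares its function by equality.

```java
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Toy expression node holding a function; loosely plays the role of ScalaUDF here.
final class UdfExpr {
    final Function<Object, Object> function;
    final List<String> children;

    UdfExpr(Function<Object, Object> function, List<String> children) {
        this.function = function;
        this.children = children;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof UdfExpr)) return false;
        UdfExpr that = (UdfExpr) other;
        // Function objects have no structural equality, so this is effectively identity.
        return function.equals(that.function) && children.equals(that.children);
    }

    @Override
    public int hashCode() {
        return Objects.hash(function, children);
    }
}

final class BuilderDemo {
    // Always allocates a new wrapper object around the given UDF.
    static Function<Object, Object> wrap(Function<Long, Long> udf) {
        return new Function<Object, Object>() {
            @Override
            public Object apply(Object x) {
                return udf.apply((Long) x);
            }
        };
    }

    // Before: the wrapper is created inside the builder, once per built expression.
    static Function<List<String>, UdfExpr> builderCreatingEachTime(Function<Long, Long> udf) {
        return children -> new UdfExpr(wrap(udf), children);
    }

    // After: the wrapper is created once and captured by the builder.
    static Function<List<String>, UdfExpr> builderReusing(Function<Long, Long> udf) {
        Function<Object, Object> func = wrap(udf);
        return children -> new UdfExpr(func, children);
    }

    public static void main(String[] args) {
        Function<Long, Long> inc = i -> i + 1;
        List<String> cols = Collections.singletonList("x");

        Function<List<String>, UdfExpr> before = builderCreatingEachTime(inc);
        Function<List<String>, UdfExpr> after = builderReusing(inc);

        System.out.println(before.apply(cols).equals(before.apply(cols))); // false
        System.out.println(after.apply(cols).equals(after.apply(cols)));   // true
    }
}
```

With the reused function, the expression built for the SELECT list and the one built for the GROUP BY clause compare equal in this model, which is the property the GROUP BY query in the description depends on.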

SparkQA · 2017-01-12T02:56:14Z

Test build #71235 has finished for PR 16553 at commit 2ea071a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-01-15T01:34:22Z

@marmbrus, could you take another look when you have some time?

gatorsmile · 2017-01-15T06:58:25Z

LGTM. cc @marmbrus for final sign off

HyukjinKwon · 2017-01-15T12:46:12Z

@gatorsmile Thanks!

viirya · 2017-01-15T13:05:03Z

sql/core/src/test/java/test/org/apache/spark/sql/JavaUDFSuite.java

+    }, DataTypes.LongType);
+
+    spark.range(10).toDF("x").createOrReplaceTempView("tmp");
+    List<Row> results = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").collectAsList();


It is not obvious what this test is checking. Can we add a few comments explaining that?

Sure, makes sense. Thanks!

viirya · 2017-01-15T13:05:33Z

One minor comment, otherwise LGTM.

SparkQA · 2017-01-15T15:43:12Z

Test build #71398 has finished for PR 16553 at commit 0d5b586.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-01-17T03:39:13Z

@marmbrus Can this be merged by any chance?

HyukjinKwon · 2017-01-21T16:23:29Z

gentle ping..

gatorsmile · 2017-01-23T23:25:38Z

retest this please

gatorsmile · 2017-01-23T23:26:21Z

Maybe we can merge it now and you can resolve any extra comment from @marmbrus as a followup

HyukjinKwon · 2017-01-23T23:35:58Z

@gatorsmile Thanks !

SparkQA · 2017-01-24T01:48:26Z

Test build #71885 has finished for PR 16553 at commit 0d5b586.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-01-24T02:49:24Z

(FWIW, I checked ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package)

…ressions that require equality comparison between ScalaUDF

Author: hyukjinkwon

Closes #16553 from HyukjinKwon/SPARK-9435.

(cherry picked from commit e576c1e)
Signed-off-by: gatorsmile

gatorsmile · 2017-01-24T06:21:57Z

Thanks! Merging to master.

HyukjinKwon · 2017-01-24T07:03:17Z

Thank you @gatorsmile


HyukjinKwon added 2 commits January 12, 2017 03:02

Reuse function in Java UDF to support correctly expression equality comparison

30ed14f

Better test

3dea44f

HyukjinKwon changed the title ~~[SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison~~ [SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison between ScalaUDF Jan 11, 2017

marmbrus reviewed Jan 11, 2017


Fix comments that are used to print out the register functions

2ea071a

HyukjinKwon commented Jan 12, 2017


viirya reviewed Jan 15, 2017


Add a comment to explain what the test is for

0d5b586

asfgit closed this in e576c1e Jan 24, 2017

HyukjinKwon deleted the SPARK-9435 branch January 2, 2018 03:44
