[SPARK-5817] [SQL] Fix bug of udtf with column names #4602

chenghao-intel · 2015-02-14T03:50:53Z

The bug appears when running a query like:

select d from (select explode(array(1,1)) d from src limit 1) t

It throws an exception like:

org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

Fixing the bug requires refactoring the UDTF code.
The major changes are:

Simplifying UDTF development: a UDTF no longer manages its output attribute names; instead, logical.Generate handles them.
The UDTF is asked for its output schema (data types) during logical plan analysis.
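
To make the split of responsibilities concrete, here is a minimal sketch of the idea, not the code in this patch. All types below are hypothetical stand-ins for the Catalyst classes, `elementTypes` is an illustrative name, and the `_c0`-style default is assumed only because that name appears in the error message above. The generator reports data types only, and the plan node assigns the names.

```scala
object GenerateNamingSketch {
  // Hypothetical stand-ins for Catalyst types; these are NOT the real Spark classes.
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType

  final case class Attribute(name: String, dataType: DataType)

  // The generator only reports the data types of the columns it produces.
  trait Generator {
    def elementTypes: Seq[DataType]
  }

  final case class Explode(elementType: DataType) extends Generator {
    def elementTypes: Seq[DataType] = Seq(elementType)
  }

  // The plan node owns the names: a user-supplied alias wins, otherwise a
  // positional default (_c0, _c1, ...) is used. The default is an assumption.
  final case class Generate(generator: Generator, aliases: Seq[String]) {
    def output: Seq[Attribute] =
      generator.elementTypes.zipWithIndex.map { case (dataType, i) =>
        Attribute(aliases.lift(i).getOrElse(s"_c$i"), dataType)
      }
  }
}
```

With this shape, `Generate(Explode(IntType), Seq("d")).output` exposes a column named `d`, so an outer query selecting `d` has something to resolve against.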

SparkQA · 2015-02-14T03:52:27Z

Test build #27472 has started for PR 4602 at commit 5ddab7e.

This patch merges cleanly.

SparkQA · 2015-02-14T03:53:30Z

Test build #27472 has finished for PR 4602 at commit 5ddab7e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-14T03:53:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27472/
Test FAILed.

SparkQA · 2015-02-14T04:22:35Z

Test build #27473 has started for PR 4602 at commit 7738ca6.

This patch merges cleanly.

scwf · 2015-02-14T04:49:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

@@ -101,6 +101,7 @@ case class Alias(child: Expression, name: String)
  extends NamedExpression with trees.UnaryNode[Expression] {

  override type EvaluatedType = Any
+  override lazy val resolved = childrenResolved && !child.isInstanceOf[Generator]


Why this change?

Alias(Generator) is not like a normal expression; it will be transformed into Generate(Generator, alias).

Can you add a comment to this effect?
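
For readers following the thread, a small self-contained sketch of what the overridden `resolved` achieves. The classes here are simplified, hypothetical stand-ins rather than the Catalyst ones, and `def` is used where the diff uses `lazy val`. An alias wrapped around a generator stays unresolved, which leaves it for the analyzer to rewrite into a Generate node.

```scala
object AliasResolutionSketch {
  // Simplified, hypothetical expression tree; not the real Catalyst classes.
  trait Expression {
    def children: Seq[Expression]
    def childrenResolved: Boolean = children.forall(_.resolved)
    def resolved: Boolean = childrenResolved
  }

  final case class Literal(value: Int) extends Expression {
    def children: Seq[Expression] = Nil
  }

  trait Generator extends Expression

  final case class Explode(child: Expression) extends Generator {
    def children: Seq[Expression] = Seq(child)
  }

  final case class Alias(child: Expression, name: String) extends Expression {
    def children: Seq[Expression] = Seq(child)
    // An alias over a generator is never considered resolved, so it cannot
    // survive analysis as an ordinary projection expression.
    override def resolved: Boolean =
      childrenResolved && !child.isInstanceOf[Generator]
  }
}
```

In this sketch `Alias(Literal(1), "x").resolved` is true, while `Alias(Explode(Literal(1)), "d").resolved` is false even though all of its children are resolved.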

SparkQA · 2015-02-14T05:29:36Z

Test build #27473 has finished for PR 4602 at commit 7738ca6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-14T05:29:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27473/
Test PASSed.

marmbrus · 2015-02-14T17:32:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -137,6 +137,11 @@ class Analyzer(catalog: Catalog,
              failAnalysis(
                s"unresolved operator ${operator.simpleString}")

+            case p @ Project(exprs, _) if exprs.length > 1 && exprs.collect {


perhaps exprs.find(_.isInstanceOf[Generator]).isDefined

e.g. Project(Alias(Generator1, name), Alias(Generator2, name2))

Oh, it's a bug in my code, thanks for finding this. :)

SparkQA · 2015-02-15T01:42:34Z

Test build #27499 has started for PR 4602 at commit 9656e51.

This patch merges cleanly.

SparkQA · 2015-02-15T03:02:08Z

Test build #27499 has finished for PR 4602 at commit 9656e51.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-15T03:02:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27499/
Test PASSed.

chenghao-intel · 2015-02-15T03:12:24Z

@marmbrus any more comments on this?

yhuai · 2015-02-17T00:53:47Z

I tried the following

val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CREATE TABLE gen_tmp (key Int)")
sqlContext.sql("INSERT OVERWRITE TABLE gen_tmp SELECT explode(array(1,2,3)) AS val FROM jt LIMIT 1")

org.apache.spark.sql.AnalysisException: invalid cast from array<struct<_c0:int>> to int;
    at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.failAnalysis(Analyzer.scala:85)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:98)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$18$$anonfun$apply$2.applyOrElse(Analyzer.scala:92)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)

chenghao-intel · 2015-02-17T07:42:51Z

Thank you @yhuai , I've updated the description and rebased the code.

SparkQA · 2015-02-17T07:42:56Z

Test build #27617 has started for PR 4602 at commit f6907d2.

This patch merges cleanly.

chenghao-intel · 2015-02-17T07:56:08Z

retest this please.

SparkQA · 2015-02-17T07:57:31Z

Test build #27620 has started for PR 4602 at commit f6907d2.

This patch merges cleanly.

SparkQA · 2015-02-17T09:01:54Z

Test build #27617 has finished for PR 4602 at commit f6907d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ShowTablesCommand(databaseName: Option[String]) extends RunnableCommand

AmplabJenkins · 2015-02-17T09:01:57Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27617/
Test PASSed.

SparkQA · 2015-02-17T09:14:39Z

Test build #27620 has finished for PR 4602 at commit f6907d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ShowTablesCommand(databaseName: Option[String]) extends RunnableCommand

AmplabJenkins · 2015-02-17T09:14:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27620/
Test PASSed.

chenghao-intel · 2015-02-18T02:45:57Z

/cc @marmbrus @yhuai Any comment on this?

yhuai · 2015-02-18T03:56:39Z

@chenghao-intel After another look at the code, I think it may be better to remove aliases from the generator. Then, MultiAlias can be used to assign the names to the output fields of a generator. Because a Generator is just an Expression, it does not seem like a good idea to put names in it. Instead, using a NamedExpression (e.g. MultiAlias) to wrap a Generator looks like a better approach.

chenghao-intel · 2015-02-22T03:30:04Z

The generator is not like a normal expression: it can output multiple columns. In the current implementation, the logical plan node Generate serves that purpose, not Project. I agree that this needs to be improved, as there is some duplicated code in the subclasses of Generator; probably all we need is a more general logical plan node Project? But that seems to require more changes. I can do that after this PR is merged.

chenghao-intel · 2015-02-22T04:45:55Z

@yhuai please ignore my previous comment. I was thinking about some other possibilities.
I agree with you that we can move the output column names into the logical plan node Generate, but I am not sure whether the generator expression itself should still be able to manage the default field names (when none are specified).

chenghao-intel · 2015-02-25T00:51:31Z

@yhuai @marmbrus this is a bug fix, so it would be great if you could give more comments on it. I agree with @yhuai that we need to refactor the UDTF expression implementation, but can I put that in the next PR? This is actually a blocking issue for our internal benchmark.

marmbrus · 2015-02-25T05:01:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -144,6 +144,12 @@ class Analyzer(catalog: Catalog,
              failAnalysis(
                s"unresolved operator ${operator.simpleString}")

+            case p @ Project(exprs, _) if exprs.length > 1 && exprs.flatMap(_.collect {


pull containsMultipleGenerators out into a function.
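
A rough sketch of what such a helper could look like, using a hypothetical miniature expression tree instead of the Catalyst classes; it is not the code that was merged. It walks every projected expression, including nested ones, and counts the generators it finds.

```scala
object MultipleGeneratorsSketch {
  // Hypothetical miniature expression tree; not the real Catalyst classes.
  sealed trait Expression {
    def children: Seq[Expression]
    // Collect matches from this node and all of its descendants (pre-order).
    def collect[B](pf: PartialFunction[Expression, B]): Seq[B] =
      pf.lift(this).toList ++ children.flatMap(_.collect(pf))
  }

  final case class Literal(value: Int) extends Expression {
    def children: Seq[Expression] = Nil
  }

  sealed trait Generator extends Expression

  final case class Explode(child: Expression) extends Generator {
    def children: Seq[Expression] = Seq(child)
  }

  final case class Alias(child: Expression, name: String) extends Expression {
    def children: Seq[Expression] = Seq(child)
  }

  // True when more than one generator appears anywhere in the projection list.
  def containsMultipleGenerators(exprs: Seq[Expression]): Boolean =
    exprs.flatMap(_.collect { case g: Generator => g }).length > 1
}
```

Because the check collects through the whole tree, it also catches generators hidden under an alias, such as `Project(Alias(Generator1, name), Alias(Generator2, name2))` from the earlier review comment.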

marmbrus · 2015-04-14T21:23:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+          case p @ Project(exprs, _) if containsMultipleGenerators(exprs) =>
+            failAnalysis(
+              s"""Only a single table generating function is allowed in a SELECT clause, found:
+                 | ${exprs.map(_.prettyString).mkString(",")}""".stripMargin)


Do we have a test for this error?

Yes, I added one to the unit tests; see HiveQuerySuite.scala.

SparkQA · 2015-04-17T07:48:37Z

Test build #30468 has started for PR 4602 at commit d2e8b43.

SparkQA · 2015-04-17T08:46:16Z

Test build #30468 has finished for PR 4602 at commit d2e8b43.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Explode(child: Expression)
This patch does not change any dependencies.

AmplabJenkins · 2015-04-17T08:46:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30468/
Test FAILed.

SparkQA · 2015-04-17T17:28:40Z

Test build #30489 has started for PR 4602 at commit 5ee5d2c.

SparkQA · 2015-04-17T17:43:34Z

Test build #30490 has started for PR 4602 at commit 04ae500.

SparkQA · 2015-04-17T18:08:30Z

Test build #30491 has started for PR 4602 at commit 556e982.

SparkQA · 2015-04-17T18:23:36Z

Test build #30493 has started for PR 4602 at commit c2a5132.

SparkQA · 2015-04-17T18:25:42Z

Test build #30489 has finished for PR 4602 at commit 5ee5d2c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Explode(child: Expression)
This patch does not change any dependencies.

AmplabJenkins · 2015-04-17T18:25:46Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30489/
Test FAILed.

SparkQA · 2015-04-17T18:41:50Z

Test build #30490 has finished for PR 4602 at commit 04ae500.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Explode(child: Expression)
This patch does not change any dependencies.

AmplabJenkins · 2015-04-17T18:41:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30490/
Test FAILed.

SparkQA · 2015-04-17T19:05:16Z

Test build #30491 has finished for PR 4602 at commit 556e982.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Explode(child: Expression)
This patch does not change any dependencies.

AmplabJenkins · 2015-04-17T19:05:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30491/
Test FAILed.

SparkQA · 2015-04-17T20:15:02Z

Test build #30493 has finished for PR 4602 at commit c2a5132.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Explode(child: Expression)
This patch does not change any dependencies.

AmplabJenkins · 2015-04-17T20:15:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30493/
Test PASSed.

marmbrus · 2015-04-21T22:11:36Z

Thanks, merged to master.

Author: Cheng Hao

Closes apache#4602 from chenghao-intel/explode_bug and squashes the following commits:

c2a5132 [Cheng Hao] add back resolved for Alias
556e982 [Cheng Hao] revert the unncessary change
002c361 [Cheng Hao] change the rule of resolved for Generate
04ae500 [Cheng Hao] add qualifier only for generator output
5ee5d2c [Cheng Hao] prepend the new qualifier
d2e8b43 [Cheng Hao] Update the code as feedback
ca5e7f4 [Cheng Hao] shrink the commits

chenghao-intel force-pushed the explode_bug branch from 9656e51 to f6907d2 · February 17, 2015 07:39

Update the code as feedback

d2e8b43

chenghao-intel force-pushed the explode_bug branch from eb8178c to d2e8b43 · April 17, 2015 07:45

prepend the new qualifier

5ee5d2c

add qualifier only for generator output

04ae500

chenghao-intel added 2 commits April 18, 2015 02:03

change the rule of resolved for Generate

002c361

revert the unncessary change

556e982

add back resolved for Alias

c2a5132

asfgit closed this in 7662ec2 Apr 21, 2015

chenghao-intel deleted the explode_bug branch July 2, 2015 08:44
