SPARK-2180: support HAVING clauses in Hive queries #1136

Closed
wants to merge 6 commits into from

willb (Contributor) commented Jun 19, 2014

This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations. The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to HiveQuerySuite.

@AmplabJenkins

Merged build finished. All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15917/

rxin (Contributor) commented Jun 19, 2014

Any idea why the having test from Hive is not runnable?

willb (Contributor, Author) commented Jun 19, 2014

@rxin, I'm not 100% sure but I think it's a problem with local map/reduce (the stack trace isn't too informative, but it's the same as the one for tests that are blacklisted due to missing local map/reduce).

I have another commit to push here (adding a semantic exception when HAVING is specified without GROUP BY and test coverage for same).

@AmplabJenkins

Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15925/

@AmplabJenkins

Merged build finished. All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15928/

rxin (Contributor) commented Jun 20, 2014

Thanks, @willb. I found at least one problem: I think you need to add a cast to the HAVING expression. Otherwise, try running the following:
select key, count(*) c from src group by key having c

In Hive this returns nothing, but in Spark SQL with this patch it throws a runtime exception because it fails to cast an integer to a boolean.

rxin (Contributor) commented Jun 20, 2014

To be more specific, I think you can always add a cast that casts the HAVING expression to boolean; SimplifyCasts in the optimizer will then remove unnecessary casts.
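
The suggestion above (always add the cast, then let the optimizer drop redundant ones) can be illustrated with a toy expression tree. This is a minimal Python sketch, not Spark's Catalyst code: the classes `Attribute`, `GreaterThan`, and `Cast` and the functions `having_condition` and `simplify_casts` are hypothetical stand-ins, and types are modeled as plain strings.

```python
from dataclasses import dataclass

# Toy stand-ins for expression nodes; these are NOT Spark's classes.
@dataclass(frozen=True)
class Attribute:
    name: str
    dtype: str  # e.g. "int" or "boolean"

@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object
    dtype: str = "boolean"  # a comparison always produces a boolean

@dataclass(frozen=True)
class Cast:
    child: object
    dtype: str

def having_condition(expr):
    """Always wrap the HAVING expression in a cast to boolean."""
    return Cast(expr, "boolean")

def simplify_casts(expr):
    """Drop a cast whose child already has the target type."""
    if isinstance(expr, Cast):
        child = simplify_casts(expr.child)
        return child if child.dtype == expr.dtype else Cast(child, expr.dtype)
    return expr

c = Attribute("c", "int")
# The comparison is already boolean, so the added cast disappears.
print(simplify_casts(having_condition(GreaterThan(c, 4))))
# A bare integer column keeps its cast to boolean.
print(simplify_casts(having_condition(c)))
```

In this model the parser never has to inspect types: it wraps unconditionally, and the simplification step decides later whether the cast was needed.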

willb (Contributor, Author) commented Jun 20, 2014

Thanks for the catch, @rxin! I'll make the change and add tests for it.

willb (Contributor, Author) commented Jun 20, 2014

I've added a cast for cases in which a non-boolean expression is supplied as the HAVING condition. It appears that Cast(_, BooleanType) isn't idempotent, though: if you apply it to a Boolean (say, x > 4), it will translate that to NOT ((x > 4) = 0). This seems like a bug, but it's possible that I'm missing the reason why it should work that way. Should I change Cast so that casting an X to X is a no-op?

(Checking the type of a variable during parse doesn't work, so I wind up with a different exception in examples like the one you posted. I'll either need to fix the behavior of Cast or delay adding the cast until I have type information.)
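
The second option mentioned above, delaying the cast until type information is available, might look roughly like the following. This is a hedged Python sketch under stated assumptions: `Column`, `Cast`, `resolve`, and `cast_if_needed` are hypothetical names, and the schema is a made-up name-to-type mapping rather than anything taken from Spark.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Column:
    name: str
    dtype: Optional[str] = None  # None means "not resolved yet" (parse time)

@dataclass(frozen=True)
class Cast:
    child: object
    dtype: str

def resolve(col, schema):
    """Look up the column's type in a name -> type mapping."""
    return Column(col.name, schema[col.name])

def cast_if_needed(col):
    """Add a boolean cast only when the resolved type is not boolean."""
    if col.dtype is None:
        # This is the parse-time situation: the type is simply not known yet.
        raise ValueError(f"type of {col.name!r} is unknown before resolution")
    return col if col.dtype == "boolean" else Cast(col, "boolean")

# Made-up output schema of an aggregate, for illustration only.
schema = {"c": "bigint", "flag": "boolean"}

print(cast_if_needed(resolve(Column("c"), schema)))     # wrapped in a cast
print(cast_if_needed(resolve(Column("flag"), schema)))  # left unchanged
```

The trade-off is that this check has to run after name resolution, whereas an unconditional cast can be added during parsing.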

rxin (Contributor) commented Jun 20, 2014

That's definitely a bug - I will take a look at it later.

willb (Contributor, Author) commented Jun 20, 2014

Thanks! I'm happy to put together a preliminary patch as well, but probably won't be able to take a look until tomorrow morning.

rxin (Contributor) commented Jun 20, 2014

I found the issue and fixed it. Will push out a pull request soon.

If you can just add the boolean cast (always add it - no need to check if the type is already boolean since once I fix the bug, the extra cast on boolean value will be removed), that'd be great.

rxin (Contributor) commented Jun 20, 2014

Here's the patch: #1144

rxin (Contributor) commented Jun 20, 2014

BTW I really want this to go into 1.0.1, which will probably have a release candidate soon. So if you have a chance to rebase your PR and add the cast, please do. Thanks a lot, @willb!

willb (Contributor, Author) commented Jun 20, 2014

Thanks for the quick review and patch, @rxin!

@AmplabJenkins

Merged build finished. All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15959/

yhuai (Contributor) commented Jun 20, 2014

I tried having.q from Hive and got an error when running SELECT key FROM src GROUP BY key HAVING max(value) > "val_255". The reason is that the output of an Aggregate only has selectExpressions, so the HAVING filter cannot see max(value), which is not in the select list.
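
One way to picture what such a query needs is to compute the extra aggregate alongside the select list, filter on it, and then project it away. The following is a minimal Python sketch over invented rows, not Spark's planner and not a description of how the issue was eventually fixed; the function name `group_having_max` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def group_having_max(rows, threshold):
    """Model of: SELECT key FROM rows GROUP BY key HAVING max(value) > threshold."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Step 1: aggregate, carrying max(value) as an extra column next to key.
    aggregated = [(key, max(values)) for key, values in groups.items()]
    # Step 2: filter on the extra column.
    kept = [(key, mx) for key, mx in aggregated if mx > threshold]
    # Step 3: project the extra column away so only the select list remains.
    return sorted(key for key, _ in kept)

# Invented sample rows (key, value); values compare as strings.
rows = [(1, "val_100"), (1, "val_300"), (2, "val_200"), (3, "val_256")]
print(group_having_max(rows, "val_255"))  # [1, 3]
```

If step 1 produced only the select list (just `key`), step 2 would have nothing to filter on, which matches the error described above.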

rxin (Contributor) commented Jun 20, 2014

I'm going to merge this in master & branch-1.0. I will create a separate ticket to track progress on HAVING. Basically there are two things missing:

  1. HAVING without GROUP BY should just become a normal WHERE
  2. HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. This test contains that: https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
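
Item 1 above can be sketched with a toy plan builder. This is an illustrative Python model only: `Relation`, `Aggregate`, `Filter`, and `plan_query` are hypothetical names, conditions are kept as opaque strings, and the sketch does not cover queries whose select list itself contains aggregates.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Relation:
    name: str

@dataclass(frozen=True)
class Aggregate:
    grouping: Tuple[str, ...]
    child: object

@dataclass(frozen=True)
class Filter:
    condition: str
    child: object

def plan_query(table, group_by=(), where=None, having=None):
    """Build a toy logical plan bottom-up: scan, WHERE, GROUP BY, HAVING."""
    plan = Relation(table)
    if where is not None:
        plan = Filter(where, plan)
    if group_by:
        plan = Aggregate(tuple(group_by), plan)
    if having is not None:
        # With GROUP BY this filters the aggregate's output;
        # without it, the filter sits directly on the rows, like WHERE.
        plan = Filter(having, plan)
    return plan

# Without GROUP BY, HAVING and WHERE yield the same plan in this model.
print(plan_query("src", having="key > 10") == plan_query("src", where="key > 10"))
```

In this model, no special error path is needed for HAVING without GROUP BY; the same filter node is simply placed over a different child.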

asfgit pushed a commit that referenced this pull request Jun 20, 2014
This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations.  The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`.

Author: William Benton <[email protected]>

Closes #1136 from willb/SPARK-2180 and squashes the following commits:

3bbaf26 [William Benton] Added casts to HAVING expressions
83f1340 [William Benton] scalastyle fixes
18387f1 [William Benton] Add test for HAVING without GROUP BY
b880bef [William Benton] Added semantic error for HAVING without GROUP BY
942428e [William Benton] Added test coverage for SPARK-2180.
56084cc [William Benton] Add support for HAVING clauses in Hive queries.

(cherry picked from commit 171ebb3)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 171ebb3 Jun 20, 2014
willb (Contributor, Author) commented Jun 20, 2014

@rxin, re: the former, seems like most implementations signal this as an error.

rxin (Contributor) commented Jun 20, 2014

There are databases that support that, and it seems to me a very simple change (actually just removing the check code you added is probably enough).

rxin (Contributor) commented Jun 20, 2014

BTW two follow up tickets created:

https://issues.apache.org/jira/browse/SPARK-2225

https://issues.apache.org/jira/browse/SPARK-2226

Let me know if you'd like to work on them.

willb (Contributor, Author) commented Jun 20, 2014

OK, I wasn't sure if strict Hive compatibility was the goal. I'm happy to take these tickets. Thanks again!

rxin (Contributor) commented Jun 20, 2014

I actually did 2225 already. I will assign 2226 to you. Thanks!

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…ute URI: ${system:user.name%7D (apache#1136)

Co-authored-by: Egor Krivokon <>