Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17712][SQL] Fix invalid pushdown of data-independent filters beneath aggregates #15289

Closed
wants to merge 2 commits into from

Conversation

JoshRosen
Copy link
Contributor

What changes were proposed in this pull request?

This patch fixes a minor correctness issue impacting the pushdown of filters beneath aggregates. Specifically, if a filter condition references no grouping or aggregate columns (e.g. WHERE false) then it would be incorrectly pushed beneath an aggregate.

Intuitively, the only case where you can push a filter beneath an aggregate is when that filter is deterministic and is defined over the grouping columns / expressions, since in that case the filter is acting to exclude entire groups from the query (like a HAVING clause). The existing code would only push deterministic filters beneath aggregates when all of the filter's references were grouping columns, but this logic missed the case where a filter has no references. For example, WHERE false is deterministic but is independent of the actual data.

This patch fixes this minor bug by adding a new check to ensure that we don't push filters beneath aggregates when those filters don't reference any columns.

How was this patch tested?

New regression test in FilterPushdownSuite.

@@ -710,7 +710,7 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {

val (pushDown, rest) = candidates.partition { cond =>
val replaced = replaceAlias(cond, aliasMap)
replaced.references.subsetOf(aggregate.child.outputSet)
cond.references.nonEmpty && replaced.references.subsetOf(aggregate.child.outputSet)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use replaced.references here then this will break the "aggregate: push down filters with literal" test case. Here's that test:

  test("aggregate: push down filters with literal") {
    val originalQuery = testRelation
      .select('a, 'b)
      .groupBy('a)('a, count('b) as 'c, "s" as 'd)
      .where('c === 2L && 'd === "s")

    val optimized = Optimize.execute(originalQuery.analyze)

    val correctAnswer = testRelation
      .where("s" === "s")
      .select('a, 'b)
      .groupBy('a)('a, count('b) as 'c, "s" as 'd)
      .where('c === 2L)
      .analyze

    comparePlans(optimized, correctAnswer)
  }

If you use replaced.references then the 'd in the filter is replaced with "s" and the resulting expression no longer contains any references and thus isn't pushed.

@hvanhovell
Copy link
Contributor

LGTM - pending Jenkins.

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66074 has finished for PR 15289 at commit 87504e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

hvanhovell commented Sep 29, 2016

Merging to master. Thanks!

@hvanhovell
Copy link
Contributor

@JoshRosen I cannot cherry-pick this one into 2.0. Could you open a PR against 2.0 I feel you that this is required?

@asfgit asfgit closed this in 37eb918 Sep 29, 2016
@ioana-delaney
Copy link
Contributor

@JoshRosen Wouldn't be a better design to push down the predicate but keep the original predicate as well? If the aggregate is above a complex join, not pushing down the predicate may have significant performance implications.

@JoshRosen
Copy link
Contributor Author

@ioana-delaney, to clarify, are you suggesting that there are predicates which we could correctly be pushing down past aggregates but are not pushing? I believe that the current logic encompasses all of the cases that are safe to push.

Or are you suggesting that we're affecting correctness by not keeping the original predicate following the pushdown?

asfgit pushed a commit that referenced this pull request Sep 29, 2016
…eneath aggregates

## What changes were proposed in this pull request?

This patch fixes a minor correctness issue impacting the pushdown of filters beneath aggregates. Specifically, if a filter condition references no grouping or aggregate columns (e.g. `WHERE false`) then it would be incorrectly pushed beneath an aggregate.

Intuitively, the only case where you can push a filter beneath an aggregate is when that filter is deterministic and is defined over the grouping columns / expressions, since in that case the filter is acting to exclude entire groups from the query (like a `HAVING` clause). The existing code would only push deterministic filters beneath aggregates when all of the filter's references were grouping columns, but this logic missed the case where a filter has no references. For example, `WHERE false` is deterministic but is independent of the actual data.

This patch fixes this minor bug by adding a new check to ensure that we don't push filters beneath aggregates when those filters don't reference any columns.

## How was this patch tested?

New regression test in FilterPushdownSuite.

Author: Josh Rosen <[email protected]>

Closes #15289 from JoshRosen/SPARK-17712.

(cherry picked from commit 37eb918)
Signed-off-by: Josh Rosen <[email protected]>
@JoshRosen JoshRosen deleted the SPARK-17712 branch September 29, 2016 19:12
@JoshRosen
Copy link
Contributor Author

@hvanhovell, I've cherry-picked this fix to branch-2.0.

@ioana-delaney
Copy link
Contributor

@JoshRosen The original predicate has to be kept above the aggregation. An optimization would be to also push down the predicate below the aggregation, lower in the plan for early filtering. In theory an always-false predicate should eliminate an entire sub-plan.

@ioana-delaney
Copy link
Contributor

@JoshRosen In your example, we don't want to first count one million rows coming from the base table and then to return zero rows based on the false predicate in the outer query block. Instead, by pushing down the predicate to the base table, you do a pre-filtering and return zero rows early in the plan. Then you apply the aggregate followed by the the original predicate that will do the final filtering. Anyway, just some thought for further optimizations of predicates pushed down through aggregation. Also, a more realistic query would imply false predicates through predicate transitivity e.g. a != b and a =1 and b =1 => 1 != 1 So there might be some real customer queries that can take advantage of these optimizations.

asfgit pushed a commit that referenced this pull request Jan 8, 2018
…ng expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen <[email protected]>

Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.

(cherry picked from commit 2c73d2a)
Signed-off-by: gatorsmile <[email protected]>
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 8, 2018
…ng expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see apache#15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen <[email protected]>

Closes apache#20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.
asfgit pushed a commit that referenced this pull request Jan 8, 2018
…ng expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen <[email protected]>

Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.

(cherry picked from commit 2c73d2a)
Signed-off-by: gatorsmile <[email protected]>
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…ng expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see apache#15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen <[email protected]>

Closes apache#20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.

(cherry picked from commit 2c73d2a)
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants