SQL: Fix SUM(all zeroes) to return 0 instead of NULL #65796

Merged
merged 7 commits into from
Dec 9, 2020

Conversation

palesz
Contributor

@palesz palesz commented Dec 3, 2020

Previously SUM(all zeroes) returned NULL. After this change, the SUM
SQL function call is automatically upgraded into a stats aggregation
instead of a sum aggregation. The stats aggregation only results in
NULL if there were no rows, that is, no values to aggregate, which is the
expected behaviour across different SQL implementations.

This is a workaround for #45251.
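
To illustrate the distinction the description relies on, here is a minimal Java sketch. It assumes that a sum-only result carries just a total, while a stats-like result also carries the number of non-null values. `SumNullSketch`, `StatsResult`, `sumOnly`, and `sqlSum` are hypothetical names made up for this illustration; they are not Elasticsearch classes or the code in this PR.

```java
import java.util.List;

// Hypothetical illustration only; none of these types exist in Elasticsearch.
final class SumNullSketch {

    // A sum-only result carries just the total, so "all nulls" and
    // "all zeroes" both come back as 0.0 and cannot be told apart.
    static double sumOnly(List<Double> values) {
        double sum = 0.0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
            }
        }
        return sum;
    }

    // A stats-like result also carries how many non-null values were seen.
    static final class StatsResult {
        final long count;
        final double sum;

        StatsResult(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    static StatsResult stats(List<Double> values) {
        long count = 0;
        double sum = 0.0;
        for (Double v : values) {
            if (v != null) {
                count++;
                sum += v;
            }
        }
        return new StatsResult(count, sum);
    }

    // SQL SUM semantics: NULL only when no non-null value was aggregated.
    static Double sqlSum(List<Double> values) {
        StatsResult s = stats(values);
        return s.count == 0 ? null : s.sum;
    }
}
```

With this sketch, `sumOnly` yields 0.0 for both a list of zeroes and a list of nulls, whereas `sqlSum` yields 0.0 for the former and `null` for the latter because the count tells the two cases apart.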

@palesz palesz added the >bug, :Analytics/SQL, v8.0.0, Team:QL (Deprecated), and v7.11.0 labels Dec 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-ql (Team:QL)

@palesz
Contributor Author

palesz commented Dec 3, 2020

#65792 is prerequisite of this PR.

Contributor

@astefan astefan left a comment

Left some comments and questions. Also, I'd like to see a test added to QueryTranslatorTests. Thanks.


aggregatingAllNullsWithCountStar
schema::COUNT_AllNulls:l
SELECT COUNT(*) as "COUNT_AllNulls" FROM logs WHERE bytes_out IS NULL;

We already have a test that deals with this scenario: SELECT COUNT(*) count FROM test_emp WHERE first_name IS NULL


aggregatingAllNullsWithSum
schema::SUM_AllNulls:i
SELECT SUM(bytes_out) as "SUM_AllNulls" FROM logs WHERE bytes_out IS NULL;

I would check whether adding or changing one of the entries in logs so that bytes_in is null would have a big impact on the existing tests. If not, I would make the change (either adding an entry or changing an existing one), and then a more complex query would be possible, such as SELECT bytes_in, SUM(bytes_in) as SUM_AllNulls, MIN(bytes_in), MAX(bytes_in), AVG(bytes_in) FROM logs WHERE bytes_in = 0 OR bytes_in IS NULL GROUP BY bytes_in.

@@ -0,0 +1,73 @@

aggregatingAllZerosWithFirst-Ignore

Why -Ignore? Also, why add the test as sql-spec if it already exists in .csv-spec?

Contributor

@matriv matriv left a comment

I also left a few comments. I agree with @astefan about also having a test in QueryTranslatorTests.

Member

@costin costin left a comment

The fix looks good; however, the testing needs cleaning up.

@costin
Member

costin commented Dec 3, 2020

> #65792 is prerequisite of this PR.

Why? That fix focused on PIVOT, this is a SUM.

@palesz
Contributor Author

palesz commented Dec 7, 2020

> > #65792 is prerequisite of this PR.
>
> Why? That fix focused on PIVOT, this is a SUM.

PIVOT is a GROUP BY with aggregations underneath. Without the #65792 change I cannot promote the SUM aggregation to stats inside PIVOT and we would end up with SUM returning 0 in the GROUP BY, but returning NULL inside the PIVOT.

I have two (+ one) options:

  1. Do #65792 (SQL: Enable the InnerAggregates inside PIVOT) before the SUM fix
  2. Don't do #65792 and fix SUM only inside GROUP BY (double workaround)
  3. Do not support SUM inside PIVOT (a breaking change that loses major PIVOT functionality)

@palesz palesz requested review from costin, matriv and astefan December 7, 2020 22:38
Contributor

@astefan astefan left a comment

LGTM. Left some minor comments, though.

Member

@costin costin left a comment

Left some comments but otherwise LGTM.

@palesz palesz requested review from costin and astefan December 8, 2020 21:13
Contributor

@astefan astefan left a comment

LGTM

public LogicalPlan apply(LogicalPlan plan) {
    final Map<Expression, Stats> statsPerField = new LinkedHashMap<>();

    plan.forEachExpressionsUp(e -> {

Not related to this PR, but I was wondering: most forEachExpressionsUp/Down method invocations do pattern matching as the first thing. Wouldn't an alternative method, similar to Node#forEachUp/Down and taking a type token, make sense?

Member

@costin costin Dec 9, 2020

The issue lies with collections. Expressions are used not just as individual nodes but also as properties. Take Project(List<NamedExpression> projections): this led to issues not only in filtering but also in reconstructing said collections with the new expressions while preserving their types. See the comment in LogicalPlan.doTransformExpression

It would be nicer to do:
plan.forEachExpressionsUp(s -> , Sum.class) instead of doing the instanceof check; however, the issue right now is preserving the type information before and after transformation without causing a CCE (ClassCastException).

That said, I plan to take another look at this to see whether it can be sorted out.
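
As a rough illustration of the type-token idea discussed above, the following Java sketch adds a `forEachUp(action, typeToken)` overload to a toy tree. Everything here (`TypeTokenSketch`, `Node`, `Field`, `Sum`, `Add`, `summedFields`) is a hypothetical stand-in, not the Elasticsearch `Node` or `LogicalPlan` API. The sketch only visits nodes and never rebuilds typed collections, so it does not address the type-preservation problem described in the reply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical miniature tree; the names echo the discussion but are not
// the Elasticsearch classes.
final class TypeTokenSketch {

    static class Node {
        final List<Node> children;

        Node(Node... children) {
            this.children = Arrays.asList(children);
        }

        // Visit every node bottom-up: children first, then this node.
        void forEachUp(Consumer<? super Node> action) {
            for (Node child : children) {
                child.forEachUp(action);
            }
            action.accept(this);
        }

        // Type-token variant: only instances of typeToken reach the action,
        // so the caller does not need its own instanceof check.
        <T extends Node> void forEachUp(Consumer<? super T> action, Class<T> typeToken) {
            forEachUp(n -> {
                if (typeToken.isInstance(n)) {
                    action.accept(typeToken.cast(n));
                }
            });
        }
    }

    static final class Field extends Node {
        final String name;

        Field(String name) {
            this.name = name;
        }
    }

    static final class Sum extends Node {
        Sum(Field field) {
            super(field);
        }

        Field field() {
            return (Field) children.get(0);
        }
    }

    static final class Add extends Node {
        Add(Node left, Node right) {
            super(left, right);
        }
    }

    // Collect the names of all fields wrapped in a Sum, in bottom-up order.
    static List<String> summedFields(Node root) {
        List<String> names = new ArrayList<>();
        root.forEachUp(s -> names.add(s.field().name), Sum.class);
        return names;
    }
}
```

The lambda passed in `summedFields` receives a `Sum` directly because the token fixes the type parameter. Doing the same for transformations, rather than plain visits, would still have to reconstruct collections such as a list of projections with the right element type.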

Member

@costin costin left a comment

LGTM

@palesz palesz merged commit b74792a into elastic:master Dec 9, 2020
@palesz palesz deleted the fix/sum-null branch December 9, 2020 17:21
palesz pushed a commit to palesz/elasticsearch that referenced this pull request Dec 9, 2020
Previously the SUM(all zeroes) was `NULL`, but after this change the SUM
SQL function call is automatically upgraded into a `stats` aggregation
instead of a `sum` aggregation. The `stats` aggregation only results in
`NULL` if there were no rows or no values (all nulls) to aggregate, which
is the expected behaviour across different SQL implementations.

This is a workaround for the issue elastic#45251. Once the results of the `sum`
aggregation can differentiate between `SUM(all nulls)` and
`SUM(all zeroes)`, the optimizer rule introduced in this commit needs to be
removed.

(cherry-picked from b74792a)
palesz pushed a commit that referenced this pull request Dec 9, 2020
Previously the SUM(all zeroes) was `NULL`, but after this change the SUM
SQL function call is automatically upgraded into a `stats` aggregation
instead of a `sum` aggregation. The `stats` aggregation only results in
`NULL` if there were no rows or no values (all nulls) to aggregate, which
is the expected behaviour across different SQL implementations.

This is a workaround for the issue #45251. Once the results of the `sum`
aggregation can differentiate between `SUM(all nulls)` and
`SUM(all zeroes)`, the optimizer rule introduced in this commit needs to be
removed.

(cherry-picked from b74792a)
@Luegg
Contributor

Luegg commented Jun 23, 2021

Closes #45251.

(see also #74396)

@Luegg Luegg linked an issue Jun 23, 2021 that may be closed by this pull request

Successfully merging this pull request may close these issues.

SQL: SUM of multiple 0 values returns NULL
8 participants