Optimize count agg expr with null column statistics #1063

matthewmturner · 2021-09-28T15:16:54Z

Which issue does this PR close?

Closes #904

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

matthewmturner · 2021-09-28T15:19:14Z

@Dandandan @rdettai FYI - I took a first cut at this. Still reviewing on my end - but wanted to at least open the PR in case either of you had specific thoughts.

Dandandan · 2021-09-28T15:19:38Z

datafusion/src/physical_optimizer/aggregate_statistics.rs

+                {
+                    return Some((
+                        ScalarValue::UInt64(Some((num_rows - val) as u64)),
+                        "COUNT(Uint8(1))".to_string(),


Suggested change

"COUNT(Uint8(1))".to_string(),

"COUNT(UInt8(1))".to_string(),

Shouldn't this be the column name instead of UInt8?

My naive interpretation was that UInt8(1) was a required placeholder to go with the num_rows value in the line before it. Meaning if the column was put there then it would actually perform the count calculation - which we dont want since were using stats.

Would def be good to get a better explanation though!

I think it should be the column name here indeed for it to work.
The current way hash aggregates work is by accessing the value of the aggregate in a projection by the name generated by the expression in the aggregate (e.g. COUNT(UInt8(1)) for count(*) but COUNT(col) for a non null count. So by rewriting and giving it the same name, this avoids computing the aggregate while still being compatible with the usage of the aggregate in the projection.

Dandandan · 2021-09-28T15:22:07Z

I think the approach looks good @matthewmturner

Code could be simplified a bit or take more cases into account (as mentioned in TODO)

matthewmturner · 2021-09-28T15:26:22Z

I think the approach looks good @matthewmturner

Code could be simplified a bit or take more cases into account (as mentioned in TODO)

To confirm, are you referring to the "Optimize with exprs other than Column"?

rdettai · 2021-09-30T14:46:51Z

It would be very reassuring if these optimizations came with a more systematic testing routine. I agree that unit tests in this precise scenario are not very powerful. Currently we have a test here: https://github.com/apache/arrow-datafusion/blob/558d8ecfe2d1b0737de1e2548c5fbedc0ea17ada/datafusion/tests/sql.rs#L3020

that ensures that the count optimizer rule kicks in properly with EXPLAIN [ANALYZE]. What do you think about creating a separate integration test file called optimizers.rs that runs SQL queries and checks if they were optimized as expected?

matthewmturner · 2021-10-01T15:48:36Z

It would be very reassuring if these optimizations came with a more systematic testing routine. I agree that unit tests in this precise scenario are not very powerful. Currently we have a test here:

https://github.com/apache/arrow-datafusion/blob/558d8ecfe2d1b0737de1e2548c5fbedc0ea17ada/datafusion/tests/sql.rs#L3020

that ensures that the count optimizer rule kicks in properly with EXPLAIN [ANALYZE]. What do you think about creating a separate integration test file called optimizers.rs that runs SQL queries and checks if they were optimized as expected?

I'm happy to do this.

Do you think better to make that file on a separate PR or maybe i could just add it for these count optimizations for now and a separate PR for the other optimizations?

rdettai · 2021-10-04T07:43:33Z

Thanks for your motivation! In this PR only tests for this feature are required.

matthewmturner · 2021-10-08T05:31:08Z

@rdettai @Dandandan sry for delay on this. ive been on vacation with very limited internet / computer access. Will pick this up early next week when im back.

rdettai · 2021-10-08T10:03:11Z

It seems that something strange happened with your git modules, hence the modifications that are shown in the diff for parquet-testing and testing.

matthewmturner · 2021-10-12T21:26:35Z

@rdettai @Dandandan - I got this passing CI. Will start looking into tests I can do for this feature now. I'm going to need to study the structure a bit though to figure out the right way to do it - if you have anything specific in mind let me know.

matthewmturner · 2021-10-13T16:46:33Z

@Dandandan @rdettai FYI I added a few tests.

Dandandan · 2021-10-14T05:38:19Z

datafusion/src/physical_optimizer/aggregate_statistics.rs

+        let conf = ExecutionConfig::new();
+        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;
+
+        assert!(optimized.as_any().is::<ProjectionExec>());


Maybe add a comment here that the added ProjectionExec is a sign the optimization was applied.

Sure - added it.

matthewmturner · 2021-10-14T14:09:34Z

@Dandandan @rdettai can you provide some color on why we have different return expressions for the take_optimizable_count and take_optimizable_count_with_nulls?

Specifically, the expression for take_optimizable_count is "COUNT(Uint8(1))" and for take_optimizable_count_with_nulls is "COUNT({})", col_expr.name()"

rdettai · 2021-10-15T13:10:07Z

I'm not sure this is a very well defined thing in the SQL standards, but usually in query engines, projected columns get default names. For instance the result value of SELECT sum(a) FROM table will be in a column named SUM(a) (or something like that).

If you start up the Datafusion CLI as explained in datafusion-cli/README.md, you will get:

> SELECT count(*) FROM foo;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 1               |
+-----------------+

> SELECT min(a) FROM foo;
+------------+
| MIN(foo.a) |
+------------+
| 1          |
+------------+

So what would you expect the column name to be for SELECT count(a) FROM foo;?

Also, it would be pretty nice if the response column names were the same whether the plan went through an optimization or not 😄

Side note: I checked on Athena (if I remember correctly you are used to that engine), and their choice for example is to simply name the column _col0.

matthewmturner · 2021-10-15T14:22:46Z

@rdettai thank you for the explanation - it makes sense and you remember correctly :)

I can create a separate issue to look into the column names with / without optimization point.

houqp · 2021-10-16T04:34:23Z

I think the following two specs are related to the discussion here:

https://arrow.apache.org/datafusion/specification/invariants.html#physical-schema-is-invariant-under-physical-optimization
https://arrow.apache.org/datafusion/specification/output-field-name-semantic.html

Note that the existing specs are not set in stone, so we can propose changes to them if we can back them with good reasons.

matthewmturner · 2021-10-20T14:17:17Z

@Dandandan @rdettai anything else needed on this PR?

rdettai · 2021-10-21T14:38:46Z

datafusion/src/physical_optimizer/aggregate_statistics.rs

@@ -311,6 +372,33 @@ mod tests {
        Ok(())
    }

+    #[tokio::test]
+    async fn test_count_partial_with_nulls_direct_child() -> Result<()> {


this is not testing the code that you have added, it tests that take_optimizable_count also works if there are nulls in the source dataset.

Yes, thx for picking that up. Looking into it.

rdettai

Sorry for the delay. If I understand correctly, you have copied all the existing tests, just changing the datasource for one that has nulls in it:

this is very verbose
it isn't really testing anything new
it is not testing the code you have added

These optimizations only kick in in corner cases. My feeling is that they are only worth it if we manage to write them in a clean and well tested fashion, otherwise we take the benefit/risk ratio might quickly become bad 😃. Of course, @Dandandan might have another opinion as he opened the issue initially.

matthewmturner · 2021-10-21T15:13:00Z

Sorry for the delay. If I understand correctly, you have copied all the existing tests, just changing the datasource for one that has nulls in it:
* this is very verbose

* it isn't really testing anything new

* it is not testing the code you have added
These optimizations only kick in in corner cases. My feeling is that they are only worth it if we manage to write them in a clean and well tested fashion, otherwise we take the benefit/risk ratio might quickly become bad 😃. Of course, @Dandandan might have another opinion as he opened the issue initially.

@rdettai thank you for the feedback. It seems I underestimated the task. I will check it out more on my side - of course any specific guidance welcome as well :)

One follow up question, on your point that these optimizations only kick in in corner cases. Can you just clarify how a count or count on data with nulls is considered a corner case? I would have thought that would be considered a standard / common scenario.

Thx again.

matthewmturner · 2021-10-21T18:34:05Z

@rdettai @Dandandan one follow up question I realize I should have asked before. Was the 'take_optimized_count' meant as an optimization for COUNT(*) and the new one I added for COUNT(col)?

matthewmturner · 2021-10-26T14:46:31Z

@rdettai @Dandandan thank you for your patience on this. I've made a number of updates - would you be able to check it out? Regarding the point on taking into account more cases - should that be done in a separate PR and done in conjunction with updating take_optimizable_min and take_optimizable_max so that these optimizations are all aligned?

matthewmturner · 2021-11-03T14:12:09Z

@Dandandan @rdettai do you have any thoughts on this?

matthewmturner · 2021-11-09T18:13:12Z

@alamb do you have any thoughts on this PR / how to move forward?

alamb · 2021-11-10T14:34:02Z

Sorry for the delay @matthewmturner -- let me look more carefully at this PR. Note I think @rdettai is out this week and next which may explain the slow response. I have no such excuse however :)

alamb

I ran out of time to do a more thorough review, but I left some hints -- hopefully they make sense @matthewmturner ?

alamb · 2021-11-10T14:58:33Z

datafusion/src/physical_optimizer/aggregate_statistics.rs

        let conf = ExecutionConfig::new();
        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;

+        let (col, count) = match nulls {
+            false => (Field::new("COUNT(Uint8(1))", DataType::UInt64, false), 3),


I don't understand why the column name output is different for columns with NULLs (and columns that don't have nulls)

I think the difference is if the aggregate is COUNT(*) --> COUNT(UInt8(1)) and COUNT(col) --> COUNT(col)

This can be seen in the datafusion-cli on master

❯ create table foo as select * from (values (1, NULL), (2, 2), (3,3)) as sq; 0 rows in set. Query took 0.007 seconds. ❯ select * from foo; +---------+---------+ | column1 | column2 | +---------+---------+ | 1 | | | 2 | 2 | | 3 | 3 | +---------+---------+ 3 rows in set. Query took 0.003 seconds. ❯ select count(column1) from foo; +--------------------+ | COUNT(foo.column1) | +--------------------+ | 3 | +--------------------+ 1 row in set. Query took 0.004 seconds. ❯ select count(column2) from foo; +--------------------+ | COUNT(foo.column2) | +--------------------+ | 2 | +--------------------+ 1 row in set. Query took 0.004 seconds. ❯ select count(*) from foo; +-----------------+ | COUNT(UInt8(1)) | +-----------------+ | 3 | +-----------------+ 1 row in set. Query took 0.002 seconds.

i think this is just a case of a poorly named variable. it should really be something like col_count i.e. if the test is on a table count or column count. when using datafusion-cli on this branch i get the same output as what you showed. i will update.

alamb · 2021-11-10T15:00:45Z

datafusion/src/physical_optimizer/aggregate_statistics.rs

+        &stats.column_statistics,
+        agg_expr.as_any().downcast_ref::<expressions::Count>(),
+    ) {
+        if casted_expr.expressions().len() == 1 {


it looks like this code handles count(col) whereas the code above only handles count(*) -- that seems strange -- perhaps we should update it so both can handle count(col) and count(*)?

My understanding is that COUNT(*) doesnt need to have a separate handler for nulls - assuming we expect same behavior as psql. For example in psql when i do the following:

postgres=# create table foo as select * from (values (1,NULL),(NULL,2),(3,3)) as sq; SELECT 3 postgres=# select * from foo; column1 | column2 ---------+--------- 1 | | 2 3 | 3 (3 rows) postgres=# select count(*) from foo; count ------- 3 (1 row)

Does it make sense to reframe these optimizations as the following:
take_optimizable_table_count (current take_optimizable_count)=> comes from COUNT(*) and returns num_rows
take_optimizable_column_count (current take_optimizable_count_with_nulls) => comes from COUNT(col) and return num_rows - null_count for col

I think those names make more sense to me

Ok - ive updated. Let me know if anything else needed.

matthewmturner · 2021-11-10T18:01:55Z

@alamb no apology necessary, thank you for the feedback. i will look into your comments.

alamb

@matthewmturner something seems to have gone wrong with the most recent merge (it has 10k lines added 🤔 )

I am not sure what is going on here.

Perhaps it is time to rebase this PR (I am happy to help do so) -- this PR has been hanging out for so long I want to help get it in

rdettai

I think this code will need some refactoring before being able to cover more cases of aggregation simplification, but this is good enough for now.

Dandandan · 2021-11-16T16:34:46Z

datafusion/src/physical_optimizer/aggregate_statistics.rs

        let conf = ExecutionConfig::new();
        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;

+        let (col, count) = match nulls {
+            false => (Field::new("COUNT(Uint8(1))", DataType::UInt64, false), 3),


Suggested change

false => (Field::new("COUNT(Uint8(1))", DataType::UInt64, false), 3),

false => (Field::new("COUNT(UInt8(1))", DataType::UInt64, false), 3),

matthewmturner · 2021-11-16T17:03:48Z

I think this code will need some refactoring before being able to cover more cases of aggregation simplification, but this is good enough for now.

@rdettai are you referring to handling more than just column expressions? if so, i can do a follow on pr for updating this and the min/max optimizations (which this was based on).

alamb

Looks good to me -- thank you for sticking with it @matthewmturner !

github-actions bot added the datafusion Changes in the datafusion crate label Sep 28, 2021

Dandandan reviewed Sep 28, 2021

View reviewed changes

houqp added the performance Make DataFusion faster label Sep 28, 2021

Dandandan reviewed Oct 14, 2021

View reviewed changes

matthewmturner mentioned this pull request Oct 14, 2021

Improve testing of optimizers using EXPLAIN #1118

Closed

matthewmturner mentioned this pull request Oct 15, 2021

Ensure column names are equivalent with or without optimization #1123

Closed

rdettai reviewed Oct 21, 2021

View reviewed changes

alamb reviewed Nov 10, 2021

View reviewed changes

github-actions bot added ballista documentation Improvements or additions to documentation sql SQL Planner labels Nov 15, 2021

alamb reviewed Nov 15, 2021

View reviewed changes

matthewmturner added 15 commits November 16, 2021 10:33

Add prints

bf0be0a

First cut at optimizing count with nulls

ca4e1c8

Cleanup old changes

472c80a

Uint to UInt

ed7c838

COUNT with col name

ad5311c

Fix count_with_nulls and submodules

3e28a0f

Fix submodules

597338b

Fix testing submodule

fb683c6

Add tests

6dd40ea

Add comment on test and removed redundant test function

df032a2

Update testing for count nulls

c73d6bc

Update inexact stats test

62e0eeb

Cleanup test utils

a47e812

Cleanup mock_data and prints

42c01d1

Update count optimization function names

1c65b6a

github-actions bot added the development-process Related to development process of DataFusion label Nov 16, 2021

matthewmturner force-pushed the optimize_count_with_null branch from d0cd143 to 1c65b6a Compare November 16, 2021 15:47

rdettai approved these changes Nov 16, 2021

View reviewed changes

Dandandan reviewed Nov 16, 2021

View reviewed changes

Update column name

5f8c9fa

alamb approved these changes Nov 17, 2021

View reviewed changes

alamb merged commit e213254 into apache:master Nov 17, 2021

	"COUNT(Uint8(1))".to_string(),
	"COUNT(UInt8(1))".to_string(),

	false => (Field::new("COUNT(Uint8(1))", DataType::UInt64, false), 3),
	false => (Field::new("COUNT(UInt8(1))", DataType::UInt64, false), 3),

Optimize count agg expr with null column statistics #1063

Optimize count agg expr with null column statistics #1063

Conversation

matthewmturner commented Sep 28, 2021

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

matthewmturner commented Sep 28, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan commented Sep 28, 2021

matthewmturner commented Sep 28, 2021

rdettai commented Sep 30, 2021 • edited Loading

matthewmturner commented Oct 1, 2021

rdettai commented Oct 4, 2021

matthewmturner commented Oct 8, 2021

rdettai commented Oct 8, 2021

matthewmturner commented Oct 12, 2021

matthewmturner commented Oct 13, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewmturner commented Oct 14, 2021

rdettai commented Oct 15, 2021

matthewmturner commented Oct 15, 2021

houqp commented Oct 16, 2021

matthewmturner commented Oct 20, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdettai left a comment

Choose a reason for hiding this comment

matthewmturner commented Oct 21, 2021

matthewmturner commented Oct 21, 2021 • edited Loading

matthewmturner commented Oct 26, 2021

matthewmturner commented Nov 3, 2021

matthewmturner commented Nov 9, 2021

alamb commented Nov 10, 2021

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewmturner Nov 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewmturner commented Nov 10, 2021

alamb left a comment

Choose a reason for hiding this comment

rdettai left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewmturner commented Nov 16, 2021 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

rdettai commented Sep 30, 2021 •

edited

Loading

matthewmturner commented Oct 21, 2021 •

edited

Loading

matthewmturner Nov 10, 2021 •

edited

Loading

matthewmturner commented Nov 16, 2021 •

edited

Loading