Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show statistics min and max in explain #8110

Closed
NGA-TRAN opened this issue Nov 9, 2023 · 2 comments · Fixed by #8112
Closed

Show statistics min and max in explain #8110

NGA-TRAN opened this issue Nov 9, 2023 · 2 comments · Fixed by #8112
Labels
enhancement New feature or request

Comments

@NGA-TRAN
Copy link
Contributor

NGA-TRAN commented Nov 9, 2023

Is your feature request related to a problem or challenge?

We want see statistics Min and Max in the query plan to verify whether the plan is optimized correctly. Currently, there are only statistics Rows and Bytes are displayed in the query plan. We want to add Min and Max.

Table for the reproducer

set datafusion.explain.show_statistics = true;


create table t1(state string, city string, min_temp float, area int, time timestamp) as values 
    ('MA', 'Boston', 70.4, 1, 50),
    ('MA', 'Bedford', 71.59, 2, 150),
    ('CA', 'SF', 79.0, 1, 300),
    ('MA', 'Boston', 75.4, 3, 250),
    ('MA', 'Andover', 69.5, 4, 250),
    ('MA', 'Bedford', 78.2, 2, 150),
    ('MA', 'Boston', 65.0, 2, 250),
    ('CA', 'SJ', 78.4, 1, 300),
    ('MA', 'Reading', 53.0, 4, 250),
    ('CA', 'SJ', 75.4, 5, 350);
explain select * from t1 where time <= to_timestamp(350);
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                        |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]               |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent]                      |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

Describe the solution you'd like

Note that Min and Max go with column. It will be clear if we show column name for the Min and max but since we do not store column names right now, we will also show a list of (min, max) pairs like

statistics=[Rows=Exact(10), Bytes=Exact(2960), [ (Min=Exact(1), Max=Exact(100)), (Min=Exact(..), Max=Exact(...)), ... ] ]

In the future if there is demand, we will add column names

Here is how they look in my local branch. I will open an PR for this soon

explain select * from t1 where time <= to_timestamp(350);
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                                                                                                                                                                                                                                                                                                                                                                              |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                                                                                                                                                                                                                                                          |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[3]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[4]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                        |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[3]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[4]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                               |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896), [(Column[0]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[3]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[4]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent)]] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.008 seconds.


❯ explain select state, min_temp from t1 where time <= to_timestamp(350);
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                              |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.state, t1.min_temp                                                                                                                                                                                                                                                                                 |
|               |   Filter: t1.time <= TimestampNanosecond(350000000000, None)                                                                                                                                                                                                                                                      |
|               |     TableScan: t1 projection=[state, min_temp, time]                                                                                                                                                                                                                                                              |
| physical_plan | ProjectionExec: expr=[state@0 as state, min_temp@1 as min_temp], statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                                                                    |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                    |
|               |     FilterExec: time@2 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                           |
|               |       MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896), [(Column[0]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent)]] |
|               |                                                                                                                                                                                                                                                                                                                   |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Describe alternatives you've considered

No response

Additional context

No response

@NGA-TRAN NGA-TRAN added the enhancement New feature or request label Nov 9, 2023
@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Nov 9, 2023

@alamb I am actively working on this

@alamb
Copy link
Contributor

alamb commented Nov 9, 2023

FYI @berkaysynnada who I think will also be interested in this feature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants