[minior fix]: adjust the projection statistics #7428

liukun4515 · 2023-08-28T02:09:34Z

Which issue does this PR close?

Closes #.

Rationale for this change

Compute total_byte_size from the primitive_size of each projected column's data type.
If the primitive_size of a projected column cannot be determined, use the child's statistics as the statistics of the projection.
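
A minimal sketch of the rule described above, assuming every output column's type is known up front. `ColType`, `primitive_width`, and `projected_byte_size` are hypothetical names introduced for illustration; they are not the types or functions changed in this PR. Falling back to the child's size when the row count is unknown is also an assumption of the sketch.

```rust
/// Hypothetical stand-in for a column data type (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int32,
    Int64,
    Float64,
    Utf8, // variable width, so it has no primitive size
}

impl ColType {
    /// Fixed width in bytes, or `None` for variable-width types.
    fn primitive_width(self) -> Option<usize> {
        match self {
            ColType::Int32 => Some(4),
            ColType::Int64 | ColType::Float64 => Some(8),
            ColType::Utf8 => None,
        }
    }
}

/// Estimate the output size of a projection in bytes.
/// If every output column has a primitive width and the row count is known,
/// the estimate is `row_width * num_rows`; otherwise the child's
/// `total_byte_size` is passed through unchanged.
fn projected_byte_size(
    output_types: &[ColType],
    num_rows: Option<usize>,
    child_total_byte_size: Option<usize>,
) -> Option<usize> {
    // Summing `Option`s yields `None` as soon as one column has no fixed width.
    let row_width: Option<usize> = output_types.iter().map(|t| t.primitive_width()).sum();
    match (row_width, num_rows) {
        (Some(width), Some(rows)) => Some(width * rows),
        _ => child_total_byte_size,
    }
}
```

With 10 rows and columns of type `Int32` and `Int64`, the sketch estimates 120 bytes; replacing one column with `Utf8` makes it return whatever size the child reported.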

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jackwener

Thanks @liukun4515

jackwener · 2023-08-28T02:23:37Z

datafusion/core/src/physical_plan/projection.rs

+                // TODO stats: knowing the type of the new columns we can guess the output size
+                // If we can't get the exact statistics for the project
+                // Before we get the exact result, we just use the child status


We can open a follow-up ticket about deriving statistics. We need a way to handle the Unknown statistics status of an expression.

I'm busy at the moment; if I have time, I will investigate this problem.

liukun4515 · 2023-08-28T03:06:02Z

Some plans changed; I will find time to check them.

liukun4515 · 2023-08-28T07:07:20Z

@jackwener PTAL again; the SQL test cases changed.
I think the changes are reasonable. After this change, the left and right projections have the statistics needed to determine the join order.

liukun4515 · 2023-08-28T07:10:31Z

datafusion/sqllogictest/test_files/joins.slt

----CoalesceBatchesExec: target_batch_size=4096
------HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(join_t1.t1_id + UInt32(12)@2, join_t2.t2_id + UInt32(1)@1)]
--------CoalescePartitionsExec
+----ProjectionExec: expr=[t1_id@2 as t1_id, t1_name@3 as t1_name, join_t1.t1_id + UInt32(12)@4 as join_t1.t1_id + UInt32(12), t2_id@0 as t2_id, join_t2.t2_id + UInt32(1)@1 as join_t2.t2_id + UInt32(1)]


After this PR, t1 and t2 have statistics in the ProjectionExec.

The t2 side projects t2_id, while the t1 side projects t1_id and t1_name, hence the cost of t1 is greater than that of t2.

The join collects the left side, so t2 is used as the build table.
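
The reasoning above can be sketched as a size comparison between the two join inputs. This is only an illustration: `InputStats` and `should_swap_inputs` are hypothetical names, and the rule shown (swap only when both sizes are known and the right side is strictly smaller) is an assumption, not a description of the optimizer rule in the codebase.

```rust
/// Hypothetical per-input statistics; only the field needed here.
#[derive(Debug, Clone, Copy)]
struct InputStats {
    total_byte_size: Option<usize>,
}

/// For a CollectLeft hash join the left input is the build side, so the
/// smaller input should end up on the left. Returns true when the inputs
/// should be swapped. If either size is unknown, the order is left alone.
fn should_swap_inputs(left: InputStats, right: InputStats) -> bool {
    match (left.total_byte_size, right.total_byte_size) {
        (Some(l), Some(r)) => r < l,
        _ => false,
    }
}
```

With made-up sizes of 800 bytes for the t1 projection on the left and 400 bytes for the t2 projection on the right, the function returns true, which corresponds to t2 becoming the build table. Before this PR the projections carried no size, which in this sketch corresponds to the `None` case and no swap.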

Dandandan · 2023-08-28T08:56:55Z

datafusion/core/src/physical_plan/projection.rs

+                // TODO stats: knowing the type of the new columns we can guess the output size
+                // If we can't get the exact statistics for the project
+                // Before we get the exact result, we just use the child status
+                total_byte_size: stats.total_byte_size,


If the type or output has changed, stats.is_exact will no longer be true (for total_byte_size).

I think they are different things.

stats.is_exact applies to the whole statistics object. Currently num_rows is exact.
Maybe we need to add an Unknown status for total_byte_size, so that we can simply set total_byte_size to None.

For example, in Presto:

PlanNodeStatsEstimate UNKNOWN represents `Unknown` for the whole statistics. SymbolStatsEstimate UNKNOWN represents `Unknown` for the statistics of a single expression.

Hm I think you're right.
is_exact is currently a bit confusing, though; I think it would be nice to have an exact/inexact specifier for both row and byte statistics.
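
One way to picture a per-field exact/inexact specifier is sketched below. The `Estimate` enum, the `Stats` struct, and `project_stats` are hypothetical and exist only to illustrate the idea discussed in this thread; they are not the statistics types used by the code under review.

```rust
/// Hypothetical per-field estimate carrying its own exactness.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Estimate {
    Exact(usize),
    Inexact(usize),
    Unknown,
}

impl Estimate {
    /// Downgrade an exact value to an inexact one; other variants are unchanged.
    fn to_inexact(self) -> Estimate {
        match self {
            Estimate::Exact(v) => Estimate::Inexact(v),
            other => other,
        }
    }
}

/// Hypothetical statistics object with exactness tracked per field
/// instead of through a single `is_exact` flag.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Stats {
    num_rows: Estimate,
    total_byte_size: Estimate,
}

/// A projection keeps the row count as is, but the child's byte size is
/// only a rough guide once the columns change, so it is marked inexact.
fn project_stats(child: Stats) -> Stats {
    Stats {
        num_rows: child.num_rows,
        total_byte_size: child.total_byte_size.to_inexact(),
    }
}
```

Under this design the row count can stay exact while the byte size is carried along as a hint, which avoids both dropping the size to None and overstating its accuracy.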

jackwener · 2023-08-28T12:39:32Z

datafusion/core/src/physical_plan/projection.rs

+                // TODO stats: knowing the type of the new columns we can guess the output size
+                // If we can't get the exact statistics for the project
+                // Before we get the exact result, we just use the child status
+                total_byte_size: stats.total_byte_size,


Suggested change

-                total_byte_size: stats.total_byte_size,
+                total_byte_size: None,

Do you mean we should use None or unknown as the total_byte_size if we can't estimate it?
But that would lose the size information; I mean using the child's statistics directly.

cc @jackwener

Yes, because using the child's statistics directly will cause inaccuracy.

I think we might need to add exactness information (is_exact) to total_byte_size as well, so that we can still propagate some (inaccurate) information about the size.

I think it is fine to leave it inaccurate for now, as AFAIK we don't rely on exact information about total_byte_size like we do for num_rows statistics.

I think we can't set total_byte_size to None or a default value if we can't get the exact information.
For now I just carry over the information from the child node; that is the best choice I can find.

cc @Dandandan @jackwener, any comments on this?
If there are no comments, I will merge this PR after 24h.

github-actions bot added the core Core DataFusion crate label Aug 28, 2023

adjust the projection statistics

5aaa677

liukun4515 force-pushed the fix_statistics_plan branch from 5dcb74a to 5aaa677 Compare August 28, 2023 02:13

liukun4515 requested review from jackwener and alamb August 28, 2023 02:13

jackwener approved these changes Aug 28, 2023

jackwener reviewed Aug 28, 2023

update the sql test case

df8ddab

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 28, 2023

liukun4515 commented Aug 28, 2023

Dandandan reviewed Aug 28, 2023

jackwener reviewed Aug 28, 2023

fix clippy

4a31ceb

Dandandan approved these changes Aug 30, 2023

liukun4515 merged commit 58fc80e into apache:main Aug 31, 2023
21 checks passed

liukun4515 mentioned this pull request Sep 14, 2023

Optimize FilterExec::statistics / don't ignore errors #7553

Closed
