Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[minior fix]: adjust the projection statistics #7428

Merged
merged 3 commits into from
Aug 31, 2023

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

  • support total_byte_size use the primitive_size of the datatype
  • if column of projection can't get the primitive_size, use the child statistic as statistics of the projection

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 28, 2023
Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liukun4515

Comment on lines +445 to +447
// TODO stats: knowing the type of the new columns we can guess the output size
// If we can't get the exact statistics for the project
// Before we get the exact result, we just use the child status
Copy link
Member

@jackwener jackwener Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add a future ticket about statistic derive. We need a method to handle Unknown statistic status of Expression

I'm busy recently, if I have time, I will investigate this problem.

@liukun4515
Copy link
Contributor Author

Some plan changes, I will find time to check them.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 28, 2023
@liukun4515
Copy link
Contributor Author

@jackwener PTAL again, the sql test cases changed.
I think the changes is reasonable. After this changes the left and right projection have the statistics to determine the order of join

----CoalesceBatchesExec: target_batch_size=4096
------HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(join_t1.t1_id + UInt32(12)@2, join_t2.t2_id + UInt32(1)@1)]
--------CoalescePartitionsExec
----ProjectionExec: expr=[t1_id@2 as t1_id, t1_name@3 as t1_name, join_t1.t1_id + UInt32(12)@4 as join_t1.t1_id + UInt32(12), t2_id@0 as t2_id, join_t2.t2_id + UInt32(1)@1 as join_t2.t2_id + UInt32(1)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this pr, the t1 and t2 has the statistics in the projection exe.

The table of t2 project the t2_id, the table of t1 project the t1_id and t1_name, hence the cost of t1 is greater than t2.

Collect the left in the join, use the t2 as the building table.

// TODO stats: knowing the type of the new columns we can guess the output size
// If we can't get the exact statistics for the project
// Before we get the exact result, we just use the child status
total_byte_size: stats.total_byte_size,
Copy link
Contributor

@Dandandan Dandandan Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If type or output has changed stats.is_exact will not be true anymore (for total_byte_size).

Copy link
Member

@jackwener jackwener Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are different things.

stats.is_exact is for the whole statistic. Current num_rows is exact.
Maybe we need add Unknow status for total_byte_size. so we just to make total_byte_size None

such as Presto:

PlanNodeStatsEstimate UNKNOWN. It's present `Unknown` for whole statistic.

SymbolStatsEstimate UNKNOWN. It's present `Unknown` for single expr statistic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I think you're right.
The is_exact is currently a bit confusing though, I think it would be nice if we have a exact/inexact specifier for both row/byte statistics.

// TODO stats: knowing the type of the new columns we can guess the output size
// If we can't get the exact statistics for the project
// Before we get the exact result, we just use the child status
total_byte_size: stats.total_byte_size,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
total_byte_size: stats.total_byte_size,
total_byte_size: None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you means we use the None or unknow as the total_byte_size, if we can't estimate the total_byte_size?
But it will lost the size information in side, I mean use the child's statistics directly.

cc @jackwener

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because it will cause inaccuracy to use child's statistics directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to add exactness information (is_exact) to total_byte_size as well to prevent not propagating some (inaccurate) information about the size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to leave it inaccurate for now, as AFAIK we don't rely on exact information about total_byte_size like we do for num_rows statistics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can't set the tota_byte_size to none or default value, if we can't get the exact information.
Now i just follow the information from the child node, that is the best choice i can find.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Dandandan @jackwener any comments for this
If there is no comments for this, i will merged this pr after 24h.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM❤️

@liukun4515 liukun4515 merged commit 58fc80e into apache:main Aug 31, 2023
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants