-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[minior fix]: adjust the projection statistics #7428
Conversation
5dcb74a
to
5aaa677
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liukun4515
// TODO stats: knowing the type of the new columns we can guess the output size | ||
// If we can't get the exact statistics for the project | ||
// Before we get the exact result, we just use the child status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add a future ticket about statistic derive
. We need a method to handle Unknown statistic
status of Expression
I'm busy recently, if I have time, I will investigate this problem.
Some plan changes, I will find time to check them. |
@jackwener PTAL again, the sql test cases changed. |
----CoalesceBatchesExec: target_batch_size=4096 | ||
------HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(join_t1.t1_id + UInt32(12)@2, join_t2.t2_id + UInt32(1)@1)] | ||
--------CoalescePartitionsExec | ||
----ProjectionExec: expr=[t1_id@2 as t1_id, t1_name@3 as t1_name, join_t1.t1_id + UInt32(12)@4 as join_t1.t1_id + UInt32(12), t2_id@0 as t2_id, join_t2.t2_id + UInt32(1)@1 as join_t2.t2_id + UInt32(1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after this pr, the t1 and t2 has the statistics in the projection exe.
The table of t2
project the t2_id, the table of t1
project the t1_id and t1_name, hence the cost of t1
is greater than t2
.
Collect the left in the join, use the t2
as the building table.
// TODO stats: knowing the type of the new columns we can guess the output size | ||
// If we can't get the exact statistics for the project | ||
// Before we get the exact result, we just use the child status | ||
total_byte_size: stats.total_byte_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If type or output has changed stats.is_exact
will not be true anymore (for total_byte_size
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are different things.
stats.is_exact
is for the whole statistic. Current num_rows
is exact.
Maybe we need add Unknow status for total_byte_size
. so we just to make total_byte_size
None
such as Presto:
PlanNodeStatsEstimate UNKNOWN. It's present `Unknown` for whole statistic.
SymbolStatsEstimate UNKNOWN. It's present `Unknown` for single expr statistic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I think you're right.
The is_exact
is currently a bit confusing though, I think it would be nice if we have a exact/inexact specifier for both row/byte statistics.
// TODO stats: knowing the type of the new columns we can guess the output size | ||
// If we can't get the exact statistics for the project | ||
// Before we get the exact result, we just use the child status | ||
total_byte_size: stats.total_byte_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_byte_size: stats.total_byte_size, | |
total_byte_size: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you means we use the None
or unknow
as the total_byte_size
, if we can't estimate the total_byte_size
?
But it will lost the size
information in side, I mean use the child's statistics directly.
cc @jackwener
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because it will cause inaccuracy to use child's statistics
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to add exactness information (is_exact) to total_byte_size
as well to prevent not propagating some (inaccurate) information about the size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to leave it inaccurate for now, as AFAIK we don't rely on exact information about total_byte_size
like we do for num_rows
statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can't set the tota_byte_size
to none
or default value
, if we can't get the exact information.
Now i just follow the information from the child node, that is the best choice i can find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @Dandandan @jackwener any comments for this
If there is no comments for this, i will merged this pr after 24h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM❤️
Which issue does this PR close?
Closes #.
Rationale for this change
total_byte_size
use the primitive_size of thedatatype
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?