Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is an umbrella issue to gather all improvements regarding statistics.
Describe the solution you'd like
The list below should probably be better prioritized:

- better validate that the `column_statistics` vector is aligned with the schema `fields` vector (same size, same types, ...) when constructing the `ExecutionPlan` instance (e.g. Adapt column statistics API #717)
- remove `total_byte_size` as we are not using it, OR better estimate it when we have both a fixed-size type and the `num_rows` for the output columns
- replace the `is_exact` field at the `Statistics` level with per-field information
- have more granularity in statistics than just `(value, is_exact)`: possible solutions are histograms (cf. Spark's CBO)
- fix the way `LocalLimitExec` propagates its inexact statistics (requires more granular statistics)
- estimate statistics in the CSV datasource
- estimate statistics in the JSON datasource
- better estimate the output statistics of `hash_aggregate`
- better estimate the output statistics of filters (requires more granular statistics, in particular histograms)
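To make two of the items above concrete (per-field exactness instead of a single `is_exact` flag, and limit propagation), here is a purely illustrative Rust sketch. None of these names are DataFusion's actual API; `Precision`, `PlanStatistics`, and `limit_statistics` are hypothetical:

```rust
/// Hypothetical per-value precision, replacing a single
/// Statistics-level `is_exact` flag: each estimate carries its own
/// exactness, so a node only downgrades the values it actually changes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),   // value is known exactly
    Inexact(T), // value is only an estimate
    Absent,     // no information available
}

impl<T> Precision<T> {
    /// Keep the value but drop the exactness claim.
    fn to_inexact(self) -> Precision<T> {
        match self {
            Precision::Exact(v) => Precision::Inexact(v),
            other => other,
        }
    }
}

#[derive(Debug, Clone)]
struct PlanStatistics {
    num_rows: Precision<usize>,
    total_byte_size: Precision<usize>,
}

/// How a per-partition limit could propagate its input statistics:
/// the row count stays exact only when the exact input count already
/// fits under the limit; otherwise it becomes an inexact bound, and
/// the byte size degrades to an estimate as well.
fn limit_statistics(input: &PlanStatistics, limit: usize) -> PlanStatistics {
    let num_rows = match input.num_rows {
        Precision::Exact(n) if n <= limit => Precision::Exact(n),
        // More than `limit` rows per partition were possible, so the
        // total output across partitions is only bounded, not known.
        Precision::Exact(_) => Precision::Inexact(limit),
        Precision::Inexact(n) => Precision::Inexact(n.min(limit)),
        Precision::Absent => Precision::Inexact(limit),
    };
    PlanStatistics {
        num_rows,
        total_byte_size: input.total_byte_size.to_inexact(),
    }
}
```

With this shape, a limit under an exact small input preserves exactness, while anything else is explicitly marked as an estimate instead of silently keeping a plan-wide `is_exact = true`.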
Additional context
Statistics are usually sourced at the datasource level, then propagated through the plan tree according to the types of nodes. They are used to choose between different logically equivalent plans or plan configurations. The more rules are implemented for propagating the statistics, the more information the optimizer will have to make good decisions. But at the same time, an overly complex abstraction that is not used by any optimization rule would bloat the code base and make it harder to maintain. For that reason, extensions of the statistics system should be driven by the addition of concrete optimization rules that require them.
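As an example of the kind of concrete rule that would justify richer statistics, here is a hedged Rust sketch (all names hypothetical, not existing DataFusion APIs) of an equi-height histogram, the Spark-CBO-style structure mentioned in the list above, used to estimate the selectivity of a `col < value` filter:

```rust
/// Hypothetical equi-height histogram: each bucket holds roughly the
/// same number of rows, so skewed data gets finer buckets where the
/// values are dense.
struct EquiHeightHistogram {
    /// Ascending bucket boundaries; length = num_buckets + 1.
    bounds: Vec<f64>,
    /// Approximate number of rows in each bucket.
    rows_per_bucket: f64,
}

impl EquiHeightHistogram {
    /// Estimated selectivity of the predicate `col < value`, assuming
    /// a uniform distribution of values inside each bucket.
    fn selectivity_less_than(&self, value: f64) -> f64 {
        let total = self.rows_per_bucket * (self.bounds.len() - 1) as f64;
        let mut rows = 0.0;
        for w in self.bounds.windows(2) {
            let (lo, hi) = (w[0], w[1]);
            if value >= hi {
                // The whole bucket satisfies the predicate.
                rows += self.rows_per_bucket;
            } else if value > lo {
                // Partial bucket: interpolate linearly within it.
                rows += self.rows_per_bucket * (value - lo) / (hi - lo);
            }
            // Buckets entirely above `value` contribute nothing.
        }
        rows / total
    }
}
```

A filter node could multiply such a selectivity by the input `num_rows` to produce an (inexact) output row count, which is exactly the kind of rule that would motivate storing histograms in column statistics in the first place.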