apache / datafusion Public

Improve statistics (umbrella issue) #997

Open

1 of 14 tasks

rdettai opened this issue Sep 13, 2021 · 0 comments


Labels

New feature or request


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is an umbrella issue to gather all improvements regarding statistics.

Describe the solution you'd like
The list below should probably be better prioritized:

- Moving cost based optimizations to physical planning #962
- Expressions should also evaluate on statistics #992
- Better validate that the column_statistics vector is aligned with the schema fields vector (same size, same types...) when constructing the ExecutionPlan instance (e.g. Adapt column statistics API #717)
- Remove total_byte_size, as we are not using it, OR estimate it better when we have both a fixed-size type and the num_rows for the output columns
- Replace the is_exact field at the Statistics level with per-field information
- Have more granularity in statistics than just (value, is_exact): one possible solution is histograms (cf. Spark CBOs)
- Fix the way LocalLimitExec propagates its inexact statistics (requires more granular statistics)
- Estimate statistics in the CSV datasource
- Estimate statistics in the JSON datasource
- Better estimate output statistics of hash_aggregate
- Better estimate output statistics of hash_join
- Better estimate output statistics of projection (requires Expressions should also evaluate on statistics #992)
- Better estimate output statistics of window_agg
- Better estimate output statistics of filters (requires more granular statistics, in particular histograms)
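
To make two of the items above more concrete (validating that column statistics line up with the schema, and estimating total_byte_size from fixed-size types and num_rows), here is a minimal sketch. All names in it (`ColType`, `ColStats`, `validate_alignment`, `estimate_total_byte_size`) are hypothetical and are not DataFusion's or Arrow's API. It also assumes that byte size is just row count times the sum of column widths, ignoring null bitmaps and padding.

```rust
/// Hypothetical, simplified column type (not Arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int32,
    Int64,
    Float64,
    Utf8, // variable width
}

impl ColType {
    /// Width in bytes for fixed-size types, `None` for variable-width ones.
    fn fixed_width(self) -> Option<usize> {
        match self {
            ColType::Int32 => Some(4),
            ColType::Int64 | ColType::Float64 => Some(8),
            ColType::Utf8 => None,
        }
    }
}

/// Hypothetical per-column statistics entry that records the type it describes.
#[derive(Debug, Clone)]
struct ColStats {
    col_type: ColType,
}

/// Checks that there is exactly one statistics entry per schema field
/// and that each entry describes the same type as the field.
fn validate_alignment(schema: &[ColType], stats: &[ColStats]) -> Result<(), String> {
    if schema.len() != stats.len() {
        return Err(format!(
            "expected {} column statistics, got {}",
            schema.len(),
            stats.len()
        ));
    }
    for (i, (field, stat)) in schema.iter().zip(stats).enumerate() {
        if *field != stat.col_type {
            return Err(format!(
                "column {}: schema type {:?} does not match statistics type {:?}",
                i, field, stat.col_type
            ));
        }
    }
    Ok(())
}

/// Estimates the output size as num_rows * (sum of fixed column widths).
/// Returns `None` when the row count is unknown, when any column has a
/// variable width, or when the multiplication would overflow.
fn estimate_total_byte_size(schema: &[ColType], num_rows: Option<usize>) -> Option<usize> {
    let rows = num_rows?;
    let mut row_width = 0usize;
    for col in schema {
        row_width += col.fixed_width()?;
    }
    rows.checked_mul(row_width)
}
```

Returning `None` rather than a guess for variable-width columns keeps the estimate honest: a consumer can tell "unknown" apart from "known to be small".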

Additional context
Statistics are usually sourced at the datasource level, then propagated through the plan tree according to the type of each node. They are used to choose between logically equivalent plans or plan configurations. The more propagation rules are implemented, the more information the optimizer has to make good decisions. At the same time, an overly complex abstraction that no optimization rule uses would bloat the code base and make it harder to maintain. For that reason, extensions of the statistics system should be driven by the addition of concrete optimization rules that require them.
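
As an illustration of statistics being sourced at a scan and propagated node by node, here is a rough sketch with per-field exactness. The `Estimate`, `Stats`, and `Plan` types and the `statistics` function are hypothetical and much simpler than DataFusion's ExecutionPlan. The limit is assumed to be a single global limit (not the per-partition LocalLimitExec), and the filter is assumed to have no selectivity estimate.

```rust
/// Hypothetical estimate carrying its own exactness flag (per-field, not per-Statistics).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Estimate {
    value: usize,
    is_exact: bool,
}

/// Hypothetical statistics record; only the row count is modelled here.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Stats {
    num_rows: Option<Estimate>,
}

/// Hypothetical, minimal plan tree.
enum Plan {
    Scan { stats: Stats },
    Limit { input: Box<Plan>, n: usize },
    Filter { input: Box<Plan> },
}

/// Propagates statistics bottom-up according to the type of each node.
fn statistics(plan: &Plan) -> Stats {
    match plan {
        // Statistics are sourced at the datasource level.
        Plan::Scan { stats } => *stats,
        Plan::Limit { input, n } => {
            let num_rows = match statistics(input).num_rows {
                // Output is min(input, n); exactness carries over from the input.
                Some(e) => Estimate { value: e.value.min(*n), is_exact: e.is_exact },
                // Unknown input: n is only an upper bound, so the result is inexact.
                None => Estimate { value: *n, is_exact: false },
            };
            Stats { num_rows: Some(num_rows) }
        }
        Plan::Filter { input } => {
            // Without a selectivity estimate, the input row count survives
            // only as an inexact upper bound.
            let num_rows = statistics(input)
                .num_rows
                .map(|e| Estimate { is_exact: false, ..e });
            Stats { num_rows }
        }
    }
}
```

With a single `(value, is_exact)` pair the filter can only degrade the input count to an inexact bound, which is the kind of case where the list above asks for more granular statistics such as histograms.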


rdettai added the enhancement New feature or request label

This was referenced Sep 13, 2021:

- Move CBOs and Statistics to physical plan #965 (Merged)
- Adapt column statistics API #717 (Open)

crepererum mentioned this issue:

- Statistics::is_exact semantics #5613 (Open)
