Feature: Trim Strings during the construction of min/max statistics #7829

Closed
Tracked by #7823
dantengsky opened this issue Sep 23, 2022 · 0 comments · Fixed by #7958
Labels
C-feature Category: feature

dantengsky commented Sep 23, 2022

Summary

While collecting the min/max values of columns, we keep the exact values. For string-like column types, the min/max values may be large (say, a column of type CHAR(4096)), which makes the meta files that contain the statistics large.

It would be better if we could trim the strings to some moderate length, say 8 chars, in a way that preserves the property of min/max statistics: the trimmed max must be greater than or equal to the untrimmed max, and the trimmed min must be less than or equal to the untrimmed min.
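One way this trimming could work is sketched below. This is only an illustration, not the implementation that landed in Databend: the function names `trim_min` and `trim_max` are hypothetical, and the sketch assumes values are compared bytewise and that the limit is counted in bytes rather than chars. The min is simply truncated, since a prefix never sorts after the full value. The max is truncated and then its last incrementable byte is bumped, so the result sorts after the original.

```rust
/// Hypothetical helper: shrink a column minimum to at most `limit` bytes.
/// A prefix never sorts after the full value, so the result is still a
/// valid lower bound.
fn trim_min(value: &[u8], limit: usize) -> Vec<u8> {
    value[..value.len().min(limit)].to_vec()
}

/// Hypothetical helper: shrink a column maximum to at most `limit` bytes.
/// Returns `None` when no shorter upper bound exists (the kept prefix is
/// empty or consists only of 0xFF bytes); the caller could then keep the
/// original value or drop the statistic.
fn trim_max(value: &[u8], limit: usize) -> Option<Vec<u8>> {
    if value.len() <= limit {
        // Already short enough, keep the exact value.
        return Some(value.to_vec());
    }
    let mut prefix = value[..limit].to_vec();
    // Trailing 0xFF bytes cannot be incremented, so drop them.
    while prefix.last() == Some(&0xFF) {
        prefix.pop();
    }
    // Bump the last remaining byte so the prefix sorts after `value`.
    let last = prefix.last_mut()?;
    *last += 1;
    Some(prefix)
}
```

With a limit of 8, the max `abcdefghij` would become `abcdefgi` and the min `abcdefghij` would become `abcdefgh`. A char-based limit, as suggested above, would additionally have to respect UTF-8 character boundaries.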

Thus, with some loss of accuracy (pruning becomes slightly more likely to yield false positives, i.e. to keep a block that holds no matching rows, which IMO we can afford), the size of fuse table meta files could be reduced.
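To make the trade-off concrete, here is a small sketch of a range check against exact and trimmed statistics. The function `may_contain` is a hypothetical stand-in for the pruning check, and the byte strings are made-up sample values.

```rust
/// Hypothetical pruning check: a block can be skipped for an equality
/// predicate when `probe` falls outside the block's [min, max] range.
fn may_contain(min: &[u8], max: &[u8], probe: &[u8]) -> bool {
    min <= probe && probe <= max
}

/// Compare the outcome for exact and trimmed stats of the same block.
fn demo() -> (bool, bool) {
    let probe = b"abcdefghz";
    // Exact stats: the probe sorts after the real max, so the block is pruned.
    let exact = may_contain(b"abcdefghaa", b"abcdefghij", probe);
    // Trimmed stats (8 bytes): the widened range no longer excludes the probe.
    let trimmed = may_contain(b"abcdefgh", b"abcdefgi", probe);
    (exact, trimmed)
}
```

Here the trimmed statistics keep a block that the exact statistics would have pruned. The result is still correct because the block is read and filtered as usual; only some pruning efficiency is lost.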

Where the min/max values are gathered:

https://github.com/datafuselabs/databend/blob/dae90d856e380ea29716e87148cc69d07ccff8ff/src/query/storages/fuse/src/statistics/column_statistic.rs#L45-L53

Update (2022-09-29):

Also, for column types like variant, we should not keep min/max stats.

We have not generated min/max stats for them...
