Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: feed estimated number of distinct values to bloom filter #7825

Closed
Tracked by #7823
dantengsky opened this issue Sep 23, 2022 · 2 comments
Closed
Tracked by #7823

BUG: feed estimated number of distinct values to bloom filter #7825

dantengsky opened this issue Sep 23, 2022 · 2 comments
Labels
C-feature Category: feature

Comments

@dantengsky
Copy link
Member

dantengsky commented Sep 23, 2022

Summary

currently, the "number of distinct values" we feed to the bloom filter is the number of rows, which is too conservative (naive:).
we should use something like hyperloglog to estimate the NDV of give column and then feed the NDV to the bloom filter, for columns that have low cardinality, this will "shrink" the bloom filter index significantly.

https://github.com/datafuselabs/databend/blob/f59efc05bc080e762846fbe8bac24c806e432a75/src/query/storages/index/src/bloom_filter.rs#L151-L162

may also be related to #7314

@dantengsky dantengsky added the C-feature Category: feature label Sep 23, 2022
@BohuTANG
Copy link
Member

This may make the bloom filter bitset smaller.
ClickHouse also has a bloom filter index, it seems to work like databend, I haven't checked his bitset size yet.
cc @drmingdrmer

@dantengsky
Copy link
Member Author

vanilla bloom filter is replaced by xor filter in PR #7870

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: feature
Projects
None yet
Development

No branches or pull requests

2 participants