feat: fuse block file cache #8647

Open
Tracked by #7823
BohuTANG opened this issue Nov 5, 2022 · 2 comments
Labels
C-feature Category: feature

Comments

@BohuTANG (Member) commented Nov 5, 2022

Summary

Currently, Databend caches the following in memory:

  1. snapshot file
  2. segment file
  3. index file

All of these items are small, so they fit well in memory.

To see how the Databend cache works, consider an example.
Q1:

select id, name, age, city from t1 where age > 20 and age < 30;

The process looks like this:

  1. Read the latest snapshot file -- cached in memory
  2. Prune the segment files and read them from S3 -- cached in memory
  3. Read the id, name, age, and city columns from S3 (in Databend, each column file is called a block and is in Parquet format) -- not cached

If we then run another query, Q2:

select id, name, age, city from t1 where age > 25 and age < 30;

The process is as follows:

  1. Read the latest snapshot file from memory
  2. Prune the segment files, reading them from memory
  3. Read the id, name, age, and city columns (blocks in Parquet format) from S3 -- not cached

If we also cache the column Parquet files in step 3, Q2 will avoid the file reads from S3 entirely and run faster, as sketched below.
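
To make step 3 concrete, here is a minimal sketch (hypothetical names and object paths; not Databend's actual read path) of what caching block files would change: the reader first consults a local cache keyed by the block's object path, and only falls back to S3 on a miss, so Q2's block reads become cache hits.

```rust
use std::collections::HashMap;

/// Stand-in for the object-store client (e.g. an S3 reader); hypothetical.
fn fetch_from_s3(path: &str) -> Vec<u8> {
    println!("reading {path} from s3");
    vec![0u8; 16] // pretend this is the block's Parquet bytes
}

struct BlockReader {
    cache: HashMap<String, Vec<u8>>, // block path -> Parquet bytes
}

impl BlockReader {
    fn read_block(&mut self, path: &str) -> &Vec<u8> {
        // On a miss, fetch from S3 and populate the cache; on a hit,
        // return the cached bytes and skip the S3 round-trip.
        self.cache
            .entry(path.to_string())
            .or_insert_with(|| fetch_from_s3(path))
    }
}

fn main() {
    let mut reader = BlockReader { cache: HashMap::new() };
    // Hypothetical block path, for illustration only.
    reader.read_block("t1/_b/block_0001.parquet"); // Q1: fetched from s3
    reader.read_block("t1/_b/block_0001.parquet"); // Q2: served from cache
}
```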

Update:
Each column file (a.k.a. block file) in Databend is a Parquet file with a single RowGroup, and the range read covers the RowGroup data, i.e. the entire Parquet file except the footer; a sketch of computing that range follows.
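
As an illustration, here is a minimal sketch of computing that range, assuming only the standard Parquet trailer layout (footer bytes, then a 4-byte little-endian footer length, then the magic "PAR1"). The function name and signature are made up for this example and are not Databend's reader API.

```rust
/// Given the file length and its last 8 bytes, return the byte range
/// covering everything except the footer (the range the cache would read).
fn block_data_range(file_len: u64, trailing_8: &[u8; 8]) -> Option<std::ops::Range<u64>> {
    // A Parquet file ends with: <footer> <4-byte LE footer length> "PAR1".
    if trailing_8[4..] != *b"PAR1" {
        return None; // not a valid Parquet trailer
    }
    let footer_len = u32::from_le_bytes([
        trailing_8[0], trailing_8[1], trailing_8[2], trailing_8[3],
    ]) as u64;
    // The whole file minus the footer and the 8-byte trailer.
    let end = file_len.checked_sub(8 + footer_len)?;
    Some(0..end)
}

fn main() {
    // Example trailer for a file whose footer is 1024 bytes long.
    let mut trailer = [0u8; 8];
    trailer[..4].copy_from_slice(&1024u32.to_le_bytes());
    trailer[4..].copy_from_slice(b"PAR1");
    assert_eq!(block_data_range(1_000_000, &trailer), Some(0..998_968));
}
```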

Question

To avoid writing blocks through to disk synchronously (which would affect reads), we should use a memory + disk LRU cache, e.g. 1GB of memory and 10GB of disk, and make the block write-back to disk asynchronous; a rough sketch follows.
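
Below is a rough sketch of such a two-tier cache, assuming a plain LRU in memory with evicted blocks handed to a background thread for asynchronous write-back to the disk tier. All names and sizes are illustrative, this is not Databend's implementation, and the disk tier's own eviction (the 10GB cap) is omitted for brevity.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

struct BlockCache {
    capacity_bytes: usize,             // e.g. the 1GB memory tier
    used_bytes: usize,
    map: HashMap<String, Vec<u8>>,
    order: Vec<String>,                // front = least recently used
    write_back: mpsc::Sender<(String, Vec<u8>)>,
    disk_dir: PathBuf,                 // e.g. the 10GB disk tier lives here
}

impl BlockCache {
    fn new(capacity_bytes: usize, disk_dir: PathBuf) -> Self {
        let (tx, rx) = mpsc::channel::<(String, Vec<u8>)>();
        let dir = disk_dir.clone();
        // Asynchronous write-back: evicted blocks are persisted off the
        // read/write path, so callers never block on disk I/O.
        thread::spawn(move || {
            for (key, data) in rx {
                let _ = std::fs::write(dir.join(&key), data);
            }
        });
        BlockCache {
            capacity_bytes,
            used_bytes: 0,
            map: HashMap::new(),
            order: Vec::new(),
            write_back: tx,
            disk_dir,
        }
    }

    fn put(&mut self, key: String, block: Vec<u8>) {
        if let Some(old) = self.map.remove(&key) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k != &key);
        }
        self.used_bytes += block.len();
        self.order.push(key.clone());
        self.map.insert(key, block);
        // Evict least-recently-used blocks to the disk tier until we fit.
        while self.used_bytes > self.capacity_bytes && !self.order.is_empty() {
            let victim = self.order.remove(0);
            if let Some(data) = self.map.remove(&victim) {
                self.used_bytes -= data.len();
                let _ = self.write_back.send((victim, data)); // async write-back
            }
        }
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(data) = self.map.get(key) {
            let hit = data.clone();
            // Refresh recency on a memory hit (O(n) here; a sketch only).
            self.order.retain(|k| k != key);
            self.order.push(key.to_string());
            return Some(hit);
        }
        // Memory miss: try the disk tier before falling back to S3.
        std::fs::read(self.disk_dir.join(key)).ok()
    }
}

fn main() {
    let mut cache = BlockCache::new(32, std::env::temp_dir());
    cache.put("block_a".into(), vec![1u8; 24]);
    cache.put("block_b".into(), vec![2u8; 24]); // over 32 bytes: evicts block_a
    assert!(cache.get("block_b").is_some());    // memory hit
    // Give the background writer a moment to flush (sketch only; real code
    // would track in-flight write-backs instead of sleeping).
    thread::sleep(Duration::from_millis(100));
    assert!(cache.get("block_a").is_some());    // served from the disk tier
}
```

The key design point is that `put` and `get` never perform blocking disk writes: eviction only enqueues the block, and the background thread persists it, so reads are unaffected by write-back traffic.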

Finally, I'd like to show the performance gain Snowflake gets from caching on the hits dataset test:
Run Q1: SELECT COUNT(*) FROM hits.public.hits2;
Then Q2: SELECT COUNT(*) FROM hits.public.hits2 WHERE AdvEngineID <> 0;
[image: Snowflake timings for Q1 and Q2]

BohuTANG added the C-feature label on Nov 5, 2022
@BohuTANG (Member, Author) commented Nov 5, 2022

cc @Xuanwo @dantengsky

@Xuanwo (Member) commented Nov 18, 2022

Addressed by #8830
