Reclaim ColumnFileBig that cannot be contained by any segment #5950

Closed
breezewish opened this issue Sep 19, 2022 · 3 comments · Fixed by #6378

breezewish commented Sep 19, 2022

Enhancement

Suppose we write a ColumnFileBig into the memtable of a segment without any delta:

-Inf                            +Inf
  |<------------------------------>| Segment
                     |<-CFBig->|

Then physical split happens:

-Inf                            +Inf
  |<--------------->|<------------>| Segment
                     |<-CFBig->|

The ColumnFileBig is now referenced by both segments. Then:

  1. The right segment may trigger a delta merge because the delta layer is big.

  2. The left segment will not trigger a delta merge because the delta is still empty -- Its referenced CFBig is not contained in the segment.

As a result, the ColumnFileBig stays referenced and is not recycled until the user manually triggers a DeltaMerge for all segments.

This happens when we ingest SSTs quickly (using a higher ingest concurrency), and it resulted in 25% space amplification in my experiment.
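The scenario above can be reproduced with a toy model. This is only a sketch under stated assumptions: `BigFile`, `Segment`, `logicalDeltaRows` and `physicalSplit` are hypothetical names for illustration, not the real TiFlash classes, and keys are plain integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a ColumnFileBig: just the keys it stores.
struct BigFile
{
    std::vector<int64_t> keys;
};

// Hypothetical stand-in for a segment covering the key range [start, end).
struct Segment
{
    int64_t start;
    int64_t end;
    std::vector<std::shared_ptr<BigFile>> delta; // files referenced by the delta layer

    // Rows in the delta layer that fall inside this segment's range.
    // Rows outside the range are filtered out, so a referenced file can contribute zero.
    size_t logicalDeltaRows() const
    {
        size_t rows = 0;
        for (const auto & file : delta)
            rows += static_cast<size_t>(std::count_if(
                file->keys.begin(),
                file->keys.end(),
                [this](int64_t k) { return k >= start && k < end; }));
        return rows;
    }
};

// Split a segment at split_key. Both halves keep referencing the same files.
std::pair<Segment, Segment> physicalSplit(const Segment & seg, int64_t split_key)
{
    assert(seg.start < split_key && split_key < seg.end);
    Segment left{seg.start, split_key, seg.delta};
    Segment right{split_key, seg.end, seg.delta};
    return std::make_pair(left, right);
}
```

With a file whose keys all lie right of the split point, the left half reports zero logical delta rows while still holding a reference to the file, which is why a trigger based on delta size never fires for it.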

breezewish added the type/enhancement label Sep 19, 2022

flowbehappy commented

I wonder why "Its referenced CFBig is not contained in the segment."?


breezewish commented Sep 19, 2022

> I wonder why "Its referenced CFBig is not contained in the segment."?

We intentionally apply a filter when calculating the delta rows and delta bytes, so that packs not contained by the segment are excluded. To be more precise, this happens in:

void ColumnFileBig::calculateStat(const DMContext & context)

So in the example scenario, the left segment will have delta (logical) rows and bytes = 0 (which is reasonable).
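For readers without the source at hand, the filtering idea can be sketched roughly as follows. This is not the actual body of `ColumnFileBig::calculateStat`; `PackStat`, `KeyRange`, `ValidStat` and `calculateValidStat` are hypothetical names, and the sketch assumes a pack is counted whenever its key range intersects the segment range.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-pack metadata: inclusive key bounds plus size.
struct PackStat
{
    int64_t min_key;
    int64_t max_key;
    uint64_t rows;
    uint64_t bytes;
};

// Hypothetical segment range, [start, end).
struct KeyRange
{
    int64_t start;
    int64_t end;
};

struct ValidStat
{
    uint64_t rows = 0;
    uint64_t bytes = 0;
};

// Sum rows and bytes only over packs that intersect the segment range.
// Assumption: intersection is the criterion; the real filter may differ.
ValidStat calculateValidStat(const std::vector<PackStat> & packs, const KeyRange & range)
{
    ValidStat stat;
    for (const auto & pack : packs)
    {
        const bool intersects = pack.min_key < range.end && pack.max_key >= range.start;
        if (!intersects)
            continue; // pack lies entirely outside the segment
        stat.rows += pack.rows;
        stat.bytes += pack.bytes;
    }
    return stat;
}
```

If every pack of the file lies to the right of the split point, the left segment gets rows = 0 and bytes = 0 even though it still references the file.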

flowbehappy commented

I get it. Although a CFBig is contained by a segment, the data in the CFBig is not, because all of the data in the CFBig is filtered out by the segment's range.
