db: efficient skipping of points deleted by RANGEDEL in the same file #2424
Comments
For snapshots without disaggregated stores, we will still need the ability to hold Pebble snapshots for longer for range snapshots. With delegated snapshots, it is less likely that the leaseholder of a range will hold a long snapshot, and we could tune this so that it is almost never the leaseholder if it made a difference. So I don't think it is possible to remove all usages, but we could limit it so that MVCCStats holds a snapshot for a much shorter time, and range snapshots normally come only from followers. How long in practice do we see snapshots open for? I would be surprised if it is ever over 1 minute today. Is that long enough to meaningfully hold up compactions, or are we seeing them held open longer? |
Say the
I don't know. But yes, it should be much less than 1 min for a 512MB range (assuming good cache hit rate).
Compactions are not held up for snapshots or open iterators (which is the right design choice). But compactions need to consider open snapshots and not delete the data needed by a snapshot, which has caused problems as noted in this issue. |
Related to #1070. |
from @nicktrav: Stumbled upon another use of snapshots over here related to the TSDB: https://github.com/cockroachdb/cockroach/blob/ee9831d3151b3a559495bc001b9d4b661cf046b1/pkg/kv/kvserver/ts_maintenance_queue.go#L151-L156 |
Good catch @nicktrav! That use of a snapshot looks wholly unnecessary. |
The work here has been subsumed by a combination of the obsolete bit in sstables #2465 which provides for efficient skipping, and by EventuallyFileOnlySnapshot (which we are slowly moving all CRDB snapshot use cases to) #2740 which obviates the need for compactions to retain obsolete points needed by snapshots. |
When a RANGEDEL deletes points in a lower-level, but the physical deletion has not yet happened, mergingIter seeks to the end-key of the RANGEDEL, which results in efficient skipping of the deleted points during iteration.
This efficient skipping does not work if the RANGEDEL is in the same file as the points. We have seen customer issues where the RANGEDEL has fallen down to L6 and deletes a large number of points, but because of an open Pebble snapshot, the compaction is unable to physically delete those points. This happens in CockroachDB despite the rare use of Pebble snapshots (for GC and sending range snapshots). This results in sequential iteration over all the deleted points and checking whether each point is covered by the RANGEDEL, which is very inefficient, hence customer escalations related to slow queries.
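To make the cost difference concrete, here is a minimal sketch in Go. It is not Pebble's code: the types rangeDel and point and the functions scanWithSeek, scanSequential and buildPoints are hypothetical names invented for illustration. The seek variant assumes the tombstone lives in a higher level, so every point under it in the lower level is older than the tombstone; the sequential variant models the same-file case, where each point has to be checked individually.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeDel is a simplified range tombstone: it deletes points with user keys
// in [start, end) whose sequence number is lower than seq.
type rangeDel struct {
	start, end string
	seq        uint64
}

// point is a simplified point key.
type point struct {
	key string
	seq uint64
}

// covers reports whether the tombstone deletes p.
func (r rangeDel) covers(p point) bool {
	return r.start <= p.key && p.key < r.end && p.seq < r.seq
}

// scanWithSeek models the cross-level case. ASSUMPTION: all points are older
// than the tombstone, so on hitting the first covered point the scan can jump
// straight to the first key >= r.end. points must be sorted by key.
func scanWithSeek(points []point, r rangeDel) (visible []string, examined int) {
	for i := 0; i < len(points); {
		examined++
		p := points[i]
		if r.covers(p) {
			// Binary search stands in for seeking to the tombstone's end key.
			i = sort.Search(len(points), func(j int) bool { return points[j].key >= r.end })
			continue
		}
		visible = append(visible, p.key)
		i++
	}
	return visible, examined
}

// scanSequential models the same-file case: every point is visited and
// checked against the tombstone one at a time.
func scanSequential(points []point, r rangeDel) (visible []string, examined int) {
	for _, p := range points {
		examined++
		if !r.covers(p) {
			visible = append(visible, p.key)
		}
	}
	return visible, examined
}

// buildPoints returns "a", then n keys "k0000", "k0001", ..., then "z",
// all at sequence number 1.
func buildPoints(n int) []point {
	pts := []point{{key: "a", seq: 1}}
	for i := 0; i < n; i++ {
		pts = append(pts, point{key: fmt.Sprintf("k%04d", i), seq: 1})
	}
	return append(pts, point{key: "z", seq: 1})
}

func main() {
	pts := buildPoints(1000)
	r := rangeDel{start: "k", end: "l", seq: 100}
	v1, n1 := scanWithSeek(pts, r)
	v2, n2 := scanSequential(pts, r)
	fmt.Println("seek:      ", v1, "points examined:", n1)
	fmt.Println("sequential:", v2, "points examined:", n2)
}
```

Both scans return the same visible keys; the difference is only in how many points are touched, which is what grows with the number of deleted points in the same-file case.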
A solution is to use block property filters to efficiently skip blocks whose key range and seqnum range are fully covered by the RANGEDEL. The key range bound for a block is already known due to the index block (not tight, but should be good enough). The seqnum interval can be collected using a new seqnum block property collector. Intervals are delta-encoded and varint-encoded so should be cheap to store in the index block entry. Seqnum zeroing can sometimes make this cheaper, if the whole block has zero seqnums. If part of the block has seqnum zeroing, we potentially change an interval [S1, S2) to [0, S2), which could have similar cost (first will be encoded as varint(S1), varint(S2-S1) and the second as varint(S2)).
Alternatives: We have discussed eliminating all uses of Pebble snapshots in CockroachDB, which would eliminate this problem in the CockroachDB context. With the upcoming disaggregated storage
The counter argument to this alternative is that we are planning to make ranges much larger, and if we need a consistent storage view of the state machine, a Pebble snapshot is very useful (and ranges are not moved frequently so there is low probability of excising happening).
Next steps:
- Determine whether mvccGCQueue needs a consistent state using a snapshot, or can afford to use multiple iterators when scanning a range.
- [...] mergingIter, so maybe we can localize the changes to mergingIter, levelIter and the sstable iters. We should try to prototype this to see if the complexity is warranted.