kvserver: outgoing replica snapshot is not properly using an engine snapshot #75824
Comments
Your suggestion is to keep the long-lived snapshot but to make sure that the iterator isn't long-lived, right? Could you explain in more detail what the difference is? Both an iterator and a snapshot cause SSTables that are "reachable" from them to be preserved. Is the "pinning" behavior you mention related to how pebble manages its in-memory caches? How big a difference does it make? So far the problems I've observed with slow snapshots were related to SSTs not being deleted. Am I missing another dimension? It doesn't seem that this suggestion would make a change there.
Correct
A pebble.Snapshot only preserves the versions of a key visible to the seqnum of the snapshot. All other old versions can be deleted (these are pebble "versions", not MVCC versions). This is handled via compactions being snapshot aware. So if you consider a typical workload with a low write rate, most versions that are visible to the snapshot will continue to be the latest version, so the snapshot is not causing much "old data" to be retained.
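To make the retention rule above concrete, here is a minimal sketch of which versions of a single key a snapshot-aware compaction has to keep. It is a simplified model, not Pebble's implementation: `retainedVersions` is a hypothetical helper, and the visibility convention (a version is visible to a snapshot when its seqnum is less than or equal to the snapshot's seqnum) is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// retainedVersions is a hypothetical helper (not a Pebble API). Given the
// seqnums of all versions of one key and the seqnums of all open snapshots,
// it returns the versions a snapshot-aware compaction must keep, newest first:
// the latest version, plus the newest version visible to each snapshot.
// Assumption: a version is visible to a snapshot if version seqnum <= snapshot seqnum.
func retainedVersions(versions, snapshots []uint64) []uint64 {
	if len(versions) == 0 {
		return nil
	}
	sorted := append([]uint64(nil), versions...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })

	keep := map[uint64]bool{sorted[0]: true} // the latest version always survives
	for _, snap := range snapshots {
		for _, v := range sorted {
			if v <= snap {
				keep[v] = true // newest version this snapshot can see
				break
			}
		}
	}

	var out []uint64
	for _, v := range sorted {
		if keep[v] {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	versions := []uint64{5, 10, 20}
	// A snapshot taken before the latest write forces one extra version to stay.
	fmt.Println(retainedVersions(versions, []uint64{12}))
	// A snapshot that already sees the latest version retains nothing extra.
	fmt.Println(retainedVersions(versions, []uint64{25}))
}
```

In the low-write-rate case described above, most keys look like the second call: the snapshot sees the latest version, so no additional old data is retained.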
I was imagining something like checking the wall time occasionally while using the Iterator, and if the time since Iterator creation is over 1 minute, closing it and creating a new one.
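A rough sketch of that idea, using hypothetical stand-in types rather than the real Pebble or CockroachDB APIs: `snapshot` plays the role of the long-lived consistent view, `iterator` is the short-lived object that would pin sstables and memtables, and `scanWithRefresh` reopens the iterator at the next unvisited key once it is older than `maxAge`. All names here are assumptions for illustration. Because every iterator is created from the same snapshot, the view of the data does not change across reopens.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// snapshot is a hypothetical stand-in for a consistent, long-lived view.
type snapshot struct {
	keys []string // sorted
}

// iterator is a hypothetical stand-in for a short-lived iterator on a snapshot.
type iterator struct {
	snap      *snapshot
	pos       int
	createdAt time.Time
}

func (s *snapshot) newIter(now time.Time) *iterator {
	return &iterator{snap: s, createdAt: now}
}

// seekGE positions the iterator at the first key >= key.
func (it *iterator) seekGE(key string) { it.pos = sort.SearchStrings(it.snap.keys, key) }
func (it *iterator) valid() bool       { return it.pos < len(it.snap.keys) }
func (it *iterator) key() string       { return it.snap.keys[it.pos] }
func (it *iterator) next()             { it.pos++ }
func (it *iterator) close()            {} // a real iterator would release its pins here

// scanWithRefresh visits every key in order. After each key it checks the
// clock and, if the iterator is older than maxAge, closes it and opens a new
// one positioned at the next unvisited key. It returns the number of reopens.
func scanWithRefresh(snap *snapshot, maxAge time.Duration, now func() time.Time, visit func(string)) int {
	reopens := 0
	it := snap.newIter(now())
	it.seekGE("")
	for it.valid() {
		visit(it.key())
		it.next()
		if it.valid() && now().Sub(it.createdAt) > maxAge {
			resume := it.key()
			it.close()
			it = snap.newIter(now())
			it.seekGE(resume) // one extra seek per refresh
			reopens++
		}
	}
	it.close()
	return reopens
}

// runDemo drives scanWithRefresh with a fake clock that advances by
// stepSeconds every time a key is "sent", simulating a slow receiver.
func runDemo(keys []string, stepSeconds, maxAgeSeconds int) ([]string, int) {
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	clock := time.Unix(0, 0)
	var visited []string
	reopens := scanWithRefresh(
		&snapshot{keys: sorted},
		time.Duration(maxAgeSeconds)*time.Second,
		func() time.Time { return clock },
		func(k string) {
			visited = append(visited, k)
			clock = clock.Add(time.Duration(stepSeconds) * time.Second)
		},
	)
	return visited, reopens
}

func main() {
	visited, reopens := runDemo([]string{"a", "b", "c", "d", "e"}, 40, 60)
	fmt.Println(visited, reopens)
}
```

The cost of each refresh is one additional seek, which is why a coarse threshold such as one minute keeps the overhead negligible for a long-running scan.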
Thanks for explaining! Makes sense.
Yeah, good idea, the right place to do that would be this method: cockroach/pkg/kv/kvserver/store_snapshot.go Lines 313 to 370 in 1335460
We'll probably want to remove the |
Sumeer points out that this would be straightforward with the new replica data iterator pattern in #84070.
This fell through the cracks in the pre-stability scramble, unfortunately. Won't get it in for 22.2.
Seems doable: How necessary is it though, now that If we still want to do this, can this be pushed down to lower levels? For example, the implementation of Can be a bigger change, but the rationale is:
Another fix that I can imagine is for the Pebble iterator to be tied to a snapshot (instead of pinning the memtable and SSTs upfront) and, as it scans, to dynamically pin and unpin the parts it touches. @sumeerbhola is anything like that feasible in Pebble?
Yeah, I was thinking we'd return e.g.
7 of those iterators are typically trivial. The user point keyspace is the one that takes time. However, we now have a 1 hour timeout for sending Raft snapshots: cockroach/pkg/kv/kvserver/replica_command.go Lines 60 to 68 in 0036160
In the original incident, the snapshot had been stuck for at least 12 days. Now that we have a backstop timeout of 1 hour, I'm not sure if it's worth releasing the iterator pinning periodically. Wdyt @sumeerbhola?
Seeks are expensive, so we definitely don't want to do this for latency-sensitive operations. Also, we can only do it on readers that have consistent iterators since we'd otherwise change the view of the iterator (it would see newer data after being reopened). I think it makes sense to do this above Pebble, if we even want to do this.
Makes sense. Here are a few alternatives.
Maybe at the Pebble level there could be a way to avoid re-seeking, or to do it more cheaply (re-seeking only the one SST that was compacted instead of the whole stack). Also, the latency sensitivity could be a parameter when creating the iterator.
Does the Pebble iterator/snapshot also know this info?
With the exception of Pebble snapshots, consistent iterators are implemented in CRDB via iterator cloning. See e.g. cockroach/pkg/storage/pebble_batch.go Lines 178 to 192 in 5bafe90
So this may only be viable on Pebble snapshots, which are already consistent in Pebble by definition. Other reader types (e.g. engine and batch) probably won't be viable, because the view of the iterator will change when it's refreshed. I think it kind of has to, because otherwise it won't be able to release the pinned SSTs, but I'll defer to storage here.
We are inclined to close this, as the 1h timeout does the job and the initially suggested fix introduces complexity. @sumeerbhola Still interested in your input though.
Closing sounds fine. No need to add code that we don't really need.
related to the incident mentioned in #75728, https://github.com/cockroachlabs/support/issues/1403 (internal)
To send a replica snapshot we construct an engine snapshot at
cockroach/pkg/kv/kvserver/replica_raftstorage.go Line 420 in be1b6c4
and then immediately proceed to constructing a ReplicaEngineDataIterator that will construct a pebble.Iterator
cockroach/pkg/kv/kvserver/replica_raftstorage.go Line 577 in be1b6c4

The lifetime of the pebble.Iterator and the pebble.Snapshot are the same, so we don't get any benefit from the pebble.Snapshot pinning data but not pinning sstables or memtables, since the latter get pinned by the pebble.Iterator.

If sending a snapshot can be slow for whatever reason, we should consider periodically closing the pebble.Iterator and creating a new one.

cc to folks on that investigation: @aayushshah15 @tbg @nicktrav
Jira issue: CRDB-12851