
kvserver: outgoing replica snapshot is not properly using an engine snapshot #75824

Closed · sumeerbhola opened this issue Feb 1, 2022 · 12 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception.) · E-starter (Might be suitable for a starter project for new employees or team members.) · O-postmortem (Originated from a Postmortem action item.)

sumeerbhola commented Feb 1, 2022

Related to the incident mentioned in #75728 and https://github.com/cockroachlabs/support/issues/1403 (internal).

To send a replica snapshot we construct an engine snapshot at

snap := r.store.engine.NewSnapshot()

and then immediately proceed to constructing a ReplicaEngineDataIterator that will construct a pebble.Iterator
iter := rditer.NewReplicaEngineDataIterator(&desc, snap, true /* replicatedOnly */)

The pebble.Iterator and the pebble.Snapshot have the same lifetime, so we get no benefit from the fact that a pebble.Snapshot pins key versions without pinning sstables or memtables: the latter are pinned by the pebble.Iterator anyway.
If sending a snapshot is slow for whatever reason, we should consider periodically closing the pebble.Iterator and creating a new one.
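For concreteness, a rough sketch of the intended lifetime pattern (the engineSnapshot/engineIterator interfaces and streamReplicaData below are made-up simplifications, not the real CRDB or Pebble API): the snapshot stays open for the whole send, while any single iterator is closed and recreated once it has been open for too long.

package snapsketch

import "time"

// Hypothetical, simplified stand-ins for the engine snapshot and iterator;
// this only illustrates the lifetime pattern, not the real API.
type engineSnapshot interface {
	// NewIterAt returns an iterator positioned at the first key >= start
	// (nil start means the beginning of the replica's key space).
	NewIterAt(start []byte) engineIterator
	Close()
}

type engineIterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	Close()
}

// streamReplicaData sends every key visible to snap, but never keeps a single
// iterator open longer than maxIterAge. The snapshot (which only pins key
// versions) stays open for the whole send; each iterator (which pins whole
// memtables and sstables) is dropped and recreated periodically.
func streamReplicaData(
	snap engineSnapshot, send func(k, v []byte) error, maxIterAge time.Duration,
) error {
	var resume []byte // nil: start from the beginning
	for {
		iter := snap.NewIterAt(resume)
		opened := time.Now()
		exhausted := true
		for ; iter.Valid(); iter.Next() {
			if err := send(iter.Key(), iter.Value()); err != nil {
				iter.Close()
				return err
			}
			if time.Since(opened) > maxIterAge {
				// Resume just past the last key we sent, with a fresh iterator.
				resume = append(append([]byte(nil), iter.Key()...), 0)
				exhausted = false
				break
			}
		}
		iter.Close() // releases the memtables/sstables pinned by this iterator
		if exhausted {
			return nil
		}
	}
}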

cc to folks on that investigation: @aayushshah15 @tbg @nicktrav

Jira issue: CRDB-12851

sumeerbhola added the C-enhancement and A-kv-replication labels on Feb 1, 2022

tbg commented Feb 2, 2022

Your suggestion is to keep the long-lived snapshot but to make sure that the iterator isn't long-lived, right? Could you explain in more detail what the difference is? Both an iterator and a snapshot cause SSTables that are "reachable" from them to be preserved; is this "pinning" behavior you mention related to how pebble manages its in-memory caches? How big a difference does it make? So far the problems I've observed with slow snapshots were related to SSTs not being deleted; am I missing another dimension? It doesn't seem that this suggestion would make a change there.
Just trying to understand what this would improve in practice, as it seems a little cumbersome to manage the iterator lifecycle separately.

sumeerbhola commented Feb 2, 2022

Your suggestion is to keep the long-lived snapshot but to make sure that the iterator isn't long-lived, right?

Correct

Could you explain in more detail what the difference is?

A pebble.Snapshot only preserves the versions of a key visible to the seqnum of the snapshot. All other old versions can be deleted (these are pebble "versions", not MVCC versions). This is handled via compactions being snapshot aware. So if you consider a typical workload with a low write rate, most versions that are visible to the snapshot will continue to be the latest version, so the snapshot is not causing much "old data" to be retained.
In contrast, a pebble.Iterator causes all memtables and sstables as of the time of the iterator's creation to be kept until the iterator is closed, which means that if those memtables are flushed or those sstables are compacted, both the inputs and the outputs need to be preserved even though they contain the same key versions. So even the latest version of a key will end up with multiple retained copies until the Iterator closes. In addition to consuming more disk space, this can also cause memory pressure and OOMs.
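
To make the difference concrete, a minimal sketch against the cockroachdb/pebble API (NewIter's exact signature has varied across pebble releases, so treat this as illustrative rather than exact):

package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The snapshot only forces compactions to keep the key versions visible
	// at its seqnum; memtables can still be flushed and obsolete sstables
	// deleted while it is open.
	snap := db.NewSnapshot()
	defer snap.Close()

	// The iterator additionally holds the read state it was created from:
	// every memtable and sstable visible at creation time is retained, even
	// after flushes/compactions have rewritten the same data, until Close.
	iter, err := snap.NewIter(&pebble.IterOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer iter.Close()

	for valid := iter.First(); valid; valid = iter.Next() {
		_ = iter.Key() // stream the key somewhere
	}
}

So closing and reopening just the iterator (while keeping the snapshot open) is what lets the old memtables/sstables be reclaimed without changing the data the scan observes.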

Just trying to understand what this would improve in practice, as it seems a little cumbersome to manage the iterator lifecycle separately.

I was imagining something like looking at the walltime occasionally when using the Iterator, and if the time since Iterator creation is over 1min, closing it and creating a new one.

tbg commented Feb 2, 2022

Thanks for explaining! Makes sense.

I was imagining something like looking at the walltime occasionally when using the Iterator, and if the time since Iterator creation is over 1min, closing it and creating a new one.

Yeah, good idea, the right place to do that would be this method:

// Send implements the snapshotStrategy interface.
func (kvSS *kvBatchSnapshotStrategy) Send(
	ctx context.Context,
	stream outgoingSnapshotStream,
	header SnapshotRequest_Header,
	snap *OutgoingSnapshot,
) (int64, error) {
	assertStrategy(ctx, header, SnapshotRequest_KV_BATCH)
	// bytesSent is updated as key-value batches are sent with sendBatch. It
	// does not reflect the log entries sent (which are never sent in newer
	// versions of CRDB, as of VersionUnreplicatedTruncatedState).
	bytesSent := int64(0)
	// Iterate over all keys using the provided iterator and stream out batches
	// of key-values.
	kvs := 0
	var b storage.Batch
	defer func() {
		if b != nil {
			b.Close()
		}
	}()
	for iter := snap.Iter; ; iter.Next() {
		if ok, err := iter.Valid(); err != nil {
			return 0, err
		} else if !ok {
			break
		}
		kvs++
		unsafeKey := iter.UnsafeKey()
		unsafeValue := iter.UnsafeValue()
		if b == nil {
			b = kvSS.newBatch()
		}
		if err := b.PutEngineKey(unsafeKey, unsafeValue); err != nil {
			return 0, err
		}
		if bLen := int64(b.Len()); bLen >= kvSS.batchSize {
			if err := kvSS.sendBatch(ctx, stream, b); err != nil {
				return 0, err
			}
			bytesSent += bLen
			b.Close()
			b = nil
		}
	}
	if b != nil {
		if err := kvSS.sendBatch(ctx, stream, b); err != nil {
			return 0, err
		}
		bytesSent += int64(b.Len())
	}
	kvSS.status = redact.Sprintf("kv pairs: %d", kvs)
	return bytesSent, nil
}

We'll probably want to remove the Iter field altogether, and just create it there.
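
Roughly, the loop could grow something like this (newReplicaIterAt and copyEngineKey are hypothetical helpers, only sketching the shape; the first iterator would then also be created here rather than coming from snap.Iter):

const maxIterAge = time.Minute

iter := newReplicaIterAt(snap, nil /* from the start */) // hypothetical; replaces snap.Iter
iterStart := timeutil.Now()
defer func() { iter.Close() }()
for {
	if ok, err := iter.Valid(); err != nil {
		return 0, err
	} else if !ok {
		break
	}
	// ... build and send batches exactly as above ...

	if timeutil.Since(iterStart) > maxIterAge {
		// Trade the iterator in for a fresh one so the engine can reclaim
		// memtables/sstables; the engine snapshot keeps the view stable.
		resumeKey := copyEngineKey(iter.UnsafeKey()) // hypothetical: copy before Close
		iter.Close()
		iter = newReplicaIterAt(snap, resumeKey) // hypothetical: seeks to the first key after resumeKey
		iterStart = timeutil.Now()
		continue
	}
	iter.Next()
}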

tbg added the E-starter label on Feb 2, 2022
rupertharwood-crl added the O-postmortem label on Feb 3, 2022
erikgrinaker commented

Sumeer points out that this would be straightforward with the new replica data iterator pattern in #84070.

#84070 (review)

erikgrinaker self-assigned this on Jul 26, 2022
erikgrinaker commented

This fell through the cracks in the pre-stability scramble, unfortunately. Won't get it in for 22.2.

erikgrinaker removed their assignment on Aug 25, 2022

pav-kv commented Oct 19, 2022

Seems doable: kvBatchSnapshotStrategy.Send can yield when it has spent enough time reading in IterateReplicaKeySpans, and let the latter recreate the iter and seek to the position where it stopped.

How necessary is it though, now that IterateReplicaKeySpans already creates/closes multiple iterators? Currently 8: point and range iter for each of the 4 spans.

If we still want to do this, can this be pushed down to lower levels? For example, the implementation of EngineIterator (which here is a Pebble iterator) could do that on NextEngineKey calls and the like, or even further down the stack: wherever the SSTs are locked.

This could be a bigger change, but the rationale is:

  • this is Pebble-specific behaviour, so it seems unconventional to handle it above the interface
  • doing it at lower levels could benefit other users of the iterator (can you think of any?)

pav-kv commented Oct 19, 2022

Another fix I can imagine: the Pebble iterator is tied to a snapshot (instead of pinning memtables and SSTs upfront), and as it scans through things it dynamically pins/unpins the bits it touches. @sumeerbhola is anything like that feasible in Pebble?

erikgrinaker commented Oct 19, 2022

Seems doable: kvBatchSnapshotStrategy.Send can yield when it has spent enough time reading in IterateReplicaKeySpans, and let the latter recreate the iter and seek to the position where it stopped.

Yeah, I was thinking we'd return e.g. iterutil.Reopen() from the IterateReplicaKeySpans callback whenever we wanted it to reopen the iterator.
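
Something like this, roughly (errReopen, visitSpanWithReopen, and timeBoundedVisitor are all made-up names just to show the shape; iterutil.Reopen doesn't exist yet):

package iterreopen

import (
	"errors"
	"time"
)

// errReopen stands in for the proposed iterutil.Reopen() sentinel.
var errReopen = errors.New("reopen iterator")

// visitSpanWithReopen sketches the IterateReplicaKeySpans side: scan is
// assumed to walk keys from start with a freshly created (and then closed)
// iterator, returning the last key it visited along with the first error
// from visit (or nil at the end of the span).
func visitSpanWithReopen(
	scan func(start []byte, visit func(k, v []byte) error) (lastKey []byte, err error),
	visit func(k, v []byte) error,
) error {
	var start []byte
	for {
		lastKey, err := scan(start, visit)
		if err == nil {
			return nil // span exhausted
		}
		if !errors.Is(err, errReopen) {
			return err
		}
		// The previous iterator (and everything it pinned) is gone once scan
		// returns; resume just past the last visited key with a new one.
		start = append(append([]byte(nil), lastKey...), 0)
	}
}

// timeBoundedVisitor wraps a send function so the callback itself decides
// when the iterator has lived long enough.
func timeBoundedVisitor(send func(k, v []byte) error, maxAge time.Duration) func(k, v []byte) error {
	openedAt := time.Now()
	return func(k, v []byte) error {
		if err := send(k, v); err != nil {
			return err
		}
		if time.Since(openedAt) > maxAge {
			openedAt = time.Now() // the helper reopens before calling us again
			return errReopen
		}
		return nil
	}
}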

How necessary is it though, now that IterateReplicaKeySpans already creates/closes multiple iterators? Currently 8: point and range iter for each of the 4 spans.

7 of those iterators are typically trivial. The user point keyspace is the one that takes time.

However, we now have a 1 hour timeout for sending Raft snapshots:

// sendSnapshotTimeout is the timeout for sending snapshots. While a snapshot is
// in transit, Raft log truncation is halted to allow the recipient to catch up.
// If the snapshot takes very long to transfer for whatever reason this can
// cause the Raft log to grow very large. We therefore set a conservative
// timeout to eventually allow Raft log truncation while avoiding snapshot
// starvation -- even if another snapshot is sent immediately, this still
// allows truncation up to the new snapshot index.
var sendSnapshotTimeout = envutil.EnvOrDefaultDuration(
"COCKROACH_RAFT_SEND_SNAPSHOT_TIMEOUT", 1*time.Hour)

In the original incident, the snapshot had been stuck for at least 12 days. Now that we have a backstop timeout of 1 hour, I'm not sure if it's worth releasing the iterator pinning periodically. Wdyt @sumeerbhola?

If we still want to do this, can this be pushed down to lower levels? For example, the implementation of EngineIterator (which here is a Pebble iterator) could do that on NextEngineKey calls and the like, or even further down the stack: wherever the SSTs are locked.

Seeks are expensive, so we definitely don't want to do this for latency-sensitive operations. Also, we can only do it on readers that have consistent iterators since we'd otherwise change the view of the iterator (it would see newer data after being reopened). I think it makes sense to do this above Pebble, if we even want to do this.

pav-kv commented Oct 19, 2022

Makes sense. Here are a few alternatives.

Seeks are expensive, so we definitely don't want to do this for latency-sensitive operations.

Maybe at the Pebble level there could be a way to avoid re-seeking or to do it more cheaply (we would only need to do it for the one SST being compacted, instead of the whole stack). Also, latency sensitivity could be a parameter when creating the iterator.

Also, we can only do it on readers that have consistent iterators

Does Pebble iterator/snapshot also know this info?

erikgrinaker commented Oct 20, 2022

Also, we can only do it on readers that have consistent iterators

Does Pebble iterator/snapshot also know this info?

With the exception of Pebble snapshots, consistent iterators are implemented in CRDB via iterator cloning. See e.g. pebbleBatch:

if iter.inuse {
	return newPebbleIteratorByCloning(p.iter, opts, StandardDurability, p.SupportsRangeKeys())
}
if iter.iter != nil {
	iter.setOptions(opts, StandardDurability)
} else {
	iter.initReuseOrCreate(
		handle, p.iter, p.iterUsed, opts, StandardDurability, p.SupportsRangeKeys())
	if p.iter == nil {
		// For future cloning.
		p.iter = iter.iter
	}
	p.iterUsed = true
}

So this may only be viable on Pebble snapshots, which are already consistent in Pebble by definition. Other reader types (e.g. engine and batch) probably won't be viable, because the view of the iterator will change when it's refreshed. I think it kind of has to, because otherwise it won't be able to release the pinned SSTs, but I'll defer to storage here.

pav-kv commented Oct 24, 2022

We are inclined to close this, as the 1h timeout does the job and the initially suggested fix introduces complexity.

@sumeerbhola Still interested in your input though.

sumeerbhola commented

Closing sounds fine. No need to add code that we don't really need.

pav-kv closed this as completed on Oct 25, 2022