
kvserver: outgoing replica snapshot is not properly using an engine snapshot #75824

Closed · sumeerbhola opened this issue Feb 1, 2022 · 12 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception.) · E-starter (Might be suitable for a starter project for new employees or team members.) · O-postmortem (Originated from a Postmortem action item.)

sumeerbhola commented Feb 1, 2022

Related to the incident mentioned in #75728 and https://github.com/cockroachlabs/support/issues/1403 (internal).

To send a replica snapshot we construct an engine snapshot at

snap := r.store.engine.NewSnapshot()

and then immediately proceed to constructing a ReplicaEngineDataIterator that will construct a pebble.Iterator
iter := rditer.NewReplicaEngineDataIterator(&desc, snap, true /* replicatedOnly */)

The pebble.Iterator and the pebble.Snapshot have the same lifetime, so we get no benefit from the fact that a pebble.Snapshot pins key versions without pinning sstables or memtables: the latter are pinned by the pebble.Iterator anyway.
If sending a snapshot is slow for whatever reason, we should consider periodically closing the pebble.Iterator and creating a new one.
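For concreteness, a rough sketch of the intended lifetime pattern (the engineSnapshot/engineIterator interfaces and streamReplicaData below are made-up simplifications, not the real CRDB or Pebble API): the snapshot stays open for the whole send, while any single iterator is closed and recreated once it has been open for too long.

package snapsketch

import "time"

// Hypothetical, simplified stand-ins for the engine snapshot and iterator;
// this only illustrates the lifetime pattern, not the real API.
type engineSnapshot interface {
	// NewIterAt returns an iterator positioned at the first key >= start
	// (nil start means the beginning of the replica's key space).
	NewIterAt(start []byte) engineIterator
	Close()
}

type engineIterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	Close()
}

// streamReplicaData sends every key visible to snap, but never keeps a single
// iterator open longer than maxIterAge. The snapshot (which only pins key
// versions) stays open for the whole send; each iterator (which pins whole
// memtables and sstables) is dropped and recreated periodically.
func streamReplicaData(
	snap engineSnapshot, send func(k, v []byte) error, maxIterAge time.Duration,
) error {
	var resume []byte // nil: start from the beginning
	for {
		iter := snap.NewIterAt(resume)
		opened := time.Now()
		exhausted := true
		for ; iter.Valid(); iter.Next() {
			if err := send(iter.Key(), iter.Value()); err != nil {
				iter.Close()
				return err
			}
			if time.Since(opened) > maxIterAge {
				// Resume just past the last key we sent, with a fresh iterator.
				resume = append(append([]byte(nil), iter.Key()...), 0)
				exhausted = false
				break
			}
		}
		iter.Close() // releases the memtables/sstables pinned by this iterator
		if exhausted {
			return nil
		}
	}
}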

cc to folks on that investigation: @aayushshah15 @tbg @nicktrav

Jira issue: CRDB-12851

sumeerbhola added the C-enhancement and A-kv-replication labels on Feb 1, 2022

tbg commented Feb 2, 2022

Your suggestion is to keep the long-lived snapshot but to make sure that the iterator isn't long-lived, right? Could you explain in more detail what the difference is? Both an iterator and a snapshot cause SSTables that are "reachable" from them to be preserved; is this "pinning" behavior you mention related to how pebble manages its in-memory caches? How big a difference does it make? So far the problems I've observed with slow snapshots were related to SSTs not being deleted; am I missing another dimension? It doesn't seem that this suggestion would make a change there.
Just trying to understand what this would improve in practice, as it seems a little cumbersome to manage the iterator lifecycle separately.

sumeerbhola commented Feb 2, 2022

Your suggestion is to keep the long-lived snapshot but to make sure that the iterator isn't long-lived, right?

Correct

Could you explain in more detail what the difference is?

A pebble.Snapshot only preserves the versions of a key visible to the seqnum of the snapshot. All other old versions can be deleted (these are pebble "versions", not MVCC versions). This is handled via compactions being snapshot aware. So if you consider a typical workload with a low write rate, most versions that are visible to the snapshot will continue to be the latest version, so the snapshot is not causing much "old data" to be retained.
In contrast, a pebble.Iterator causes all memtables and sstables as of the time of the iterator's creation to be kept until the iterator is closed, which means that if those memtables are flushed or those sstables are compacted, both the inputs and the outputs need to be preserved even though they contain the same key versions. So even the latest version of a key will end up with multiple retained copies until the Iterator closes. In addition to consuming more disk space, this can also cause memory pressure and OOMs.
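
To make the difference concrete, a minimal sketch against the cockroachdb/pebble API (NewIter's exact signature has varied across pebble releases, so treat this as illustrative rather than exact):

package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The snapshot only forces compactions to keep the key versions visible
	// at its seqnum; memtables can still be flushed and obsolete sstables
	// deleted while it is open.
	snap := db.NewSnapshot()
	defer snap.Close()

	// The iterator additionally holds the read state it was created from:
	// every memtable and sstable visible at creation time is retained, even
	// after flushes/compactions have rewritten the same data, until Close.
	iter, err := snap.NewIter(&pebble.IterOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer iter.Close()

	for valid := iter.First(); valid; valid = iter.Next() {
		_ = iter.Key() // stream the key somewhere
	}
}

So closing and reopening just the iterator (while keeping the snapshot open) is what lets the old memtables/sstables be reclaimed without changing the data the scan observes.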

Just trying to understand what this would improve in practice, as it seems a little cumbersome to manage the iterator lifecycle separately.

I was imagining something like looking at the walltime occasionally when using the Iterator, and if the time since Iterator creation is over 1min, closing it and creating a new one.

tbg commented Feb 2, 2022

Thanks for explaining! Makes sense.

I was imagining something like looking at the walltime occasionally when using the Iterator, and if the time since Iterator creation is over 1min, closing it and creating a new one.

Yeah, good idea, the right place to do that would be this method:

// Send implements the snapshotStrategy interface.
func (kvSS *kvBatchSnapshotStrategy) Send(
	ctx context.Context,
	stream outgoingSnapshotStream,
	header SnapshotRequest_Header,
	snap *OutgoingSnapshot,
) (int64, error) {
	assertStrategy(ctx, header, SnapshotRequest_KV_BATCH)
	// bytesSent is updated as key-value batches are sent with sendBatch. It
	// does not reflect the log entries sent (which are never sent in newer
	// versions of CRDB, as of VersionUnreplicatedTruncatedState).
	bytesSent := int64(0)
	// Iterate over all keys using the provided iterator and stream out batches
	// of key-values.
	kvs := 0
	var b storage.Batch
	defer func() {
		if b != nil {
			b.Close()
		}
	}()
	for iter := snap.Iter; ; iter.Next() {
		if ok, err := iter.Valid(); err != nil {
			return 0, err
		} else if !ok {
			break
		}
		kvs++
		unsafeKey := iter.UnsafeKey()
		unsafeValue := iter.UnsafeValue()
		if b == nil {
			b = kvSS.newBatch()
		}
		if err := b.PutEngineKey(unsafeKey, unsafeValue); err != nil {
			return 0, err
		}
		if bLen := int64(b.Len()); bLen >= kvSS.batchSize {
			if err := kvSS.sendBatch(ctx, stream, b); err != nil {
				return 0, err
			}
			bytesSent += bLen
			b.Close()
			b = nil
		}
	}
	if b != nil {
		if err := kvSS.sendBatch(ctx, stream, b); err != nil {
			return 0, err
		}
		bytesSent += int64(b.Len())
	}
	kvSS.status = redact.Sprintf("kv pairs: %d", kvs)
	return bytesSent, nil
}

We'll probably want to remove the Iter field altogether, and just create it there.
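
Roughly, the loop could grow something like this (newReplicaIterAt and copyEngineKey are hypothetical helpers, only sketching the shape; the first iterator would then also be created here rather than coming from snap.Iter):

const maxIterAge = time.Minute

iter := newReplicaIterAt(snap, nil /* from the start */) // hypothetical; replaces snap.Iter
iterStart := timeutil.Now()
defer func() { iter.Close() }()
for {
	if ok, err := iter.Valid(); err != nil {
		return 0, err
	} else if !ok {
		break
	}
	// ... build and send batches exactly as above ...

	if timeutil.Since(iterStart) > maxIterAge {
		// Trade the iterator in for a fresh one so the engine can reclaim
		// memtables/sstables; the engine snapshot keeps the view stable.
		resumeKey := copyEngineKey(iter.UnsafeKey()) // hypothetical: copy before Close
		iter.Close()
		iter = newReplicaIterAt(snap, resumeKey) // hypothetical: seeks to the first key after resumeKey
		iterStart = timeutil.Now()
		continue
	}
	iter.Next()
}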

tbg added the E-starter label on Feb 2, 2022
rupertharwood-crl added the O-postmortem label on Feb 3, 2022
erikgrinaker commented

Sumeer points out that this would be straightforward with the new replica data iterator pattern in #84070.

#84070 (review)

erikgrinaker self-assigned this on Jul 26, 2022
erikgrinaker commented

This fell through the cracks in the pre-stability scramble, unfortunately. Won't get it in for 22.2.

erikgrinaker removed their assignment on Aug 25, 2022

pav-kv commented Oct 19, 2022

Seems doable: kvBatchSnapshotStrategy.Send can yield when it has spent enough time reading in IterateReplicaKeySpans, and let the latter recreate the iter and seek to the position where it stopped.

How necessary is it though, now that IterateReplicaKeySpans already creates/closes multiple iterators? Currently 8: point and range iter for each of the 4 spans.

If we still want to do this, can this be pushed down to lower levels? For example, the implementation of EngineIterator (which here is a Pebble iterator) could do that on NextEngineKey calls and the like, or even further down the stack: wherever the SSTs are locked.

This could be a bigger change, but the rationale is:

  • this is Pebble-specific behaviour, so it seems unconventional to handle it above the interface
  • doing it at lower levels could benefit other users of the iterator (can you think of any?)

pav-kv commented Oct 19, 2022

Another fix I can imagine: the Pebble iterator is tied to a snapshot (instead of pinning memtables and SSTs upfront), and as it scans through things it dynamically pins/unpins the bits it touches. @sumeerbhola is anything like that feasible in Pebble?

erikgrinaker commented Oct 19, 2022

Seems doable: kvBatchSnapshotStrategy.Send can yield when it has spent enough time reading in IterateReplicaKeySpans, and let the latter recreate the iter and seek to the position where it stopped.

Yeah, I was thinking we'd return e.g. iterutil.Reopen() from the IterateReplicaKeySpans callback whenever we wanted it to reopen the iterator.
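
Something like this, roughly (errReopen, visitSpanWithReopen, and timeBoundedVisitor are all made-up names just to show the shape; iterutil.Reopen doesn't exist yet):

package iterreopen

import (
	"errors"
	"time"
)

// errReopen stands in for the proposed iterutil.Reopen() sentinel.
var errReopen = errors.New("reopen iterator")

// visitSpanWithReopen sketches the IterateReplicaKeySpans side: scan is
// assumed to walk keys from start with a freshly created (and then closed)
// iterator, returning the last key it visited along with the first error
// from visit (or nil at the end of the span).
func visitSpanWithReopen(
	scan func(start []byte, visit func(k, v []byte) error) (lastKey []byte, err error),
	visit func(k, v []byte) error,
) error {
	var start []byte
	for {
		lastKey, err := scan(start, visit)
		if err == nil {
			return nil // span exhausted
		}
		if !errors.Is(err, errReopen) {
			return err
		}
		// The previous iterator (and everything it pinned) is gone once scan
		// returns; resume just past the last visited key with a new one.
		start = append(append([]byte(nil), lastKey...), 0)
	}
}

// timeBoundedVisitor wraps a send function so the callback itself decides
// when the iterator has lived long enough.
func timeBoundedVisitor(send func(k, v []byte) error, maxAge time.Duration) func(k, v []byte) error {
	openedAt := time.Now()
	return func(k, v []byte) error {
		if err := send(k, v); err != nil {
			return err
		}
		if time.Since(openedAt) > maxAge {
			openedAt = time.Now() // the helper reopens before calling us again
			return errReopen
		}
		return nil
	}
}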

How necessary is it though, now that IterateReplicaKeySpans already creates/closes multiple iterators? Currently 8: point and range iter for each of the 4 spans.

7 of those iterators are typically trivial. The user point keyspace is the one that takes time.

However, we now have a 1 hour timeout for sending Raft snapshots:

// sendSnapshotTimeout is the timeout for sending snapshots. While a snapshot is
// in transit, Raft log truncation is halted to allow the recipient to catch up.
// If the snapshot takes very long to transfer for whatever reason this can
// cause the Raft log to grow very large. We therefore set a conservative
// timeout to eventually allow Raft log truncation while avoiding snapshot
// starvation -- even if another snapshot is sent immediately, this still
// allows truncation up to the new snapshot index.
var sendSnapshotTimeout = envutil.EnvOrDefaultDuration(
"COCKROACH_RAFT_SEND_SNAPSHOT_TIMEOUT", 1*time.Hour)

In the original incident, the snapshot had been stuck for at least 12 days. Now that we have a backstop timeout of 1 hour, I'm not sure if it's worth releasing the iterator pinning periodically. Wdyt @sumeerbhola?

If we still want to do this, can this be pushed down to lower levels? For example, the implementation of EngineIterator (which here is a Pebble iterator) could do that on NextEngineKey calls and the like, or even further down the stack: wherever the SSTs are locked.

Seeks are expensive, so we definitely don't want to do this for latency-sensitive operations. Also, we can only do it on readers that have consistent iterators since we'd otherwise change the view of the iterator (it would see newer data after being reopened). I think it makes sense to do this above Pebble, if we even want to do this.

pav-kv commented Oct 19, 2022

Makes sense. Here are a few alternatives.

Seeks are expensive, so we definitely don't want to do this for latency-sensitive operations.

Maybe at the Pebble level there could be a way to avoid re-seeking or to do it more cheaply (we would only need to do it for the one SST being compacted, instead of the whole stack). Also, latency sensitivity could be a parameter when creating the iterator.

Also, we can only do it on readers that have consistent iterators

Does Pebble iterator/snapshot also know this info?

erikgrinaker commented Oct 20, 2022

Also, we can only do it on readers that have consistent iterators

Does Pebble iterator/snapshot also know this info?

With the exception of Pebble snapshots, consistent iterators are implemented in CRDB via iterator cloning. See e.g. pebbleBatch:

if iter.inuse {
	return newPebbleIteratorByCloning(p.iter, opts, StandardDurability, p.SupportsRangeKeys())
}
if iter.iter != nil {
	iter.setOptions(opts, StandardDurability)
} else {
	iter.initReuseOrCreate(
		handle, p.iter, p.iterUsed, opts, StandardDurability, p.SupportsRangeKeys())
	if p.iter == nil {
		// For future cloning.
		p.iter = iter.iter
	}
	p.iterUsed = true
}

So this may only be viable on Pebble snapshots, which are already consistent in Pebble by definition. Other reader types (e.g. engine and batch) probably won't be viable, because the view of the iterator will change when it's refreshed. I think it kind of has to, because otherwise it won't be able to release the pinned SSTs, but I'll defer to storage here.

pav-kv commented Oct 24, 2022

We are inclined to close this, as the 1h timeout does the job and the initially suggested fix introduces complexity.

@sumeerbhola Still interested in your input though.

sumeerbhola commented

Closing sounds fine. No need to add code that we don't really need.

pav-kv closed this as completed on Oct 25, 2022