bulk/kv: uploading to cloud storage in ExportRequest evaluation considered harmful #66486

Closed
nvanbenschoten opened this issue Jun 15, 2021 · 6 comments
Assignees
Labels
A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-disaster-recovery T-kv KV Team

Comments

@nvanbenschoten
Member

In #66338, we found that rate-limiting ExportRequests during evaluation could lead to transitive stalls in workload traffic, if interleaved poorly with a Split. That PR moved rate-limiting of ExportRequest above latching to avoid holding latches when not strictly necessary. We've also discussed ideas around dropping latches earlier for reads, before evaluation, in #66485. That issue seems challenging and large in scope, but promising.
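For concreteness, here is a minimal sketch (hypothetical names, not the actual CockroachDB code) of the ordering that change moved to: the request is admitted through the Export limiter before any latches are acquired, so a queued Export never holds latches while it waits.

package exportlimit

import "context"

// exportSem is a hypothetical counting semaphore bounding concurrent Exports.
var exportSem = make(chan struct{}, 3)

// handleExport sketches the ordering: rate-limit first, then latch and evaluate.
func handleExport(
	ctx context.Context,
	acquireLatches func() (release func()),
	evaluate func() error,
) error {
	// Admit through the limiter while holding no latches, so a backlog of
	// Exports cannot transitively stall foreground traffic behind latches.
	select {
	case exportSem <- struct{}{}:
		defer func() { <-exportSem }()
	case <-ctx.Done():
		return ctx.Err()
	}
	// Only now acquire latches and evaluate the request.
	release := acquireLatches()
	defer release()
	return evaluate()
}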

However, for now, we need to continue to be careful about the duration that read-only requests hold read latches.

During a code audit earlier today, we found that ExportRequest can be configured to upload files to cloud storage directly during evaluation, while holding latches:

if exportStore != nil {
	exported.Path = GenerateUniqueSSTName(base.SQLInstanceID(cArgs.EvalCtx.NodeID()))
	if err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), maxUploadRetries, func() error {
		// We blindly retry any error here because we expect the caller to have
		// verified the target is writable before sending ExportRequests for it.
		if err := cloud.WriteFile(ctx, exportStore, exported.Path, bytes.NewReader(data)); err != nil {
			log.VEventf(ctx, 1, "failed to put file: %+v", err)
			return err
		}
		return nil
	}); err != nil {
		return result.Result{}, err
	}
}
This seems potentially disastrous, as it means that we will be performing network operations during evaluation. In fact, we'll even retry this upload up to 5 times (maxUploadRetries). So it's hard to place any limit on the duration that a given Export request may run for. As a result, it's hard to place any limit on the duration that a given Export request may transitively block foreground reads and writes.

I'd like to learn whether we need this capability and push to get rid of it. Even once we address #66485, it still seems like an abuse to touch the network during request evaluation, which is meant to operate in a sandboxed scope of a replica. That is simply not what the framework is meant for.

Interestingly, we do have a separate code path that avoids this. We have a way to specify that an ExportRequest should return an SST (using ReturnSST) instead of immediately uploading it. We then can perform the upload from the DistSQL backupProcessor:

// writeFile writes the data specified in the export response file to the backup
// destination. The ExportRequest will do this if its ReturnSST argument is set
// to false. In that case, we want to write the file from the processor.

This seems like a much more appropriate way to evaluate a backup. It also seems like it doesn't trade much in terms of performance when the backupProcessor is scheduled on the same node as the range's leaseholder. Either way, we're still pulling chunks into memory and then uploading them. The only difference is that we'll pull the chunk up a few levels higher in the stack.
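As a rough sketch (hypothetical names, not the real backupProcessor code), the ReturnSST flow looks something like this: evaluation returns the SST bytes in the ExportResponse, and the processor performs the upload after the request has long since released its latches.

package backupsketch

import (
	"bytes"
	"context"
	"io"
)

// ExportedFile stands in for the per-file payload in an ExportResponse when
// ReturnSST is set: the SST data comes back to the processor instead of being
// uploaded during evaluation.
type ExportedFile struct {
	Path string
	SST  []byte
}

// writeFiles uploads the returned SSTs from the processor. uploadFn stands in
// for the external-storage write (e.g. the cloud.WriteFile call in the excerpt
// above), but here it runs outside of latching and command evaluation.
func writeFiles(
	ctx context.Context,
	files []ExportedFile,
	uploadFn func(ctx context.Context, path string, r io.Reader) error,
) error {
	for _, f := range files {
		// The network operation happens here, in the processor, not on the
		// leaseholder's evaluation path.
		if err := uploadFn(ctx, f.Path, bytes.NewReader(f.SST)); err != nil {
			return err
		}
	}
	return nil
}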

Am I understanding all of this correctly? If so, what can we do here?

/cc. @dt @aayushshah15 @andreimatei

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. labels Jun 15, 2021
@nvanbenschoten
Member Author

It's worth pointing out that this composes really badly with evaluation-time rate-limiting of ExportRequests. It should be less severe after #66338, since a few slow Exports can no longer cascade into blocking many ranges.

@dt
Member

dt commented Jun 15, 2021

The ReturnSST flag is currently only set by the non-system tenant's backupDataProcessor, where we want the per-tenant SQL pod process, not the shared KV layer, to be the one that does or doesn't get to make outbound network requests. We did some handwringing when we made that change over the cost of the extra network hop, memory, etc. The overhead of backups is already a sensitive issue for some customers, so making it higher was something we were pretty wary of doing. Tenants may not have existing expectations, but any regression for existing non-tenant users might be poorly received.

If anything, I think we were planning to make ExportRequest run even longer in the near future. Currently it iterates into a buffer until it hits some size limit, then stops, opens a remote file, and writes the contents of the buffer. This has two drawbacks. One is that it forces the buffer, and thus the resulting file, to be sized to fit in memory, which is not great for us in the context of #44480. Another less common but more severe issue is ranges whose write traffic, TTL, etc. leave them with large numbers of revisions of a single key. This is a problem because files have key boundaries, so all the revisions of any one key must be exported in one file. Ever since the max range size was raised, ranges can accumulate far more revisions of a key than Export will write given its max file size, meaning clusters can get themselves into a state where they can't back themselves up.

To fix both of these, we've been working on changing it so that we write directly while iterating: we've changed the API for our external IO to return an io.Writer, changed storage.ExportToSST to take an io.Writer, and are planning to change ExportRequest to open the remote file first and then pass it to the iteration, eliminating the buffer and, along with it, the size limit. This, however, would mean we'll be exporting for potentially even longer.
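A sketch of that streaming shape, again with hypothetical names and signatures (the real work is in the external IO and storage.ExportToSST changes described above): open the remote file first to get an io.Writer, then stream SST chunks into it as the iteration produces them, so no full-file buffer or buffer-derived size limit is needed.

package exportsketch

import (
	"context"
	"io"
)

// exportToWriter stands in for an ExportToSST-style function that has been
// changed to take an io.Writer rather than filling an in-memory buffer.
func exportToWriter(ctx context.Context, iterate func(emit func([]byte) error) error, w io.Writer) error {
	return iterate(func(chunk []byte) error {
		// Stream each chunk as the iteration produces it.
		_, err := w.Write(chunk)
		return err
	})
}

// runExport opens the remote file first (openRemote is a stand-in for an
// external-storage Writer API), then passes the writer into the iteration,
// eliminating the buffer and its size limit. Close errors are elided here.
func runExport(
	ctx context.Context,
	openRemote func(ctx context.Context, path string) (io.WriteCloser, error),
	iterate func(emit func([]byte) error) error,
	path string,
) error {
	w, err := openRemote(ctx, path)
	if err != nil {
		return err
	}
	defer w.Close()
	return exportToWriter(ctx, iterate, w)
}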

It seems like #66485 should help a lot here, shouldn't it?

@miretskiy
Contributor

We should be able to evaluate the impact of always using ReturnSST, though? Perhaps a few runs of a 2-4TB backup with and without this flag?

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@dt
Member

dt commented Jun 16, 2021

In the meantime we can/should add a setting to opt into proxying writes through the SQL processor, to make it easier to measure the overhead and/or to give users who want that behavior -- and whatever overhead it carries -- the choice. Opened #66540.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 30, 2021
Related to cockroachdb#66486.

Command evaluation is meant to operate in a sandbox. It certainly
shouldn't have access to a DB handle.
@lunevalex lunevalex added O-postmortem Originated from a Postmortem action item. N-followup Needs followup. labels Jul 1, 2021
craig bot pushed a commit that referenced this issue Jul 1, 2021
67094: kv: remove EvalContext.DB r=nvanbenschoten a=nvanbenschoten

Related to #66486.

Command evaluation is meant to operate in a sandbox. It certainly shouldn't have access to a `DB` handle.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten
Member Author

Some of the discussion moved to https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1623782623130900.

@aliher1911
Contributor

Writes were moved out of request evaluation and into the processor.
