kvcoord: add `TxnCoordSender` method to refresh read spans #68051

erikgrinaker · 2021-07-26T10:23:07Z

To implement the Index Lookups Memory Limits and Parallelism RFC, the streamer will have to use leaf transactions to run parallel requests (as a root TxnCoordSender does not support this).

This will prevent the use of automatic span refreshing, instead propagating these errors to the client who will have to retry the transaction. We can avoid this by adding a method to the TxnCoordSender that, given a refresh timestamp, explicitly refreshes the read spans and handles errors as appropriate. The streamer will then coordinate in-flight requests, update the root txn with leaf state via UpdateRootWithLeafFinalState, and explicitly submit a refresh. The method should ensure we return appropriate info in any errors propagated to clients (i.e. the original keys that caused the refresh, as well as which keys caused the refresh to fail).

For more details, see the Hiding ReadWithinUncertaintyInterval errors section in the RFC.

/cc @cockroachdb/kv

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2021-07-26T18:49:44Z

This is related to #24798.

erikgrinaker · 2021-09-28T12:47:15Z

@andreimatei So it turns out we already have an (unused) method TxnCoordSender.ManualRefresh() that does this:

cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go

Lines 1276 to 1298 in efefd3d

    
           func (tc *TxnCoordSender) ManualRefresh(ctx context.Context) error { 
        
           	tc.mu.Lock() 
        
           	defer tc.mu.Unlock() 
        
           	// Hijack the pre-emptive refresh code path to perform the refresh but 
        
           	// provide the force flag to ensure that the refresh occurs unconditionally. 
        
           	// We provide an empty BatchRequest - maybeRefreshPreemptivelyLocked just 
        
           	// needs the transaction proto. The function then returns a BatchRequest 
        
           	// with the updated transaction proto. We use this updated proto to call 
        
           	// into updateStateLocked directly. 
        
           	var ba roachpb.BatchRequest 
        
           	ba.Txn = tc.mu.txn.Clone() 
        
           	const force = true 
        
           	refreshedBa, pErr := tc.interceptorAlloc.txnSpanRefresher.maybeRefreshPreemptivelyLocked(ctx, ba, force) 
        
           	if pErr != nil { 
        
           		pErr = tc.updateStateLocked(ctx, ba, nil, pErr) 
        
           	} else { 
        
           		var br roachpb.BatchResponse 
        
           		br.Txn = refreshedBa.Txn 
        
           		pErr = tc.updateStateLocked(ctx, ba, &br, nil) 
        
           	} 
        
           	return pErr.GoError() 
        
           }

Since the streamer library will be calling TxnCoordSender.UpdateRootWithLeafFinalState to update the txn state, which would update Txn.WriteTimestamp if it's set, this may already be sufficient (with a few small tweaks to handle leaf txns, i.e. canAutoRetry). Or we could modify it to take in the refresh timestamp rather than reading it from Txn.WriteTimestamp. Am I missing anything?

andreimatei · 2021-09-28T14:34:31Z

What you say makes sense to me.

Or we could modify it to take in the refresh timestamp rather than reading it from Txn.WriteTimestamp.

Taking the timestamp to refresh to as an argument sounds good; if we don't need it, even better I'd say.

erikgrinaker · 2021-10-04T19:07:42Z

I've looked this over, and from what I can tell we have the main pieces we need already in place. I think the streamer would do something like this:

br, pErr := leafTxn.Send(ctx, ba)
if pErr != nil {
	canRefreshTxn, refreshTxn := roachpb.CanTransactionRefresh(ctx, pErr)
	if !canRefreshTxn {
		return pErr.GoError()
	}
	rootTxn.Sender().UpdateRootWithLeafFinalState(ctx, &roachpb.LeafTxnFinalState{
		Txn: *refreshTxn,
		RefreshSpans: nil, // set appropriate spans
	})
	if err := rootTxn.ManualRefresh(); err != nil {
		return err
	}
}

As such, I'm going to close this for now, but please reopen (or open a new issue) if you find that this isn't sufficient @yuzefovich.

yuzefovich · 2022-01-12T05:15:26Z

Hey @erikgrinaker, I just rebased my prototype for this on top of latest master, and roachpb.CanTransactionRefresh is gone, and I can't seem to find its replacement 😅 Could you please point me to it?

yuzefovich · 2022-01-12T06:21:13Z

Never mind, I found that it was refactored in #73557.

yuzefovich · 2022-01-13T03:29:07Z

@erikgrinaker I still need your help after all 😄

My current prototype is something like

	br, pErr := leafTxn.Send(ctx, ba)
	if pErr != nil {
		canRefreshTxn, refreshTS := roachpb.TransactionRefreshTimestamp(pErr)
		if !canRefreshTxn {
			return pErr.GoError()
		}

		leafTxnFinalState, err := leafTxn.GetLeafTxnFinalState(ctx)
		if err != nil {
			return err
		}
		refreshTxn := pErr.GetTxn()
		refreshTxn.Refresh(refreshTS)
		leafTxnFinalState.Txn = *refreshTxn
		rootTxn.Sender().UpdateRootWithLeafFinalState(ctx, leafTxnFinalState)
		if err := rootTxn.ManualRefresh(ctx); err != nil {
			return err
		}
	}

Does this look right? What happens to the old leafTxn is the refresh succeeds? Do I have to create a new leafTxn?

erikgrinaker · 2022-01-13T11:08:13Z

Hm, I'm not sure if that's going to work. Because refreshTxn.Refresh() forwards the read and write timestamps:

cockroach/pkg/roachpb/data.go

Lines 1135 to 1139 in 01cb847

    
           func (t *Transaction) Refresh(timestamp hlc.Timestamp) { 
        
           	t.WriteTimestamp.Forward(timestamp) 
        
           	t.ReadTimestamp.Forward(t.WriteTimestamp) 
        
           	t.WriteTooOld = false 
        
           }

But then ManualRefresh() is a noop if these timestamps are equal:

cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go

Lines 441 to 444 in 52c2df4

    
           // If the transaction has yet to be pushed, no refresh is necessary. 
        
           if ba.Txn.ReadTimestamp == ba.Txn.WriteTimestamp { 
        
           	return ba, nil 
        
           }

@nvanbenschoten Any thoughts on how to accomplish the above following #73557? Do we need to update ManualRefresh() or wire up a new method for this?

nvanbenschoten · 2022-01-14T17:39:25Z

This is interesting. I agree that refreshTxn.Refresh() is not the right thing to do for the reason you mentioned. Is it needed though? What would happen if we just did the following? Won't the leafTxn already be updated with an increased WriteTimestamp that reflects the pErr by the time we hear about it?

	br, pErr := c.Send(ctx, ba)
	if pErr != nil {
		canRefreshTxn, refreshTS := roachpb.TransactionRefreshTimestamp(pErr)
		if !canRefreshTxn {
			return pErr.GoError()
		}

		leafTxnFinalState, err := leafTxn.GetLeafTxnFinalState(ctx)
		if err != nil {
			return err
		}
		rootTxn.Sender().UpdateRootWithLeafFinalState(ctx, leafTxnFinalState)
		if err := rootTxn.ManualRefresh(ctx); err != nil {
			return err
		}
	}

yuzefovich · 2022-01-31T18:37:08Z

Is it on me to try the latest proposal from Nathan and see if it works? Or are you still thinking about how to achieve the manual refresh?

erikgrinaker · 2022-02-01T15:56:45Z

Sorry, this fell through the cracks. I would go with Nathan's suggestion for now, and I'll look this over real quick when I have a chance.

erikgrinaker · 2022-02-07T10:44:00Z

Yeah, I had a quick look, and I think that should work -- the leaf txn should have its WriteTimestamp bumped by the error, which gets propagated to the root txn here and then refreshed. But it's definitely worth writing an integration test for this.

yuzefovich · 2022-05-17T03:24:14Z

@erikgrinaker @nvanbenschoten Apologies for having to page the context back in, but I finally got to working on this, and maybe the solution proposed by Nathan doesn't quite work (or at least I'm misusing it).

In https://github.com/yuzefovich/cockroach/tree/streamer-refresh in the last two commits I'm introducing a test for this behavior as well as a fix (which is currently incomplete, but I don't think it's important). With the snippet similar to #68051 (comment), when trying to re-execute the BatchRequest, I'm getting the same injected RWUI error coming from here. It seems as if the TxnCoordSender of the leafTxn is poisoned and ManualRefresh didn't fix it.

Is it the problem with the test where I inject RWUI error and it is not reflected in the refresh spans of LeafTxnFinalState? Or should I be creating a new LeafTxn once the root txn is refreshed?

nvanbenschoten · 2022-05-18T22:26:04Z

when trying to re-execute the BatchRequest, I'm getting the same injected RWUI error coming from here

Are you creating a new leaf transaction after the manual refresh? Once a leaf transaction has hit this error, I think it needs to be ingested into the root txn (through UpdateRootWithLeafFinalState) and then discarded.

Also, could you try changing this line from roachpb.NewError(...) to roachpb.NewErrorWithTxn(..., ba.Txn)?

yuzefovich · 2022-05-19T01:57:18Z

Are you creating a new leaf transaction after the manual refresh?

No, I wasn't creating a new leaf and was using the old one. After creating a fresh LeafTxn the retried BatchRequest goes through, but it still has the old timestamp which goes against my intuition. It looks like we probably should not be just ignoring refreshTS value from the snippet - how should we incorporate it? My expectation is that after ManualRefresh and creating a new LeafTxn, the new LeafTxn should be using refreshTS value as its ReadTimestamp, but it is currently not the case.

Also, could you try changing this line from roachpb.NewError(...) to roachpb.NewErrorWithTxn(..., ba.Txn)?

This doesn't seem to change anything (neither w/ nor w/o re-creating a leaf txn).

nvanbenschoten · 2022-05-20T15:56:46Z

but it still has the old timestamp which goes against my intuition

This goes against my intuition as well. I think it has to do with the NewErrorWithTxn. In cases where a transaction's timestamp is bumped due to a tscache conflict (write-read conflict), a closed timestamp conflict (write-read conflict), a write-too-old conflict (write-write conflict), and most others, the returned Txn already has its WriteTimestamp bumped. That will cause TxnCoordSender.updateStateLocked to bump the coordinator's timestamp.

We're not doing this in the test, so I think that explains things. However, I'm not sure that we do bump the write timestamp on a ReadWithinUncertaintyInterval error before performing the rest of the refresh, because a txn can't proceed with evaluation until the error is refreshed away. So the error does not carry a provisional updated Txn.

With that in mind, it does feel like we should be using the refreshTS straight from TransactionRefreshTimestamp and feeding that into a new ManualRefreshTo(context.Context, hlc.Timestamp) method which first forwards the txn's WriteTimestamp to the provided timestamp before performing the refresh.

yuzefovich · 2022-05-20T20:09:56Z

I see, I think this makes sense, thanks Nathan. It almost worked right away - I started getting

TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh - unknown error: <nil>): "unnamed" meta={id=9e4d6b4c key=/Min pri=0.00941665 epo=0 ts=1653076575.760907001,1 min=1653076575.760907000,0 seq=0} lock=false stat=PENDING rts=1653076575.760907000,0 wto=false gul=1653076576.260907000,0

during ManualRefreshTo. Apparently, txnSpanRefresher.refreshedTimestamp was empty causing this error during a refresh, so for now I added a bump to that timestamp in pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go. Could you please take a quick look at this snippet?

nvanbenschoten · 2022-06-08T23:47:57Z

I don't think that snippet is quite right. The handling of txnSpanRefresher.refreshedTimestamp is pretty funky so I see why you wrote the code that way. However, as written, I think you're bumping the refreshTxn's ReadTimestamp to the write timestamp and then using this to set refreshedTimestamp. So the refresh would be a no-op in this case.

Let me see whether I can clean up this handling of txnSpanRefresher.refreshedTimestamp so that you don't need to deal with this.

Related to cockroachdb#68051. This is a partial reversion of d6ec977 which downgrades the role of `txnSpanRefresher.refreshedTimestamp` back to being used as a sanity check that we don't allow incoherent refresh spans into the refresh footprint. We no longer use the field to determine where to refresh from. Instead, we use the pre-refreshed BatchRequest.Txn.ReadTimestamp to determine the lower-bound of the refresh. This avoids some awkward logic in txnSpanRefresher.SendLocked (e.g. the logic needed in b9fb236). It also avoids the kinds of issues we saw when trying to expand the use of manual refreshing in cockroachdb#68051. Release note: None.

nvanbenschoten · 2022-06-09T04:25:01Z

#82649 should avoid the need to hack around the empty txnSpanRefresher.refreshedTimestamp.

Related to cockroachdb#68051. This is a partial reversion of d6ec977 which downgrades the role of `txnSpanRefresher.refreshedTimestamp` back to being used as a sanity check that we don't allow incoherent refresh spans into the refresh footprint. We no longer use the field to determine where to refresh from. Instead, we use the pre-refreshed BatchRequest.Txn.ReadTimestamp to determine the lower-bound of the refresh. This avoids some awkward logic in txnSpanRefresher.SendLocked (e.g. the logic needed in b9fb236). It also avoids the kinds of issues we saw when trying to expand the use of manual refreshing in cockroachdb#68051. Release note: None.

82649: kv: only use txnSpanRefresher.refreshedTimestamp for assertions r=nvanbenschoten a=nvanbenschoten Related to #68051. This is a partial reversion of d6ec977 which downgrades the role of `txnSpanRefresher.refreshedTimestamp` back to being used as a sanity check that we don't allow incoherent refresh spans into the refresh footprint. We no longer use the field to determine where to refresh from. Instead, we use the pre-refreshed BatchRequest.Txn.ReadTimestamp to determine the lower-bound of the refresh. This avoids some awkward logic in txnSpanRefresher.SendLocked (e.g. the logic needed in b9fb236). It also avoids the kinds of issues we saw when trying to expand the use of manual refreshing in #68051. Release note: None. Co-authored-by: Nathan VanBenschoten <[email protected]>

github-actions · 2023-12-04T11:05:47Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Jul 26, 2021

erikgrinaker self-assigned this Jul 26, 2021

erikgrinaker closed this as completed Oct 4, 2021

knz reopened this May 18, 2022

exalate-issue-sync bot unassigned erikgrinaker Jun 2, 2022

nvanbenschoten mentioned this issue Jun 9, 2022

kv: only use txnSpanRefresher.refreshedTimestamp for assertions #82649

Merged

github-actions bot added the no-issue-activity label Dec 4, 2023

yuzefovich removed the no-issue-activity label Dec 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvcoord: add `TxnCoordSender` method to refresh read spans #68051

kvcoord: add `TxnCoordSender` method to refresh read spans #68051

erikgrinaker commented Jul 26, 2021 •

edited

Loading

nvanbenschoten commented Jul 26, 2021

erikgrinaker commented Sep 28, 2021

andreimatei commented Sep 28, 2021

erikgrinaker commented Oct 4, 2021

yuzefovich commented Jan 12, 2022

yuzefovich commented Jan 12, 2022

yuzefovich commented Jan 13, 2022

erikgrinaker commented Jan 13, 2022

nvanbenschoten commented Jan 14, 2022

yuzefovich commented Jan 31, 2022

erikgrinaker commented Feb 1, 2022

erikgrinaker commented Feb 7, 2022

yuzefovich commented May 17, 2022

nvanbenschoten commented May 18, 2022

yuzefovich commented May 19, 2022 •

edited

Loading

nvanbenschoten commented May 20, 2022

yuzefovich commented May 20, 2022

nvanbenschoten commented Jun 8, 2022

nvanbenschoten commented Jun 9, 2022

github-actions bot commented Dec 4, 2023

kvcoord: add TxnCoordSender method to refresh read spans #68051

kvcoord: add TxnCoordSender method to refresh read spans #68051

Comments

erikgrinaker commented Jul 26, 2021 • edited Loading

nvanbenschoten commented Jul 26, 2021

erikgrinaker commented Sep 28, 2021

andreimatei commented Sep 28, 2021

erikgrinaker commented Oct 4, 2021

yuzefovich commented Jan 12, 2022

yuzefovich commented Jan 12, 2022

yuzefovich commented Jan 13, 2022

erikgrinaker commented Jan 13, 2022

nvanbenschoten commented Jan 14, 2022

yuzefovich commented Jan 31, 2022

erikgrinaker commented Feb 1, 2022

erikgrinaker commented Feb 7, 2022

yuzefovich commented May 17, 2022

nvanbenschoten commented May 18, 2022

yuzefovich commented May 19, 2022 • edited Loading

nvanbenschoten commented May 20, 2022

yuzefovich commented May 20, 2022

nvanbenschoten commented Jun 8, 2022

nvanbenschoten commented Jun 9, 2022

github-actions bot commented Dec 4, 2023

kvcoord: add `TxnCoordSender` method to refresh read spans #68051

kvcoord: add `TxnCoordSender` method to refresh read spans #68051

erikgrinaker commented Jul 26, 2021 •

edited

Loading

yuzefovich commented May 19, 2022 •

edited

Loading