kv: txn giving up on refresh span collection causes closed ts to kick it out #44645

Closed
knz opened this issue Feb 3, 2020 · 15 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-kv KV Team

Comments

@knz
Contributor

knz commented Feb 3, 2020

Found by user:

  1. txn starts
  2. txn performs so many operations that it exceeds max_refresh_span_bytes, and refresh span collection stops (see the sketch after this list)
  3. txn lasts for more than 30s
  4. closed ts "catches up", doesn't find refresh spans, and "kicks the txn out" (pushes it, and the client receives an error)
  5. the error is not the usual retry error, since it is not caused by contention, but the error message does not clarify what is happening
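To make step 2 concrete, here is a minimal, hypothetical Go sketch of the give-up behavior being described: once the tracked spans exceed a byte budget (standing in for max_refresh_span_bytes), the tracker forgets everything and marks refreshes as impossible. All names and types here are invented for illustration; this is not CockroachDB's implementation.

package main

import "fmt"

type span struct{ key, endKey string }

type refreshTracker struct {
    spans    []span
    bytes    int64
    maxBytes int64 // stands in for the max_refresh_span_bytes budget
    gaveUp   bool
}

func (t *refreshTracker) record(s span) {
    if t.gaveUp {
        return
    }
    t.bytes += int64(len(s.key) + len(s.endKey))
    t.spans = append(t.spans, s)
    if t.bytes > t.maxBytes {
        // Too much memory would be needed to remember what was read:
        // drop everything and stop collecting from here on.
        t.spans, t.bytes, t.gaveUp = nil, 0, true
    }
}

func (t *refreshTracker) canRefresh() bool { return !t.gaveUp }

func main() {
    t := &refreshTracker{maxBytes: 32}
    for i := 0; i < 10; i++ {
        t.record(span{key: fmt.Sprintf("table/1/%04d", i)})
    }
    fmt.Println("can refresh after many reads:", t.canRefresh()) // false
}

A txn in this state cannot move its read timestamp forward later, which is what sets up step 4.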

There are three separate issues here:

  • we want a larger default for max_refresh_span_bytes so that the scenario becomes less likely. This is predicated on better memory tracking in KV, a separate work item (planned for 20.1, see the work @tbg has started in [dnm] kv: expose (and use) byte batch response size limit #44341). I think this is orthogonal and should be kept out of scope here.

  • when the scenario happens we want the error message to be clearer about what needs to happen: either decrease the duration of the txn, or decrease its number of refresh spans (fewer reads/writes), or increase max_refresh_span_bytes, or increase the closed ts delay (see the error-annotation sketch after this list)

  • or we could avoid the situation entirely? Make the closed ts lag behind the long-running txn if it has disabled refresh spans collection.
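As a rough illustration of the error-message bullet above, here is a hypothetical Go sketch of wrapping the retry error with the remedies listed there. The function, flags, and wording are invented; none of this is an existing CockroachDB API.

package main

import (
    "errors"
    "fmt"
)

var errRetry = errors.New("TransactionRetryError: retry txn")

// maybeAnnotateRetry adds a hint only for the specific case discussed here:
// the txn was pushed by the closed timestamp and had already given up on
// collecting refresh spans.
func maybeAnnotateRetry(err error, refreshSkipped, pushedByClosedTS bool) error {
    if err == nil || !refreshSkipped || !pushedByClosedTS {
        return err
    }
    return fmt.Errorf("%w; the transaction outlived the closed timestamp target and had "+
        "dropped its refresh spans: use shorter transactions, read/write less data, "+
        "raise the refresh-span byte budget, or raise the closed timestamp target", err)
}

func main() {
    fmt.Println(maybeAnnotateRetry(errRetry, true, true))
}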

cc @ajwerner @tbg for triage.

Jira issue: CRDB-5215

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. C-support labels Feb 3, 2020
@RoachietheSupportRoach
Collaborator

Zendesk ticket #4611 has been linked to this issue.

@ajwerner
Contributor

ajwerner commented Feb 3, 2020

txn lasts for more than 30s

To be more precise, we should say that the txn lasts more than kv.closed_timestamp.target_interval, which defaulted to 30s in 19.1 and 19.2 and will default to 3s in 20.1.

closed ts "Catches up", doesn't find refresh spans and "kicks the txn out" (pushes it and client receives an error)

The closed timestamp subsystem doesn't know anything about refresh spans. The closed timestamp subsystem periodically attempts to make history below some timestamp immutable (more specifically, no new intents may be laid down before the closed timestamp, though intents which were already written can still be resolved at that timestamp). When a write fails due to the closed timestamp, the transaction will be forced to refresh. The mechanism is identical to a read in the timestamp cache leading to a push (in fact, the closed timestamp value for a write is used in Replica.applyTimestampCache):

// minReadTS is used as a per-request low water mark for the value returned from
// the timestamp cache. That is, if the timestamp cache returns a value below
// minReadTS, minReadTS (without an associated txn id) will be used instead to
// adjust the batch's timestamp.
func (r *Replica) applyTimestampCache(

When a transaction coordinator detects that it has been pushed, it will perform a refresh (the exact details of when this refresh occurs relative to other operations are not currently paged into my head). If there are too many spans to refresh (as defined by the kv.transaction.max_refresh_span_bytes cluster setting), then we won't even try.
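A minimal sketch of the two steps just described, with invented names and types (this is neither the body of applyTimestampCache nor the real coordinator code): the timestamp-cache result is floored at the closed timestamp, and a pushed transaction only attempts a refresh if it still has its refresh spans.

package main

import "fmt"

type timestamp int64 // wall-clock nanos, greatly simplified

// clampToClosedTS mirrors the low-water-mark idea from the quoted comment:
// never use a timestamp-cache result below minReadTS when deciding how far
// to push a write.
func clampToClosedTS(tsCacheResult, minReadTS timestamp) timestamp {
    if tsCacheResult < minReadTS {
        return minReadTS
    }
    return tsCacheResult
}

// maybeRefresh models the coordinator's choice after being pushed: with its
// refresh spans dropped (byte budget exceeded), it does not even try.
func maybeRefresh(pushedTo timestamp, haveRefreshSpans bool) bool {
    if !haveRefreshSpans {
        return false // a retry error is surfaced to the client instead
    }
    _ = pushedTo // ... re-validate the tracked read spans at pushedTo ...
    return true
}

func main() {
    pushTo := clampToClosedTS(5, 10) // the closed timestamp wins
    fmt.Println("pushed to:", pushTo, "refreshed:", maybeRefresh(pushTo, false))
}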

the error is not the usual retry error because it is not caused by contention, but the error message does not clarify what is happening

Is the error not a usual retry error?

we want a larger default for max_refresh_span_bytes so that the scenario becomes less likely. This is predicated on better memory tracking in KV, a separate work item (planned for 20.1, see the work @tbg has started on #44341 ). I think this is orthogonal and should be kept out of scope here.

+1

when the scenario happens we want the error message to be clearer about what needs to happen: either decrease the duration of the txn, or decrease its number of refresh spans (fewer reads/writes), or increase max_refresh_span_bytes, or increase the closed ts delay

I think this is a good idea. We could wrap the error in another layer to indicate that we didn't even try to refresh due to the refresh span byte limit.

or we could avoid the situation entirely? Make the closed ts lag behind the long-running txn if it has disabled refresh spans collection.

I don't think we've ever discussed a mechanism which always avoids pushing long-running txns. We have talked about detecting the pushes and then backing off the closed timestamp (#36478). The problem with this is that it makes an already best-effort mechanism much less predictable. There is also talk about reducing the closed timestamp interval even more dramatically. Generally I'm opposed to ideas which prevent pushing of transactions. The refresh mechanism has become something of a cornerstone of our transaction protocol.

@knz
Contributor Author

knz commented Feb 3, 2020

Is the error not a usual retry error?

It's a retry error, but not the usual one. We've been very vocal in docs, support, etc. that retry errors are an artifact of contention. That's the usual case.

This one here happens without any contention whatsoever. It's misleading to bin it in the same conceptual category as our usual retry errors.

I'm not sure I would suggest changing the type of the error object, but we should absolutely clarify that it's not the run-of-the-mill retry error.

@ajwerner
Contributor

ajwerner commented Feb 3, 2020

Is the error not a usual retry error?

It's a retry error, but not the usual one. We've been very vocal in docs, support, etc. that retry errors are an artifact of contention. That's the usual case.

This one here happens without any contention whatsoever. It's misleading to bin it in the same conceptual category as our usual retry errors.

I'm not sure I would suggest changing the type of the error object, but we should absolutely clarify that it's not the run-of-the-mill retry error.

In terms of the cause of the error, I agree that it differs from contention caused by a read. In terms of implementation and impact it is identical. It is absolutely the case that, in general, it should be treated as a run-of-the-mill retry when the refresh fails. Retry errors which occur for long-running transactions that cannot refresh due to having too many refresh spans should probably be handled differently from retry errors which simply failed to refresh due to contention. From there we should further break down whether the push that led to the refresh was caused by contention or by a closed timestamp.

I think all I'm saying is that there are two different interactions which are both worth differentiating.

Don't want to hijack this thread, but does it mean that CDC is "lagged" by ~30 seconds, compared to some fictional global real time event (commit) happens?

CDC emits rows with a timestamp field of exactly the hlc timestamp at which the transaction which performed the write commits. The latency from write to emit is currently lower-bounded by waiting for the schema to be proven for the given timestamp, which is implemented as polling. To reduce that latency from write to emit you can lower changefeed.experimental_poll_interval, though it will have some cost. The default there is 1s, so each write will generally experience a uniform emit delay between 0-1s. Another feature of changefeeds is the resolved timestamp (see the option in the docs here). A resolved timestamp informs the client that all rows with timestamps older than the resolved timestamp have been seen (no rows with older timestamps will ever be emitted afterwards). The client configures how frequently these timestamps should be emitted. The actual timestamp which is emitted will lag the present by at least kv.closed_timestamp.target_interval, as we cannot resolve timestamps which have not been closed.
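To illustrate the resolved-timestamp contract from the consumer side, here is a hypothetical Go sketch; the event shape is invented and is not the changefeed wire format. The point is only that a resolved timestamp lets the consumer checkpoint, because no older rows will arrive afterwards.

package main

import "fmt"

type event struct {
    resolved bool
    ts       int64 // hlc wall nanos, simplified
    row      string
}

func main() {
    var checkpoint int64
    for _, ev := range []event{
        {row: "a", ts: 101},
        {row: "b", ts: 103},
        {resolved: true, ts: 105}, // nothing older than 105 will appear after this
        {row: "c", ts: 107},
    } {
        if ev.resolved {
            checkpoint = ev.ts
            fmt.Println("safe to checkpoint through", checkpoint)
            continue
        }
        fmt.Println("row", ev.row, "at", ev.ts)
    }
}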

@andreimatei
Contributor

I've extracted being smarter about the refresh span tracking into #46095, if y'all don't mind.
This issue can remain for better error reporting in situations where all else fails.

@ajwerner
Contributor

ajwerner commented Apr 8, 2020

To clarify for a future reader: the remaining work item on this issue, AFAICT, is to propagate a different, clearer error when a query is forced to retry due to the closed timestamp rather than the timestamp cache, and then to provide documentation to help the user better understand the source of the restart and the available remedies.
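A hypothetical sketch of what a "different, clearer error" could look like internally: classify why the txn was pushed and pick the client-facing hint accordingly. The names are invented for illustration and are not CockroachDB code.

package main

import "fmt"

type pushReason int

const (
    pushReasonTimestampCache  pushReason = iota // conflicting reads: ordinary contention
    pushReasonClosedTimestamp                   // the txn simply outlived the closed ts target
)

func retryHint(r pushReason) string {
    switch r {
    case pushReasonClosedTimestamp:
        return "transaction ran longer than the closed timestamp target; " +
            "shorten the transaction or raise the target"
    default:
        return "transaction conflicted with concurrent access; retrying may succeed"
    }
}

func main() {
    fmt.Println(retryHint(pushReasonClosedTimestamp))
}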

@irfansharif
Contributor

irfansharif commented Apr 30, 2020

One thing I'm a bit confused about, reading through zendesk#4611 and this thread.

The idea is that the closed ts infrastructure [...] is able to avoid conflicting with the txn [...] by using precise refresh span information.

The closed timestamp subsystem doesn't know anything about refresh spans.

Which one is it? I'm assuming the latter? Which brings me to my next question: now that in 20.1 kv.closed_timestamp.target_duration defaults to 3s, are we then forcing all txns taking over 3s to refresh?

@nvanbenschoten
Member

Which one is it? I'm assuming the latter?

Yes, the closed timestamp subsystem doesn't have any understanding of refresh spans.

are we then forcing all txns taking over 3s to refresh?

Yes, all read-write transactions taking over 3s will be forced to refresh.

@knz
Contributor Author

knz commented Apr 30, 2020

That feels super aggressive though. If there was any client interaction, that means there will be a retry error?

@ajwerner
Contributor

ajwerner commented Apr 30, 2020

That feels super aggressive though. If there was any client interaction, that means there will be a retry error?

Only if new writes are issued and then the reads cannot be refreshed successfully. If the refresh fails, it generally indicates that there was some contention.

@ajwerner
Contributor

#46095 will likely mitigate a large number of refresh failures due to refresh span size.

@knz
Contributor Author

knz commented May 1, 2020

Only if new writes are issued and then the reads cannot be refreshed successfully. If the refresh fails, it generally indicates that there was some contention.

I read this as "the behavior under contention will produce more retry errors than it used to in the past". This sounds like a regression from a UX perspective.

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Contributor Author

knz commented Sep 19, 2023

@nvanbenschoten I think we fixed this, right?

@nvanbenschoten
Member

Yes, we addressed the parts of this that seem actionable. I'll close for now, but we can open more specific issues if this kind of behavior remains problematic.
