kv: txn giving up on refresh span collection causes closed ts to kick it out #44645
Zendesk ticket #4611 has been linked to this issue.
To be more precise, we should say that the txn lasts more than the closed timestamp target duration.
The closed timestamp subsystem doesn't know anything about refresh spans. It periodically attempts to make history below some timestamp immutable (more specifically, no new intents may be laid down below the closed timestamp, though intents which were already written can still be resolved at that timestamp). When a write fails due to the closed timestamp, the transaction is forced to refresh. The mechanism is identical to a read in the timestamp cache leading to a push (in fact, the closed timestamp value for a write is consulted in cockroach/pkg/storage/replica_tscache.go, lines 253 to 257 at 31513b7).
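As a rough illustration of that push (a minimal sketch, not CockroachDB's actual code; the function name and the use of time.Time instead of an HLC timestamp are my own simplifications):

```go
package closedtssketch

import "time"

// forwardPastClosedTS bumps a write that lands at or below the closed
// timestamp to just above it, the same way a timestamp-cache hit would.
// The second return value indicates that the transaction was pushed and
// will have to refresh (or restart) before committing.
func forwardPastClosedTS(writeTS, closedTS time.Time) (time.Time, bool) {
	if writeTS.After(closedTS) {
		return writeTS, false // write is above the closed timestamp; no push
	}
	return closedTS.Add(time.Nanosecond), true // pushed above the closed timestamp
}
```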
When a transaction coordinator detects that it has been pushed, it will perform a refresh (the exact details of when this refresh occurs relative to other operations are not currently paged into my head). If there are too many spans to refresh (as defined by the max_refresh_span_bytes limit), the coordinator gives up on collecting refresh spans altogether, at which point the transaction can no longer refresh and has to restart when it is pushed.
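A condensed sketch of that coordinator-side bookkeeping (invented types; the real tracking in CockroachDB is more involved and the accounting differs):

```go
package refreshsketch

// span is a simplified key span read by the transaction.
type span struct{ key, endKey string }

// refreshTracker accumulates read spans until a byte budget (in the spirit of
// max_refresh_span_bytes) is exceeded, after which tracking is abandoned and
// the transaction loses the ability to refresh.
type refreshTracker struct {
	spans    []span
	bytes    int64
	maxBytes int64
	gaveUp   bool
}

func (t *refreshTracker) record(s span) {
	if t.gaveUp {
		return
	}
	t.bytes += int64(len(s.key) + len(s.endKey))
	if t.bytes > t.maxBytes {
		// Too many spans to remember: drop them all and stop tracking.
		t.spans, t.gaveUp = nil, true
		return
	}
	t.spans = append(t.spans, s)
}

// canRefresh reports whether a push can still be absorbed by re-validating
// the tracked reads; once the budget was blown, the answer is always no.
func (t *refreshTracker) canRefresh() bool { return !t.gaveUp }
```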
Is the error not a usual retry error?
+1
I think this is a good idea. We could wrap the error in another layer to indicate that we didn't even try to refresh due to the refresh span byte limit.
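One possible shape for that extra layer, using standard Go error wrapping (hypothetical names; the real KV retry errors are structured, so the actual change would look different):

```go
package errsketch

import (
	"errors"
	"fmt"
)

// errRefreshSpansDropped marks retries where no refresh was even attempted
// because the txn had exceeded its refresh span byte limit.
var errRefreshSpansDropped = errors.New(
	"refresh not attempted: txn exceeded the refresh span byte limit")

// wrapRetryErr adds the extra layer on top of the usual retry error.
func wrapRetryErr(retryErr error) error {
	return fmt.Errorf("%w: %v", errRefreshSpansDropped, retryErr)
}

// isRefreshSpanLimitRetry lets callers (or error messages shown to users)
// distinguish this case from an ordinary contention-driven retry.
func isRefreshSpanLimitRetry(err error) bool {
	return errors.Is(err, errRefreshSpansDropped)
}
```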
I don't think we've ever discussed a mechanism which always avoids pushing long-running txns. We have talked about detecting the pushes and then backing off the closed timestamp (#36478). The problem with this is that it makes an already best-effort mechanism much less predictable. There is talk about even more dramatically reducing the closed timestamp target. Generally I'm opposed to ideas which prevent pushing of transactions. The refresh mechanism has become something of a cornerstone of our transaction protocol.
It's a retry error, but not the usual one. We've been very vocal in docs, support, etc. that retry errors are an artifact of contention. That's the usual case. This one here happens without any contention whatsoever. It's misleading to bin it in the same conceptual category as our usual retry errors. I'm not sure I would suggest changing the type of the error object, but we should absolutely clarify that it's not the run-of-the-mill retry error.
In terms of the cause of the error, I agree that it differs from contention caused by a read. In terms of implementation and impact, it is identical. It is absolutely the case that in general it should be treated as a retry error. I think all I'm saying is that there are two different interactions which are both worth differentiating.
CDC emits rows with a
I've extracted being smarter about the refresh span tracking into #46095, if y'all don't mind.
To clarify for a future reader: the remaining work item on this issue, AFAICT, is to propagate a different, clearer error when a query is forced to retry due to the closed timestamp rather than the timestamp cache, and then to provide documentation to help the user better understand the source of the restart and the available remedies.
One thing I'm a bit confused about, reading through zendesk#4611 and this thread.
Which one is it? I'm assuming the latter? Which brings me to my next question: now that in 20.1 the closed timestamp target duration is down to 3s, does that mean that all read-write transactions taking over 3s will be forced to refresh?
Yes, the closed timestamp subsystem doesn't have any understanding of refresh spans.
Yes, all read-write transactions taking over 3s will be forced to refresh.
That feels super aggressive, though. If there was any client interaction, does that mean there will be a retry error?
Only if new writes are issued and then the reads cannot be refreshed successfully. If the refresh fails, it generally indicates that there was some contention.
#46095 will likely mitigate a large number of refresh failures due to refresh span size.
I read this as "the behavior under contention will produce more retry errors than it used to in the past". This sounds like a regression from a UX perspective.
We have marked this issue as stale because it has been inactive for |
@nvanbenschoten I think we fixed this, right? |
Yes, we addressed the parts of this that seem actionable. I'll close for now, but we can open more specific issues if this kind of behavior remains problematic. |
Found by user:
There are three separate issues here:
1. We want a larger default for max_refresh_span_bytes so that the scenario becomes less likely. This is predicated on better memory tracking in KV, a separate work item (planned for 20.1, see the work @tbg has started on [dnm] kv: expose (and use) byte batch response size limit #44341). I think this is orthogonal and should be kept out of scope here.
2. When the scenario happens, we want the error message to be clearer about what needs to happen: either decrease the duration of the txn, decrease its number of refresh spans (fewer reads/writes), increase max_refresh_span_bytes, or increase the closed ts delay.
3. Or we could avoid the situation entirely: make the closed ts lag behind the long-running txn if it has disabled refresh span collection (a sketch of this idea follows below).
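A purely conceptual sketch of that third option, assuming the closed timestamp target could consult the oldest txn that has given up on refresh span tracking (this is not how the closed timestamp subsystem is actually wired up):

```go
package lagsketch

import "time"

// closedTimestampTarget computes where to close out history: normally a fixed
// target duration behind now, but never past the start of the oldest active
// txn that can no longer refresh (because it abandoned refresh span tracking).
func closedTimestampTarget(now time.Time, target time.Duration,
	oldestUnrefreshableTxnStart *time.Time) time.Time {
	candidate := now.Add(-target)
	if oldestUnrefreshableTxnStart != nil && oldestUnrefreshableTxnStart.Before(candidate) {
		// Hold the closed timestamp back so the long-running txn isn't pushed.
		return *oldestUnrefreshableTxnStart
	}
	return candidate
}
```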
cc @ajwerner @tbg for triage.
Jira issue: CRDB-5215