storage: adaptive closed timestamp target duration #36478
If this lets us reduce the closed timestamp target, it will reduce the latency between changefeed resolved timestamps and real time. It will also help with #34455.
I have been thinking about what it would take to allow lowering the closed timestamp target duration. I'm a bit fuzzy on what our "hard requirements" are. Do we want to serve real-time OLTP traffic using follower reads? That would require new technology and I'd say it's generally not reasonably achievable right now. Or do we just want to get the 40s down "most of the time", accepting that if there are long txns then that (or higher) is what it's going to be? Let's say it's the latter, and that it would be a dream to get it into the ~1s range.

My assumption is that we ultimately want long operations to finish without restarting "too often", but that we also want to make sure the closed timestamp isn't delayed for no reason. To that end, it seems that rather than doing a "blind backoff" as proposed in the prototype above, we could use some of the logic that's present in rangefeeds (see cockroach/pkg/storage/rangefeed/processor.go, line 253 at ef375ae).
Whenever a rangefeed is active, it keeps a priority queue (by age) of transactions, refcounted by the number of intents they have on the range. Either the refcount drops to zero or too much time passes and the txn gets pushed, according to the constants in cockroach/pkg/storage/rangefeed/processor.go, lines 27 to 33 at ef375ae.
This seems relevant to what we want to accomplish:
Each time we prepare to update the closed timestamp, we check the tracked transactions that were previously forced to restart due to closed timestamps (which we now track, see above). Any such transaction is likely (though not guaranteed) to return to this range wanting to write below the proposed next closed timestamp, so we don't close out the timestamp yet, under the provision that the txn will be "nudged" periodically to make sure its true status is reflected in our tracking. Note that we restrict this to "txns that have had to restart because of closed timestamps before" to avoid holding up the closed timestamp for transactions that run for a long time but never revisit their ranges (or which can restart via a refresh). We'd have to check whether this is the right thing to do.

Another thing: as we've seen most recently in #38668, we generally want to restart transactions eagerly if at the conn executor level we're still in the first statement (or rather, if nothing has been reported back to the client yet), where the restart can be hidden from the client. Doing so means that the transaction will touch its complete set of ranges only iteratively as it has to restart repeatedly, which means it could run into closed timestamps one after another and be forced to restart O(ranges accessed) times. This is actually less severe in practice since the closed timestamp is per-store, so you'd get at most O(stores accessed) restarts, but it's worth pointing out. Txns could delay their restart selectively when they've left the auto-retry window, for example.

In practice, the approach described boils down to not holding up the closed timestamp for a fresh txn the first time around. However, if the txn does get restarted as a result and comes back, we hold up all closed timestamps.

An orthogonal exploration (which I haven't thought about much) is whether it could make sense to move away from a single per-store closed timestamp to a set of per-store timestamps sharding the set of replicas.
Essentially, if we want to prevent a slow txn touching a single range from holding up new closed timestamps for the whole store, we have to be able to maintain multiple closed timestamps per store. cc @nvanbenschoten, curious what your take on this is.

PS: here's a minimal repro of an endlessly restarting txn
@tbg I'm not sure whether this directly responds to your proposal, but I was doing some thinking about how systems that offer snapshot isolation perform follower reads and avoid any timestamp-cache-like structure, and came to an interesting observation that's tangential to this discussion: a transaction that has written an intent should never need to consult the timestamp cache when moving that intent forward. It already knows that nothing could have read above the intent timestamp.

Taking this a step further, why does a transaction that has already written an intent need to consult the closed timestamp mechanism when moving that intent forward? An observer of the closed timestamps already needs to be aware of intents below the closed timestamp level (which could be committed at any moment), so why force re-writes of the intent all the way above the closed timestamp? I can't think of any reason to, other than that we may have defined the semantics around closed timestamps slightly incorrectly. However, neither of the observers of closed timestamps actually relies on these strict semantics of intent rewrites being pushed above the closed timestamp level. RangeFeed would be just fine without it, since it already tracks intents below the closed timestamp.

This implies that if a transaction has written all of its intents, it should never need more than a single retry to move them all forward and commit them. If it's not writing any new intents, it should never be forced to restart again. In other words, I don't think transactions should ever need to endlessly restart, no matter how aggressive the closed timestamp duration is.
That's a good observation; it seems like it would knock out large classes of restarts. Infinite restarts would still be possible if every restart of the txn wrote a new key (say a few txns are contending on a counter increment that they then use to insert an intent into some other table).
I don't know how useful it is to think about transactions that write completely new keys in each epoch as anything other than brand new transactions. It's actually kind of hard to come up with a reason why anyone would want to do this but still maintain a transaction retry loop. For instance, the example you gave doesn't actually fall into this situation, because the moment one of the txns lays an intent on the counter, it will still observe the same incremented value on each retry and still insert into the same secondary table.

I'm wondering if our desire to support arbitrary forms of non-determinism across transaction restarts is preventing us from coming to a better solution for the overwhelmingly common case. If we could make any assumption about "proper" uses of transaction retries, then I think a much cleaner solution could fall out of this problem and out of a lot of our retry handling.
This doesn't seem too unlikely to me. Consider updating a row with an indexed |
In cases where we detect this causing starvation, I think we should consider promoting "optimistic read locks" to pessimistic read locks along the lines of what's proposed in the |
I thought more about Nathan's suggestion above and it checks out. Additionally, there's no good reason for applying closed timestamps to txn record creation (I worried about txns that can never commit if they always run into the closed timestamp on their txn record).

Closed timestamps give us "immutable" MVCC snapshots (i.e. basically the replica data, restricted to reads below the closed timestamp), but they're not quite immutable, so I'll call them "closed". It's always fine for intents to show up in a closed snapshot, because intents basically force the reader to stop using the snapshot. It's also fine for an intent to turn into a committed value in a closed snapshot, as long as the committed value's timestamp is not lower than the intent's (this happens all the time as intents get resolved). Similarly, intents may change their value, or move forward in time. What's not fine is for a committed value to show up in a closed snapshot without an intent first (this could happen if writing an intent didn't force the commit timestamp above the closed timestamp).

What really happens when we place an intent is that we're picking up a lower bound on what the commit timestamp of the transaction could be, via the closed timestamp. As long as we funnel that back to the commit operation somehow, we're good, no matter which timestamp we give the intent. (Our current behavior, where we place it at the lower bound, is good because it reduces write ops during intent resolution; this is just a thought experiment.) Since we're free to turn intents into committed values as long as the committed value isn't in the past of the intent, as Nathan points out we don't have to pick up a new lower bound when we rewrite the intent - we can essentially ignore the closed timestamp then.
By the way, something very similar already happens when a transaction refreshes - it's not like the txn goes back to all of its intents to verify that they could be placed at their new timestamps while respecting the closed timestamp. And even the simple act of resolving an intent potentially creates this sort of transition well below the current closed timestamp. There's a lot of precedent for this behavior; we just never really figured out the rules here.

Now for the transaction record itself, there just isn't a reason to have it respect the local closed timestamp. I would like it to stop respecting it, because if you're running a global transaction in which the latency between gateway and leaseholder exceeds the closed timestamp target duration, any timestamp the gateway picks will be closed out by the time the transaction record write reaches the leaseholder.

One reason I had in the past for making transaction record writes obey the closed timestamp is that I thought it would allow us to "disaster recover" from a closed snapshot. Reasoning through my FUD around this, I've found it to be unrelated, though. A simple round of parallel commit can already give you a closed snapshot in which one intent has been resolved while the STAGING record and another intent are still there, from which the recovery protocol will happily declare the transaction aborted, leaving you with a dirty write (and this kind of problem gets really out of hand with a global transaction protocol Nathan and I have been thinking about). So for disaster recovery, we'd "just" lower our timestamp until we don't see any more intents. And besides, I don't think this kind of disaster recovery is very easy to build, nor is it more useful than continuous backup.
I don't think you meant this, but just to be clear, an intent can't show up in a previously immutable snapshot that did not contain it.
The transaction will only restart forever if the transaction record can't be pushed. Some of the proposals we have floated have used durable read locks to allow that to happen. In those cases, the final commit timestamp is the maximum timestamp that any read/write intent was written at.
I did mean this - in theory, we could decide that a transaction that's already operating at ts=100 could lay down its intents at ts=10, completely disregarding the closed timestamp. What can't happen is that it commits at 10; it needs to commit at least at the closed timestamps it ignored when first laying down the intents. What we do instead in CRDB today - lay down the intent at the "best known lower bound for the commit timestamp" (which takes into account the local closed timestamp) - is of course a much better choice, for two reasons: it automatically communicates the new lower bound to the transaction (by containing a pushed txn), and it makes it more likely that intent resolution won't have to rewrite the version row. These are practical reasons; we're not doing it for correctness.
Yes, but today's transaction protocol doesn't allow that, and I think we always want to have that as a fallback that works, without having to fine-tune the closed timestamp duration against the latencies we can expect for local transactions. It needs to be robust (what if the cluster is overloaded, or networks are slower than expected, etc. - we don't want to fall off a cliff), and not forcing the txn record above the closedTS is the way to make it robust.
I have an experimental branch which I'm probably not going to get to anytime soon, so I wanted to at least put the idea out there to see if there are any takers.
The closed timestamp target duration is currently set high (30s) because transactions running longer than the target duration tend to get disrupted. If we set, say, a target duration of 1s, a transaction that takes two seconds would never be able to finish (unless it got to refresh its way to a commit).
The idea in the WIP is to make the closed timestamp target duration receptive to when it forces a transaction to restart. For example, if the target duration is 1s normally, and a transaction is forced to restart because its timestamp is below the recent closed timestamp, this will cause the target duration to temporarily double (or whatever) to 2s. If this is not enough, it will double again, etc, until the "slow" transaction can commit.
It's completely unclear to me whether this can be productionized and how low we can comfortably set the closed timestamp after, but it seems worth looking at as a strategy to automatically converge to a closed timestamp duration that's "usually" not in the way of ongoing requests while at the same time allowing the occasional slow request to finish, too.
The heuristics in the WIP are not great, and it's not even using the right input data (we should pass `txn.OrigTimestamp` to `RequestBackoff`, if that is even good enough).