storage: adaptive closed timestamp target duration #36478
If this lets us reduce the closed timestamp target, it will reduce the latency between changefeed resolved timestamps and real time. It will also help with #34455.
I have been thinking about what it would take to allow lowering the closed timestamp target duration. I'm a bit fuzzy on what our "hard requirements" are. Do we want to serve real-time OLTP traffic using follower reads? That would require new technology and I'd say it's generally not reasonably achievable right now. Or do we just want to get the 40s down "most of the time", accepting that if there are long txns then that (or higher) is what it's going to be? Let's say it's the latter, and that it would be a dream to get it into the ~1s range.

My assumption is that we ultimately want long operations to finish without restarting "too often", but that we also want to make sure the closed timestamp isn't delayed for no reason. To that end, it seems that rather than doing a "blind backoff" as proposed in the prototype above, we could use some of the logic that's present in rangefeeds (see cockroach/pkg/storage/rangefeed/processor.go, line 253 at ef375ae).
Whenever a rangefeed is active, it keeps a priority queue (by age) of transactions, refcounted by the number of intents they have on the range. Either the refcount drops to zero or too much time passes and the txn gets pushed, according to the constants in cockroach/pkg/storage/rangefeed/processor.go, lines 27 to 33 at ef375ae.
This seems relevant to what we want to accomplish:
Each time we prepare to update the closed timestamp, we check the tracked transactions that were previously forced to restart due to closed timestamps (which we now track, see above). Any such transaction is likely (though not guaranteed) to return to this range wanting to write below the proposed next closed timestamp, so we don't close out the timestamp yet, under the provision that the txn will be "nudged" periodically to make sure its true status is reflected in our tracking. Note that we restrict this to "txns that have had to restart because of closed timestamps before" to avoid holding up the closed timestamp for transactions that run for a long time but never revisit their ranges (or which can restart via a refresh). We'd have to check whether this is the right thing to do.

Another thing: as we've seen most recently in #38668, we generally want to restart transactions eagerly if at the conn executor level we're still in the first statement (or rather, if nothing has been reported back to the client yet), where the restart can be hidden from the client. Doing so means that the transaction will touch its complete set of ranges only iteratively as it has to restart repeatedly, which means it could run into closed timestamps one after another and be forced to restart O(ranges accessed) times. This is actually less severe in practice since the closed timestamp is per-store, so you'd get at most O(stores accessed) restarts, but it's worth pointing out. Txns could delay their restart selectively when they've left the auto-retry window, for example.

In practice, the approach described boils down to not holding up the closed timestamp for a fresh txn the first time around. However, if the txn does get restarted as a result and comes back, we hold up all closed timestamps.

An orthogonal exploration (which I haven't thought about much) is whether it could make sense to move away from a single per-store closed timestamp to a set of per-store timestamps sharding the set of replicas.
Essentially, if we want to prevent a slow txn touching a single range from holding up new closed timestamps for the whole store, we have to be able to maintain multiple closed timestamps per store. cc @nvanbenschoten, curious what your take on this is.

PS: here's a minimal repro of an endlessly restarting txn
@tbg I'm not sure whether this directly responds to your proposal, but I was doing some thinking about how systems that offer snapshot isolation perform follower reads and avoid any timestamp-cache-like structure, and came to an interesting observation that's tangential to this discussion: a transaction that has written an intent should never need to consult the timestamp cache when moving that intent forward. It already knows that nothing could have read above the intent timestamp.

Taking this a step further, why does a transaction that has already written an intent need to consult the closed timestamp mechanism when moving that intent forward? An observer of the closed timestamps already needs to be aware of intents below the closed timestamp level (which could be committed at any moment), so why force re-writes of the intent all the way above the closed timestamp? I can't think of any reason to, other than that we may have defined the semantics around closed timestamps slightly incorrectly. However, neither of the observers of closed timestamps actually relies on these strict semantics of intent rewrites being pushed above the closed timestamp level. RangeFeed would be just fine without it, since it already tracks intents below the closed timestamp.

This implies that if a transaction has written all of its intents, it should never need more than a single retry to move them all forward and commit them. If it's not writing any new intents, it should never be forced to restart again. In other words, I don't think transactions should ever need to endlessly restart, no matter how aggressive the closed timestamp duration is.
That's a good observation; it seems like it would knock out large classes of restarts. Infinite restarts would still be possible if every restart of the txn wrote a new key (say a few txns are contending on a counter increment that they then use to insert an intent into some other table).
I don't know how useful it is to think about transactions that write completely new keys in each epoch as anything other than brand new transactions. It's actually kind of hard to come up with a reason why anyone would want to do this but still maintain a transaction retry loop. For instance, the example you gave doesn't actually fall into this situation, because the moment one of the txns lays an intent on the counter, it will still observe the same incremented value on each retry and still insert into the same secondary table.

I'm wondering if our desire to support arbitrary forms of non-determinism across transaction restarts is preventing us from coming to a better solution for the overwhelmingly common case. If we could make any assumption about "proper" uses of transaction retries, then I think a much cleaner solution could fall out of this problem and out of a lot of our retry handling.
This doesn't seem too unlikely to me. Consider updating a row with an indexed |
In cases where we detect this causing starvation, I think we should consider promoting "optimistic read locks" to pessimistic read locks along the lines of what's proposed in the |
I thought more about Nathan's suggestion above and it checks out. Additionally, there's no good reason for applying closed timestamps to txn record creation (I worried about txns that can never commit if they always run into the closed timestamp on their txn record).

Closed timestamps give us "immutable" MVCC snapshots (i.e. basically the replica data, restricted to reads below the closed timestamp), but they're not quite immutable, so I'll call them "closed". It's always fine for intents to show up in a closed snapshot, because intents basically force the reader to stop using the snapshot. It's also fine for an intent to turn into a committed value in a closed snapshot, as long as the committed value's timestamp is not lower than the intent's (this happens all the time as intents get resolved). Similarly, intents may change their value, or move forward in time. What's not fine is for a committed value to show up in a closed snapshot without an intent first (this could happen if writing an intent didn't force the commit timestamp above the closed timestamp).

What really happens when we place an intent is that we're picking up a lower bound on what the commit timestamp of the transaction could be, via the closed timestamp. As long as we funnel that back to the commit operation somehow, we're good, no matter which timestamp we give the intent. (Our current behavior, where we place it at the lower bound, is good because it reduces write ops during intent resolution; this is just a thought experiment.) Since we're free to turn intents into committed values as long as the committed value isn't in the past of the intent, as Nathan points out we don't have to pick up a new lower bound when we rewrite the intent - we can essentially ignore the closed timestamp then.
By the way, something very similar already happens when a transaction refreshes - it's not like the txn goes back to all of its intents to verify that they could be placed at their new timestamps while respecting the closed timestamp. And even the simple act of resolving an intent potentially creates this sort of transition well below the current closed timestamp. There's a lot of precedent for this behavior; we just never really figured out the rules here.

Now for the transaction record itself, there just isn't a reason to have it respect the local closed timestamp. I would like it to stop respecting it, because if you're running a global transaction in which the latency between gateway and leaseholder exceeds the closed timestamp target duration, any timestamp the gateway picks will be closed out by the time the transaction record write reaches the leaseholder.

One reason I had in the past for making transaction record writes obey the closed timestamp is that I thought it would allow us to "disaster recover" from a closed snapshot. Reasoning through my FUD around this, I've found it to be unrelated, though. A simple round of parallel commit can already give you a closed snapshot in which one intent has been resolved while the STAGING record and another intent are still there, from which the recovery protocol will happily declare the transaction aborted, leaving you with a dirty write (and this kind of problem gets really out of hand with a global transaction protocol Nathan and I have been thinking about). So for disaster recovery, we'd "just" lower our timestamp until we don't see any more intents. And besides, I don't think this kind of disaster recovery is very easy to build, nor is it more useful than continuous backup.
I don't think you meant this, but just to be clear, an intent can't show up in a previously immutable snapshot that did not contain it.
The transaction will only restart forever if the transaction record can't be pushed. Some of the proposals we have floated have used durable read locks to allow that to happen. In those cases, the final commit timestamp is the maximum timestamp that any read/write intent was written at.
I did mean this - in theory, we could decide that a transaction that's already operating at ts=100 could lay down its intents at ts=10, completely disregarding the closed timestamp. What can't happen is that it commits at 10; it needs to commit at least at the closed timestamps it ignored when first laying down the intents. What we do instead in CRDB today - lay down the intent at the "best known lower bound for the commit timestamp" (which takes into account the local closed timestamp) - is of course a much better choice, for two reasons: it automatically communicates the new lower bound to the transaction (by containing a pushed txn), and it makes it more likely that intent resolution won't have to rewrite the version row. These are practical reasons; we're not doing it for correctness.
Yes, but today's transaction protocol doesn't allow that, and I think we always want to have that as a fallback that works, without having to fine-tune the closed timestamp duration against the latencies we can expect for local transactions. It needs to be robust (what if the cluster is overloaded, or networks are slower than expected, etc. - we don't want to fall off a cliff), and not forcing the txn record above the closedTS is the way to make it robust.
I have an experimental branch which I'm probably not going to get to anytime soon, so I wanted to at least put the idea out there to see if there are any takers.
The closed timestamp target duration is currently set high (30s) because transactions running longer than the target duration tend to get disrupted. If we set, say, a target duration of 1s, a transaction that takes two seconds would never be able to finish (unless it got to refresh its way to a commit).
The idea in the WIP is to make the closed timestamp target duration receptive to when it forces a transaction to restart. For example, if the target duration is 1s normally, and a transaction is forced to restart because its timestamp is below the recent closed timestamp, this will cause the target duration to temporarily double (or whatever) to 2s. If this is not enough, it will double again, etc, until the "slow" transaction can commit.
It's completely unclear to me whether this can be productionized and how low we can comfortably set the closed timestamp after, but it seems worth looking at as a strategy to automatically converge to a closed timestamp duration that's "usually" not in the way of ongoing requests while at the same time allowing the occasional slow request to finish, too.
The heuristics in the WIP are not great, and it's not even using the right input data (we should pass `txn.OrigTimestamp` to `RequestBackoff`, if that is even good enough).