kv: avoid immediately launching goroutine for txn heartbeat loop #35009

nvanbenschoten · 2019-02-15T20:23:28Z

We currently launch a goroutine immediately for any transaction that doesn't send an EndTransaction in the same batch as its first write. This is a little crazy because very few transactions will ever actually run long enough to send even their first heartbeat after 1s. As an non-invasive way to improve this, we could introduce a new Stopper.RunDelayedAsyncTask method which waits for some specified period of time (in a single goroutine) before launching the async task. While waiting, the goroutine coordinating this could be monitoring context cancellation to remove queued async tasks before they fire off their own goroutine.

Using this approach, we could trigger a delayed async task in the heartbeater when we see the first write and delay the goroutine for an entire heartbeat interval, only paying the cost of the goroutine for transactions that will send at least one heartbeat.

An alternative to this is to create a single goroutine that handles transmitting all transaction heartbeats, but as has been discussed before, this will require protocol changes and migrations. If we're planning on taking something that large on, we should first consider alternative means of tracking transaction liveness, as we discuss here:

cockroach/pkg/kv/txn_interceptor_heartbeater.go

Lines 50 to 59 in b662cb6

    
           // NOTE: there are other mechanisms by which concurrent actors could determine 
        
           // the liveness of transactions. One proposal is to have concurrent actors 
        
           // communicate directly with transaction coordinators themselves. This would 
        
           // avoid the need for transaction heartbeats and the PENDING transaction state 
        
           // entirely. Another proposal is to detect abandoned transactions and failed 
        
           // coordinators at an entirely different level - by maintaining a node health 
        
           // plane. This would function under the idea that if the node a transaction's 
        
           // coordinator is running on is alive then that transaction is still in-progress 
        
           // unless it specifies otherwise. These are both approaches we could consider in 
        
           // the future.

Interestingly, there's already another case that would benefit from this in the txnPipeliner. We've always wanted to add a background job that periodically "proves" in-flight writes. Using a Stopper.RunDelayedAsyncTask method would allow us to add in this job without seeing a performance hit for short-running transactions. This would help us stay beneath the max inflight-write bytes limit we'll need to add for #32522.

cc. @ajwerner. I expect that the internal mechanism here will look very similar to what you added in requestbatcher.

Jira issue: CRDB-4614

The text was updated successfully, but these errors were encountered:

ajwerner · 2019-02-15T20:39:25Z

Does the txnCoordSender have a defined lifecycle? The reason I ask is about when cleanup happens for the DelayedAsyncTasks. It is true that we could do it lazily based on context cancellation but it might be nice to have a mechanism to remove tasks from the delay queue eagerly.

ajwerner · 2019-02-15T20:51:26Z

Ignore the above comment, I found what I was looking for

nvanbenschoten · 2019-02-15T20:56:55Z

Yeah, I was thinking we would remove txnEnd and instead replace it with a context cancellation function tied to the existing hbCtx.

Fixes cockroachdb#32522. This change creates a new cluster setting called: ``` kv.transaction.write_pipelining_max_outstanding_size ``` It limits the size in bytes that can be dedicated to tracking in-flight writes that have been pipelined. Once this limit is hit, not more writes will be pipelined by a transaction until some of the writes are proven and removed from the outstanding write set. This change once again illustrates the need for periodic proving of outstanding writes. We touch upon that in the type definition's comment and in cockroachdb#35009. Release note: None

Before this PR every transaction would start a goroutine for its heartbeat loop immediately upon creation. Most transactions never need to heartbeat so this goroutine is wasteful. This change delays the start of the heartbeat loop until after the first heartbeatInterval has passed by using the new DelayedStopper. Fixes cockroachdb#35009 Release note: None

Fixes cockroachdb#32522. This change creates a new cluster setting called: ``` kv.transaction.write_pipelining_max_outstanding_size ``` It limits the size in bytes that can be dedicated to tracking in-flight writes that have been pipelined. Once this limit is hit, not more writes will be pipelined by a transaction until some of the writes are proven and removed from the outstanding write set. This change once again illustrates the need for periodic proving of outstanding writes. We touch upon that in the type definition's comment and in cockroachdb#35009. Release note: None

35014: kv: introduce new "max outstanding size" setting to txnPipeliner r=nvanbenschoten a=nvanbenschoten Fixes #32522. This change creates a new cluster setting called: ``` kv.transaction.write_pipelining_max_outstanding_size ``` It limits the size in bytes that can be dedicated to tracking in-flight writes that have been pipelined. Once this limit is hit, not more writes will be pipelined by a transaction until some of the writes are proven and removed from the outstanding write set. This change once again illustrates the need for periodic proving of outstanding writes. We touch upon that in the type definition's comment and in #35009. Release note: None 35199: log: fix remaining misuse of runtime.Callers/runtime.FuncForPC r=nvanbenschoten a=nvanbenschoten Fixes #17770. This commit fixes the last user of `runtime.Callers` that misused the stdlib function by translating the PC values it returned directly into symbolic information (see https://golang.org/pkg/runtime/#Callers) [1]. Go's documentation warns that this is a recipe for disaster when mixed with mid-stack inlining. The other concern in #17770 was this comment: #17770 (comment). This was discussed in golang/go#29582 and addressed in golang/go@956879d. An alternative would be to use `runtime.Caller` here, but that would force an allocation that was hard-earned in #29017. Instead, this commit avoids any performance hit. ``` name old time/op new time/op delta Header-4 267ns ± 1% 268ns ± 0% ~ (p=0.584 n=10+10) VDepthWithVModule-4 260ns ± 3% 255ns ± 1% -1.87% (p=0.018 n=10+9) name old alloc/op new alloc/op delta Header-4 0.00B 0.00B ~ (all equal) VDepthWithVModule-4 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta Header-4 0.00 0.00 ~ (all equal) VDepthWithVModule-4 0.00 0.00 ~ (all equal) ``` [1] I went through and verified that this was still correct. Release note: None 35203: closedts: react faster to config changes r=danhhz a=tbg This approximately halves the duration of `make test PKG=./pkg/ccl/changefeedccl TESTS=/sinkless`, from ~60s to ~30s. Touches #34455. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Tobias Schottdorf <[email protected]>

github-actions · 2021-06-05T02:14:36Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

knz · 2021-06-05T09:22:37Z

still relevant

ajwerner · 2022-03-23T18:42:17Z

I'm taking this off my list.

nvanbenschoten · 2022-03-25T16:00:26Z

This was addressed by #67215.

nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. labels Feb 15, 2019

nvanbenschoten assigned ajwerner and nvanbenschoten Feb 15, 2019

nvanbenschoten mentioned this issue Feb 15, 2019

kv: introduce new "max outstanding size" setting to txnPipeliner #35014

Merged

ajwerner mentioned this issue Feb 15, 2019

kv: avoid immediately launching goroutine for txn heartbeat loop #35015

Closed

ajwerner mentioned this issue Feb 12, 2020

kv: make long running transactions less expensive (improve txn heartbeats) #45013

Open

github-actions bot added the no-issue-activity label Jun 5, 2021

knz removed the no-issue-activity label Jun 5, 2021

jlinder added the T-kv KV Team label Jun 16, 2021

ajwerner removed their assignment Mar 23, 2022

nvanbenschoten closed this as completed Mar 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: avoid immediately launching goroutine for txn heartbeat loop #35009

kv: avoid immediately launching goroutine for txn heartbeat loop #35009

nvanbenschoten commented Feb 15, 2019 •

edited by cockroach-jira-scripts

Loading

ajwerner commented Feb 15, 2019

ajwerner commented Feb 15, 2019

nvanbenschoten commented Feb 15, 2019 •

edited

Loading

github-actions bot commented Jun 5, 2021

knz commented Jun 5, 2021

ajwerner commented Mar 23, 2022

nvanbenschoten commented Mar 25, 2022

kv: avoid immediately launching goroutine for txn heartbeat loop #35009

kv: avoid immediately launching goroutine for txn heartbeat loop #35009

Comments

nvanbenschoten commented Feb 15, 2019 • edited by cockroach-jira-scripts Loading

ajwerner commented Feb 15, 2019

ajwerner commented Feb 15, 2019

nvanbenschoten commented Feb 15, 2019 • edited Loading

github-actions bot commented Jun 5, 2021

knz commented Jun 5, 2021

ajwerner commented Mar 23, 2022

nvanbenschoten commented Mar 25, 2022

nvanbenschoten commented Feb 15, 2019 •

edited by cockroach-jira-scripts

Loading

nvanbenschoten commented Feb 15, 2019 •

edited

Loading