kvserver: avoid or at least efficiently handle intent buildup #60585
Comments
FWIW I was playing around with

```sql
create table foo(v string);
insert into foo values('a'), ('b'), ('c'), ('d'), ('e');
set statement_timeout='50s';
insert into foo select * from foo; -- repeat until stmt_timeout hit
select now(), max(crdb_internal.range_stats(start_key)->'intent_count') from crdb_internal.ranges_no_leases;
```

The last insert leaves around ~600k intents, and despite hitting the statement timeout they clear out within a few minutes (at a couple thousand resolved a second).
Have we checked this metric? Also, the GC path is going to hit the semaphore limiter. Is it possible that the gc queue somehow isn't resolving intents because of this? cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 602 to 608 in 23263a3
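To make the semaphore concern above concrete, here is a minimal, self-contained sketch of the general pattern (hypothetical names and capacity, not the actual intent resolver code): async cleanup must grab a slot from a bounded semaphore, and when no slot is available the work falls back to the caller's context, so a short-lived caller can cause cleanup to be abandoned.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sem is a bounded semaphore limiting concurrent async cleanup tasks, loosely
// analogous to the limiter referenced above (the capacity here is made up).
var sem = make(chan struct{}, 2)

// cleanupIntentsAsync tries to run cleanup in a detached goroutine. If no
// semaphore slot is available, it falls back to running synchronously on the
// caller's context, so a short caller timeout can abort the cleanup entirely.
func cleanupIntentsAsync(ctx context.Context, cleanup func(context.Context) error) error {
	select {
	case sem <- struct{}{}:
		go func() {
			defer func() { <-sem }()
			// Detached from the caller: use a background-derived context.
			_ = cleanup(context.Background())
		}()
		return nil
	default:
		// Semaphore full: stay chained to the caller's (possibly short-lived) context.
		return cleanup(ctx)
	}
}

func main() {
	// Pretend two cleanup tasks are already in flight, exhausting the semaphore.
	sem <- struct{}{}
	sem <- struct{}{}

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := cleanupIntentsAsync(ctx, func(c context.Context) error {
		select {
		case <-time.After(time.Second): // resolving many intents is slow
			return nil
		case <-c.Done():
			return c.Err() // cleanup abandoned when the caller's context expires
		}
	})
	fmt.Println("cleanup result:", err) // context deadline exceeded
}
```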
Two Qs from your friendly SREs:
Relates to cockroachdb#60585. This commit updates the GC queue's `process` method to allow GC even if it determines that doing so would not advance the GC threshold. This check is retained in the queue's `shouldQueue` method, so this change does not impact automated queueing decisions. However, it does ensure that if someone manually enqueues a range in the GC queue with the `SkipShouldQueue` option, then the range is actually processed. This is important because the GC threshold is the first thing that the GC bumps when processing, meaning that the GC threshold alone is not a definitive indication that there is no more work to perform on a range. A processing pass that will not bump the threshold can still be useful, especially with the long context timeout associated with manual enqueuing. Release note (ui change): Manually enqueueing a range in the GC queue now properly respects the SkipShouldQueue option. This can be useful to force the GC of a specific Range.
I'd test whether we broke something in the cleanup path for dropped pgwire connections. I think a rollback for any ongoing txn is supposed to happen courtesy of this code, but maybe something rotted. cockroach/pkg/sql/conn_executor.go Line 826 in 4086c3e
Cleaning up a lot of abandoned locks can be expensive... My understanding is that most reads should be smart in cleaning up many locks at once (but notably the
Hmm. We'll be on the lookout for any prod issues that can be tied back to this issue. Thanks for the info!
This is the one that raises alarms for me. If there's no backpressure it's easy for things to get out of control (I believe #5219 was the one that introduced backpressure here and it was motivated by the real-world problem in #4925). That's especially true when network latency is high (which is the case for at least one customer experiencing this issue).
I think its roots are even older, in that same PR #5219. While it wasn't discussed as much in the linked issues, I don't think it was premature optimization; we were seeing thundering herd problems where many readers would try to push/resolve the same recently-committed transaction.
But at present, there are only two callers to this lock: the GC queue and the txn's own transaction record cleanup intent. The thundering herd wouldn't hit this path. The thundering herd goes here and has its own work avoidance (which uses the same kind of mechanism)
Either way, this certainly isn't making anything worse.
I did the same experiment last night and saw the same thing. The only way I was able to get abandoned intents was to restart the cluster while async intent resolution was underway. Given that standard async intent resolution does take minutes to clean up about 2 million intents and we've seen instances with many more abandoned intents, this may not be such a wild theory. We're very quick to roll nodes these days when other problems occur. So maybe we're just hitting this issue due to poorly timed restarts and #59704. Do we have an example instance of this issue that we could look at the historical timeseries data for to confirm or deny this? cc. @joshimhoff.

Regardless, we should also keep looking for other explanations. Something else that strikes me is that the GC queue hasn't been particularly helpful for these issues. Is this because the intents weren't 2 hours old, or because the GC queue is timing out before it can resolve the intents? If it's the latter, we should explore dynamically setting the GC queue's
In the most recent example I looked at, the GC queue ran to completion just fine, and did not consider any intents, meaning they must've been less than 2h old. But that customer also had just run a high-priority select, which either cancelled or pushed all of the intents, so it wasn't too surprising. What was surprising in that case was that there were still 50k+ intents on the range after the SELECT, which made me think that they were actively being created. This may not be the common case we're looking at here.
This makes sense to me - if we're really leaking intents mostly as a function of node restarts (tbd), then, and even independently of that, it makes sense that the GC queue would see it as its job to nuke them. We just have to make sure it's done in a way that avoids busy loops in pathological cases where a large number of (non-abandoned) intents is always present on the table. If GC is bad at removing lots of intents, the timeout func could help, or even a dynamic timeout that is based on making progress rather than a fixed timeout, which is hard to compute in this case.
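To illustrate the "dynamic timeout based on making progress" idea, a hedged sketch (made-up helper, not existing CockroachDB code): cancel the GC attempt only after a full idle window with no reported progress, instead of after a fixed wall-clock budget.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// runWithProgressTimeout cancels work only if no progress has been observed
// for idle, instead of imposing a fixed overall deadline. The work function
// reports progress by incrementing the counter (e.g. per batch of intents
// resolved). Names and thresholds here are illustrative.
func runWithProgressTimeout(parent context.Context, idle time.Duration,
	work func(ctx context.Context, progress *atomic.Int64) error) error {

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	var progress atomic.Int64
	done := make(chan error, 1)
	go func() { done <- work(ctx, &progress) }()

	last := progress.Load()
	ticker := time.NewTicker(idle)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			cur := progress.Load()
			if cur == last {
				cancel() // no progress for a full idle window: give up
				return <-done
			}
			last = cur
		}
	}
}

func main() {
	err := runWithProgressTimeout(context.Background(), 100*time.Millisecond,
		func(ctx context.Context, progress *atomic.Int64) error {
			for i := 0; i < 5; i++ { // pretend to resolve 5 batches of intents
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(30 * time.Millisecond):
					progress.Add(1)
				}
			}
			return nil
		})
	fmt.Println("gc attempt:", err)
}
```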
FWIW I just did a test on a local roachprod cluster with

As a short-term mitigation, what do you think about me making
Yes. One of the CC free-tier clusters hit this issue, leading to issues with

If it helps, consider that we can backport additional observability into CC & quickly get it on all the 20.2 clusters. If we had metrics that tracked the number of intents in the "largest" ranges by number of intents, we could look actively for such clusters even ahead of impact (in order to help get a root cause).
On the free-tier cluster, no restarts for last 2-3 weeks, @nvanbenschoten: The (The little blip where there appears to be no nodes is likely to be a monitoring glitch.)
SGTM
60541: opt: inverted-index accelerate filters of the form j->'a' = '{"b": "c"}' r=angelazxu a=angelazxu

Previously, the optimizer did not plan inverted index scans for queries with the JSON fetch value operator `->` when an object or array was on the right side of the equality operator `=`, only when it was a boolean, string, number, or null. This change allows the inverted index to be used in these types of queries. It supports objects with multiple key/value pairs, nested objects, arrays, arrays nested within objects, and objects nested within arrays.

Fixes: #59605

Release note: None

60619: kv: allow manual GC that does not advance threshold r=nvanbenschoten a=nvanbenschoten

Relates to #60585. This commit updates the GC queue's `process` method to allow GC even if it determines that doing so would not advance the GC threshold. This check is retained in the queue's `shouldQueue` method, so this change does not impact automated queueing decisions. However, it does ensure that if someone manually enqueues a range in the GC queue with the `SkipShouldQueue` option, then the range is actually processed. This is important because the GC threshold is the first thing that the GC bumps when processing, meaning that the GC threshold alone is not a definitive indication that there is no more work to perform on a range. A processing pass that will not bump the threshold can still be useful, especially with the long context timeout associated with manual enqueuing.

Release note (ui change): Manually enqueueing a range in the GC queue now properly respects the SkipShouldQueue option. This can be useful to force the GC of a specific Range.

60638: sql: fix RENAME COLUMN for REGIONAL BY ROW r=ajstorm a=otan

We need to correct the column reference on the table descriptor if we are renaming the REGIONAL BY ROW column.

Resolves #59116

Release note: None

60698: vendor: bump pebble to 444296cf r=sumeerbhola a=sumeerbhola

444296cf Merge pull request #1066 from petermattis/pmattis/stability
59659388 update README with note about production readiness
a516e691 *: optimize SeekPrefixGE to use Next
327d2757 sstable: hoist blockSizes from testBytesIteratedWithCompression
69b09310 sstable: add benchmarks for zstd
cbe3a149 sstable: fix review comments
deb29d88 sstable: include zstd into the test cases
f13dea6f sstable: support zstd compression
73e8a3b0 vendor: added DataDog/zstd
ec81e4c4 sstable: add zstd-related constants

Release note: None

Co-authored-by: Angela Xu <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Relates to cockroachdb#60585. This commit updates the GC queue's `process` method to allow GC even if it determines that doing so would not advance the GC threshold. This check is retained in the queue's `shouldQueue` method, so this change does not impact automated queueing decisions. However, it does ensure that if someone manually enqueues a range in the GC queue with the `SkipShouldQueue` option, then the range is actually processed. This is important because the GC threshold is the first thing that the GC bumps when processing, meaning that the GC threshold alone is not a definitive indication that there is no more work to perform on a range. A processing pass that will not bump the threshold can still be useful, especially with the long context timeout associated with manual enqueuing. Release note (ui change): Manually enqueueing a range in the GC queue now properly respects the SkipShouldQueue option. This can be useful to force the GC of a specific Range.
I think I have a plausible explanation for the intent buildup. As mentioned earlier, when we roll back a txn and the client context is cancelled, we give the final Lines 756 to 758 in 0eba39a
But we didn't really think this should affect cleanup, because we kick off async tasks that are disconnected from the client's context as long as we haven't hit the async task quota: cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 418 to 422 in a4b7a43
But of course, if the 3-second context expires before it gets to the point where it can kick off the async task, then it isn't going to clean up anything at all since it gets aborted here: cockroach/pkg/kv/kvserver/replica_write.go Lines 291 to 297 in c81a2a3
I initially considered this fairly edge-casey, because I thought this only applied to clients submitting a

This means that every time a client disconnects during a txn or in-flight DML statement, the

This is kind of what #64770 is getting at, but a bit broader. I'll submit a PR tomorrow that tries harder to clean up in the face of client context cancellation.
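A heavily simplified sketch of the failure mode described in this comment (all names and the specific delays are illustrative, not the real kvserver code): the rollback runs under a short client-derived timeout, and intent cleanup survives only if the hand-off to a background-context async task happens before that timeout fires and before the async task quota is exhausted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rollbackTimeout mirrors the short budget given to the final rollback once
// the client has already disconnected (3s in the scenario above).
const rollbackTimeout = 3 * time.Second

var errQuotaExceeded = errors.New("async task quota exceeded")

// tryAsyncCleanup models the hand-off: if an async slot is available, the
// cleanup continues on a background-derived context and survives client
// cancellation; otherwise it stays chained to the caller's context.
func tryAsyncCleanup(ctx context.Context, haveQuota bool, cleanup func(context.Context)) error {
	if haveQuota {
		go cleanup(context.Background()) // detached: survives ctx expiry
		return nil
	}
	cleanup(ctx) // synchronous: dies with the short rollback context
	return errQuotaExceeded
}

func rollbackAbandonedTxn(haveQuota bool, raftDelay time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), rollbackTimeout)
	defer cancel()

	// The EndTxn(commit=false) itself has to get through evaluation and Raft
	// before we even reach the cleanup hand-off.
	select {
	case <-time.After(raftDelay):
	case <-ctx.Done():
		fmt.Println("rollback aborted before cleanup could start:", ctx.Err())
		return
	}
	err := tryAsyncCleanup(ctx, haveQuota, func(c context.Context) {
		// Intent resolution would happen here, bounded by c.
		fmt.Println("cleaning up intents, ctx err:", c.Err())
	})
	fmt.Println("hand-off result:", err)
}

func main() {
	rollbackAbandonedTxn(true, 10*time.Millisecond) // happy path: detached cleanup
	rollbackAbandonedTxn(true, 4*time.Second)       // EndTxn too slow: no cleanup at all
	time.Sleep(100 * time.Millisecond)
}
```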
I think I found the reason why we're leaking txn records as well: cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 757 to 762 in a4b7a43
We spin off a separate async task to clean up the txn record, but use the surrounding context. In cases where intent resolution is synchronous, and thus is connected to the caller's context, the call can return before
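A minimal sketch of that race (made-up names, not the real `IntentResolver`): the async transaction-record GC inherits the surrounding context, so when intent resolution is synchronous and the caller returns and cancels that context, the record GC is cancelled with it; detaching it onto a background-derived context with its own timeout avoids this.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// gcTxnRecordAsync models spinning off removal of the txn record. With
// detach=false it inherits ctx (the racy behavior); with detach=true it uses
// its own background-derived context with a timeout (the fix).
func gcTxnRecordAsync(ctx context.Context, detach bool, done chan<- error) {
	if detach {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(context.Background(), 20*time.Second)
		_ = cancel // a real implementation would cancel when the task finishes
	}
	go func() {
		select {
		case <-time.After(50 * time.Millisecond): // pretend to delete the record
			done <- nil
		case <-ctx.Done():
			done <- ctx.Err()
		}
	}()
}

func cleanupTxnIntentsSync(detach bool) error {
	// Synchronous intent resolution is tied to the caller's context...
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	gcTxnRecordAsync(ctx, detach, done)
	// ...and the caller returns (and cancels) before the record GC finishes.
	cancel()
	return <-done
}

func main() {
	fmt.Println("inherited ctx:", cleanupTxnIntentsSync(false)) // context canceled
	fmt.Println("detached ctx: ", cleanupTxnIntentsSync(true))  // <nil>
}
```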
Another thing to add here is that we don't try very hard to use the

There's probably something we could do better here. We can't rely on a finalized transaction record to accurately reflect all of the intents that the transaction wrote because the transaction may have expired without ever being finalized cleanly. However, in cases where a transaction is finalized cleanly and the record includes its intent spans, we could do more. For instance,

Of course, there are some practical considerations here. For instance, we'd want to avoid thundering herds of cleanup for the same transaction immediately after it is finalized. We may also want to avoid redundant work as best as we can in cases where a transaction's intents are partially cleaned up - we don't currently checkpoint cleanup progress anywhere. There's also some concern about GC-ing the record of a possibly active transaction, as is discussed in #53146.
I think this is probably the right approach in general. When we're running some other operation, and happen to run into intents, we want to keep latency as low as possible and just deal with the intents that we have to.
We might want a separate GC queue for this rather than doing best-effort ad hoc cleanup. Since the txn records are stored in the range-local key space we can presumably scan them efficiently and clean up any finalized txn records. How to finalize abandoned txns is a separate concern (haven't looked into how this is currently done yet).
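A rough sketch of what such a queue's core loop might look like, under heavy assumptions (the types below are stand-ins, not the real kvserver API): scan the range-local transaction records, resolve the recorded lock spans of finalized records, then delete the records, leaving still-pending (possibly abandoned) transactions to a separate push/finalization path.

```go
package main

import "fmt"

// Stand-in types; real code would use roachpb transaction records and spans.
type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

type txnRecord struct {
	id        string
	status    txnStatus
	lockSpans []string // key spans the txn recorded at commit/abort time
}

// gcFinalizedTxnRecords is the core loop a dedicated queue could run per
// range: resolve intents for finalized records, then remove the records.
// Abandoned (still-pending) records would need to be pushed/finalized first,
// which is the separate concern mentioned above.
func gcFinalizedTxnRecords(records []txnRecord,
	resolve func(txnID string, spans []string) error,
	deleteRecord func(txnID string) error) error {

	for _, rec := range records {
		if rec.status == pending {
			continue // not finalized; leave for the push/abandonment path
		}
		if err := resolve(rec.id, rec.lockSpans); err != nil {
			return err
		}
		if err := deleteRecord(rec.id); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	recs := []txnRecord{
		{id: "a", status: committed, lockSpans: []string{"/Table/52/1/1", "/Table/52/1/2"}},
		{id: "b", status: pending},
	}
	_ = gcFinalizedTxnRecords(recs,
		func(id string, spans []string) error { fmt.Println("resolve", id, spans); return nil },
		func(id string) error { fmt.Println("gc record", id); return nil },
	)
}
```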
64869: kvserver: improve intent cleanup for disconnected clients r=nvanbenschoten a=erikgrinaker

### kvserver: improve intent cleanup for disconnected clients

Whenever a client disconnects during an open transaction or in-flight DML statement, the server-side session will asynchronously clean up the transaction record and intents by rolling back the transaction. However, this had a very short timeout of 3 seconds, and even though the actual `IntentResolver.CleanupTxnIntentsAsync()` cleanup task runs asynchronously and independently of the client's context in the typical case, the short timeout could cause transaction cleanup to be cancelled if the `EndTxn` request took more than 3 seconds to get all the way through Raft execution or if the async task limit was exceeded such that the cleanup task kept using the client's context. This in turn could lead to intents building up over time.

This patch increases the timeout when rolling back transactions for disconnected clients to 1 minute, and also tries to perform transaction cleanup when a client disconnects while waiting for an `EndTxn` command to go through Raft.

Resolves #64770, touches #60585.

Release note (bug fix): improved transaction cleanup for disconnected clients, to reduce intent buildup.

### kvserver: fix race condition during synchronous txn record cleanup

Transaction records and intents are normally cleaned up asynchronously via `IntentResolver.CleanupTxnIntentsAsync()`, separately from the client's context. When the async task limit is exceeded, cleanup instead becomes synchronous and attached to the client context. However, the final `gcTxnRecord()` call to remove the transaction record is asynchronous even when intent cleanup is synchronous, to avoid holding onto the intent resolver task slot. This call will typically return to the caller before `gcTxnRecord()` completes, which may cause the caller to cancel the context (either via `defer cancel()` or a client disconnection) and in turn cancel the `gcTxnRecord()` call.

This patch gives the async `gcTxnRecord()` call a separate background context that's independent of the client's context even in the synchronous case, with a 20 second timeout to avoid goroutine leaks.

Resolves #64868, touches #60585.

Release note (bug fix): Fixed a race condition during transaction cleanup that could leave old transaction records behind until MVCC garbage collection.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker <[email protected]>
I believe we may have plugged the main intent leak with #65592, which is being merged to

Also, we've added a section to the
65001: kvserver: make GC intent scoring more aggressive r=tbg a=erikgrinaker

### keys: rename RangeLastGCKey to RangeGCThresholdKey

The key `RangeLastGCKey` was not actually used to store the last GC timestamp, but rather the last GC threshold. This commit renames the key to the more appropriate `RangeGCThresholdKey`.

Release note: None

### kvserver: increase GC queue timeout

The MVCC GC queue timeout was left at the default of 1 minute. Since the `GCThreshold` is advanced and persisted at the very start of the GC run (to coordinate with other processes such as exports and rangefeeds before removing any data), this means that if a GC run takes more than 1 minute to complete it will be aborted due to the timeout and may not run again for some time. This patch increases the timeout to 10 minutes. The primary purpose of the timeout is to avoid head-of-line blocking in the case of unavailable ranges, and a larger timeout still satisfies this property while also allowing more expensive cleanup attempts to complete.

Release note (ops change): increased the timeout for range MVCC garbage collection from 1 minute to 10 minutes, to allow larger jobs to run to completion.

### kvserver: use PUSH_TOUCH during GC to push intent owners

Using `PUSH_ABORT` to push intent owners during intent GC can cause it to block on (or possibly abort) long-running transactions. Since we would prefer GC to continue with other intents rather than waiting for active transactions, this patch changes the GC queue to use `PUSH_TOUCH` instead. This will still clean up abandoned txns.

Resolves #65777.

Release note (bug fix): Intent garbage collection no longer waits for or aborts running transactions.

### kvserver: add GC queue cooldown timer

When the GC queue considered replicas for MVCC GC, it primarily considered the previous GC threshold and the amount of MVCC garbage that would be removed by advancing the threshold. We would like to also trigger GC based solely on intent stats. However, we can't know whether we'll actually be able to clean up any intents, since they may belong to an in-progress long-running transaction. There is therefore a risk that we will keep spinning on GC attempts due to high intent counts. This patch records the last GC queue processing timestamp, and adds a cooldown timer of 2 hours between each range GC attempt triggered by intent score alone to prevent spinning on GC attempts.

Release note: None

### kvserver: make GC intent scoring more aggressive

Users often experience buildup of intents due to the GC queue not being aggressive enough in cleaning them up. Currently, the GC queue will only trigger based on intents if the average intent age is 10 days and if there is also MVCC garbage that can be cleaned up. This patch makes the intent scoring more aggressive, triggering GC when the average intent age is 8 hours regardless of other MVCC garbage. The previous commit added a cooldown timer to prevent the GC queue from spinning on a replica if the intents couldn't be cleaned up (e.g. because they belong to an in-progress long-running transaction).

Resolves #64266. Touches #60585.

Release note (ops change): trigger MVCC and intent garbage collection when the average intent age is 8 hours, down from 10 days.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker <[email protected]>
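To make the new scoring concrete, a hedged sketch of the decision described above (thresholds taken from the PR text, everything else simplified and hypothetical): trigger GC on intent stats alone once the average intent age exceeds 8 hours, but only if at least 2 hours have passed since the last intent-triggered attempt.

```go
package main

import (
	"fmt"
	"time"
)

const (
	intentAgeThreshold = 8 * time.Hour // avg intent age that triggers GC (was 10 days)
	intentCooldown     = 2 * time.Hour // minimum gap between intent-triggered GC attempts
)

// shouldQueueForIntents reflects the scoring change described above in a
// simplified form: trigger on intent age alone, guarded by a cooldown so we
// don't spin on ranges whose intents belong to long-running transactions.
func shouldQueueForIntents(avgIntentAge time.Duration, lastIntentGC, now time.Time) bool {
	if avgIntentAge < intentAgeThreshold {
		return false
	}
	return now.Sub(lastIntentGC) >= intentCooldown
}

func main() {
	now := time.Now()
	fmt.Println(shouldQueueForIntents(12*time.Hour, now.Add(-3*time.Hour), now))    // true
	fmt.Println(shouldQueueForIntents(12*time.Hour, now.Add(-30*time.Minute), now)) // false: cooling down
	fmt.Println(shouldQueueForIntents(time.Hour, time.Time{}, now))                 // false: intents too young
}
```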
We've now resolved the known issues around intent buildup and mitigation. The most important mitigation (better intent handling during backup export requests in #64131) has been backported, but the other changes carry more risk and we've decided not to backport these for now -- they are thus only available on
YAY! Thanks, Erik!
…uests

Fixes cockroachdb#50390. Related to cockroachdb#60585. Related to cockroachdb#103126. Closes cockroachdb#64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (defaultIntentResolutionBatchWait) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.

```
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH;
SELECT count(*) FROM keys;
```

Without this change, the scan took 70.078s. With this change, the scan took 15.958s. This 78% speed up checks out. Each encountered intent is resolved serially (see cockroachdb#103126), so the per-intent latency drops from 7ms to 1.6ms. This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after cockroachdb#102014, this is still not quite enough to resolve cockroachdb#103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in cockroachdb#49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in cockroachdb#49218, because the intents are each written by different transactions. The speedup for this scenario was presented in cockroachdb#64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.
103265: kv: resolve conflicting intents immediately for latency-sensitive requests r=nvanbenschoten a=nvanbenschoten

Fixes #50390. Related to #60585. Related to #103126. Closes #64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.

```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH;
SELECT count(*) FROM keys;
```

Without this change, the scan took **70.078s**. With this change, the scan took **15.958s**. This **78%** speed-up checks out. Each encountered intent is resolved serially (see #103126), so the **per-intent latency** drops from **7ms** to **1.6ms**. This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after #102014, this is still not quite enough to resolve #103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in #49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in #49218, because the intents are each written by different transactions. The speedup for this scenario was presented in #64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.
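A toy sketch of the trade-off the `sendImmediately` option addresses (simplified, not the real `IntentResolver`/`RequestBatcher` API): background callers enqueue resolution requests into a batcher that can wait up to ~10ms for more work, while a foreground waiter that sets the flag skips that wait entirely. The ~5ms per-intent improvement quoted above is roughly that batching delay no longer being paid by each serially resolved intent.

```go
package main

import (
	"fmt"
	"time"
)

const batchWait = 10 * time.Millisecond // cf. defaultIntentResolutionBatchWait in the text

type resolveOptions struct {
	sendImmediately bool // trade throughput for latency (foreground waiters)
}

// resolveIntent models the two paths: immediate send vs. waiting for a batch
// to fill up (here simplified to always waiting the full batchWait).
func resolveIntent(key string, opts resolveOptions, send func(keys []string)) time.Duration {
	start := time.Now()
	if opts.sendImmediately {
		send([]string{key})
	} else {
		time.Sleep(batchWait) // batching-induced delay before the request goes out
		send([]string{key})
	}
	return time.Since(start)
}

func main() {
	send := func(keys []string) { time.Sleep(2 * time.Millisecond) } // pretend RPC latency
	fmt.Println("batched:  ", resolveIntent("/Table/60/1/1", resolveOptions{}, send))
	fmt.Println("immediate:", resolveIntent("/Table/60/1/1", resolveOptions{sendImmediately: true}, send))
}
```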
103362: sql: validate primary / secondary region localities at end of txn r=fqazi a=fqazi

Previously, if a database was restored with skip_localities, there was no way to modify this database to set the primary region since validation would kick in too early during the statement. This meant fixing the regions in a restored database was impossible if the primary region was no longer valid. To address this, this patch delays locality validation till the end of the transaction.

Fixes: #103290

Release note (bug fix): SET PRIMARY REGION and SET SECONDARY REGION did not validate transactionally, which could prevent cleaning up removed regions after a restore.

103373: concurrency: small refactors in preparation for reservation removal r=nvanbenschoten a=arulajmani

See individual commits for details.

Informs: #103361

103538: go.mod: bump Pebble to 6f2788660198, rework shared storage wrapper r=RaduBerinde a=RaduBerinde

6f278866 shared: improve interface for more efficient reading
9eb2c407 db: log events to testing.T in unit tests
f32e7dc6 db: add reserved Pebblev4 sstable format
5a6b91b8 objstorage: improve test and add read ahead test
2bc4319e objstorage: remove genericFileReadable
8143ffb9 objstorage: fix readaheadState initialization
06d08888 db: add Compact.Duration metric
e7213de0 db: add Uptime metric
e9005aed db: don't delete files during ingest application
222b43ec internal/arenaskl: fix Skiplist doc typo

Release note: None

Epic: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Describe the problem
We have seen numerous instances in which large numbers of intents (hundreds of thousands and more) were able to build up on a range. While this is always theoretically possible - intent cleanup is async and best-effort, so an ill-timed crash can temporarily leave the intents behind - it appears that this is happening in scenarios where it is unexpected.
See for example (internal link): https://github.com/cockroachlabs/support/issues/817
The issue often leads to stuck backups because of #59704 (which we should independently fix).
To Reproduce
Reproducing these issues is part of the problem. So far we only strongly suspect that there are issues, but there is no concrete clue.
Expected behavior
Intents should only be able to build up when
Additional data / screenshots
Recap of the intent resolution behavior:
Intents are generally resolved when EndTxn(commit=true) has finished applying. We will then hit this code path:
cockroach/pkg/kv/kvserver/replica_write.go
Lines 188 to 192 in 218a5a3
which calls into this:
cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go
Lines 545 to 559 in 23263a3
One possibility is that the intent resolver erroneously skips large batches of txn intents because it is already tracking some smaller amount of work on behalf of the same txn, in which case we'd somehow end up in the `!locked` code path. In practice, this seems very unlikely, since we only invoke this method from the GC queue (which doesn't touch txns younger than 1h) and the `EndTxn(commit=true)` proposal result itself, of which there is only ever one. This raises the question why `lockInflightTxnCleanup` even exists. I think it may either be premature optimization or related to the GC queue code path. The code hasn't changed much since 2017, i.e. it is very old.

I did check that the cleanup intents are properly scoped to a new background-derived context, i.e. we're not accidentally tying these cleanup attempts to the short-lived caller's context. We're also not introducing an overly aggressive timeout; to the best of my knowledge there isn't a timeout in this particular path at all (?), as `runAsyncTask` derives from `context.Background()` and we don't add a timeout anywhere.

Also, isn't this code wrong:

cockroach/pkg/internal/client/requestbatcher/batcher.go
Lines 355 to 366 in 8c5253b

It seems that it would erase `ba.sendDeadline` if `len(ba.Reqs) > 0`, `ba.sendDeadline > 0`, and `!rHasDeadline`. However, if anything, this would swallow a context cancellation rather than cause a premature one.

Another thing worth noting is that parallel commits prevents clients from seeing the backpressure. When `EndTxn(commit=true)` runs, it may feel backpressure. However, that request in parallel commits is completely disjoint from the SQL client as it is fired-and-forgotten here:

cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go
Lines 431 to 455 in ab0b87a
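To illustrate the last point, a toy sketch (hypothetical names, not the txn interceptor code): with parallel commits, the `EndTxn(commit=true)` that triggers intent resolution is fired asynchronously on a background-derived context, so any backpressure it feels never reaches the SQL client, which has already been acknowledged.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// makeTxnCommitExplicit models the fire-and-forget step: the client's commit
// has already been acknowledged, and the follow-up EndTxn (which triggers
// intent resolution and may feel backpressure) runs detached from the client.
func makeTxnCommitExplicit(endTxn func(context.Context) error) {
	go func() {
		ctx := context.Background() // disjoint from the SQL client's context
		if err := endTxn(ctx); err != nil {
			fmt.Println("async explicit commit failed:", err)
		}
	}()
}

func main() {
	start := time.Now()
	makeTxnCommitExplicit(func(ctx context.Context) error {
		time.Sleep(200 * time.Millisecond) // backpressure during intent resolution
		return nil
	})
	// The client observes the commit immediately, regardless of the async work.
	fmt.Println("client sees commit after", time.Since(start).Round(time.Millisecond))
	time.Sleep(300 * time.Millisecond)
}
```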
Environment:
Additional context
Once we have the separated lock table, it might be much cheaper to be reactive to the problem.
Epic: CRDB-2554