
storage: ExportRequest poorly leveraged to clean up abandoned intents #59704

Closed
nvanbenschoten opened this issue Feb 2, 2021 · 18 comments · Fixed by #64131
Assignees: aayushshah15
Labels: A-disaster-recovery, A-kv-transactions, A-storage, C-enhancement, docs-done, docs-known-limitation, T-kv

Comments

@nvanbenschoten (Member) commented Feb 2, 2021

Currently, an ExportRequest will take an extremely long time to resolve a collection of abandoned intents. This can make it look like the ExportRequest is stuck somewhere in KV. The reason for this (or at least part of it) is that ExportMVCCToSst currently returns only a single intent when it returns a WriteIntentError. This differs from a request like ScanRequest, which collects a set of intents in a single WriteIntentError, allowing higher levels to process all of them at once.

We should improve ExportRequest to collect multiple intents in a single WriteIntentError.
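To make the shape of that change concrete, here is a self-contained toy sketch (all types and names below are made up for illustration; this is not the CockroachDB code): the scan records each intent it passes, up to a cap, and surfaces them all in one error instead of failing on the first.

```go
package main

import "fmt"

// Toy stand-ins for illustration only; not the real CockroachDB types.
type Intent struct{ Key string }

type WriteIntentError struct{ Intents []Intent }

func (e *WriteIntentError) Error() string {
	return fmt.Sprintf("conflicting intents on %d keys", len(e.Intents))
}

type kv struct {
	key      string
	isIntent bool
}

// exportSpan models the proposed behavior: collect every intent seen
// while scanning (up to maxIntents) and return them in a single
// WriteIntentError, instead of erroring out on the first one.
func exportSpan(data []kv, maxIntents int) ([]string, error) {
	var exported []string
	var intents []Intent
	for _, e := range data {
		if e.isIntent {
			intents = append(intents, Intent{Key: e.key})
			if len(intents) >= maxIntents {
				break // presumably some cap bounds the error size
			}
			continue // keep scanning instead of returning immediately
		}
		exported = append(exported, e.key)
	}
	if len(intents) > 0 {
		return nil, &WriteIntentError{Intents: intents}
	}
	return exported, nil
}

func main() {
	data := []kv{{"a", false}, {"b", true}, {"c", false}, {"d", true}}
	if _, err := exportSpan(data, 1000); err != nil {
		fmt.Println(err) // conflicting intents on 2 keys
	}
}
```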

We should also write a test that exercises the path where an ExportRequest needs to clean up millions of abandoned intents.

gz#7529

Epic: CRDB-2554

@nvanbenschoten added the C-enhancement, A-disaster-recovery, A-storage, and A-kv-transactions labels Feb 2, 2021
@tbg (Member) commented Feb 2, 2021

> We should also write a test that exercises the path where an ExportRequest needs to clean up millions of abandoned intents.

here's a random 2018 flashback for you #18661

@tbg (Member) commented Feb 2, 2021

(Nothing there is useful, just posting for entertainment)

@aayushshah15 (Contributor) commented Feb 2, 2021

The support issue where this popped up saw intents being resolved at a rate of less than 10/sec, is that right?

I'm wondering if you could give me any intuition for why resolving these intents one-by-one is so egregiously slow, since the cluster in question had all its nodes in the same region (or even the same AZ, IIRC).

Pardon my over-simplification, but aren't we just talking about issuing a bunch of QueryIntent / ResolveIntent requests?

@nvanbenschoten (Member, Author)

My current working theory is that there is some quadratic behavior going on here. On each retry, the ExportRequest begins scanning the range from the front and has to scan (while building SSTs) up to the next abandoned intent. So we end up scanning the range num_abandoned_intents times. Collecting all intents in one go would avoid this.
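To see why that's quadratic, here's a tiny back-of-the-envelope model (hypothetical numbers, not measurements): with N evenly spread abandoned intents, restarting from the front after each one costs roughly N²/2 "intent gaps" of scanning, versus N for a single pass that collects them all.

```go
package main

import "fmt"

func main() {
	const n = 10_000 // hypothetical count of abandoned intents, evenly spread

	// One intent per WriteIntentError: retry i re-scans from the front
	// past i intent-gaps, so the total is 1 + 2 + ... + n = n(n+1)/2.
	oneAtATime := n * (n + 1) / 2

	// All intents collected into a single error: one pass over the range.
	batched := n

	fmt.Printf("one-at-a-time: %d gap-scans; batched: %d (%dx more work)\n",
		oneAtATime, batched, oneAtATime/batched)
}
```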

@tbg (Member) commented Feb 8, 2021

Another option (on top of making intent handling more robust, which needs to happen anyway) could be a new request type (much like Scan) that does not return any keys. Export could issue it ahead of its export requests to sanitize the keyspace. It is an extra step (one that could be parallelized across much of the keyspace and wouldn't be very expensive with the separate lock table), but it could further reduce the likelihood of the Export running into an intent in the first place.
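For illustration, a minimal sketch of how such a pre-pass might look, with entirely made-up function names (no such request type exists at the time of this discussion):

```go
package main

import "fmt"

// Entirely hypothetical sketch of the suggestion above. scanIntentsOnly
// stands in for a new scan-like request that returns no rows, only the
// intents it encounters, so they can be resolved before the export runs.
func sanitizeSpan(
	scanIntentsOnly func() []string, // hypothetical "keyless" scan over the span
	resolveBatch func([]string), // batched intent resolution
) {
	for {
		intents := scanIntentsOnly()
		if len(intents) == 0 {
			return // span is clean; the export is unlikely to hit intents now
		}
		resolveBatch(intents) // cheap to find with the separate lock table
	}
}

func main() {
	pending := []string{"a", "b", "c"} // abandoned intents in the span
	sanitizeSpan(
		func() []string { out := pending; pending = nil; return out },
		func(keys []string) { fmt.Println("resolving", keys) },
	)
}
```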

@sumeerbhola (Collaborator)

> My current working theory is that there is some quadratic behavior going on here.
> ...

> Export could issue it ahead of its export requests to sanitize the keyspace. It is an extra step (one that could be parallelized across much of the keyspace and wouldn't be very expensive with the separate lock table) ...

btw, this quadratic behavior was also discussed in #41720 (comment) (second bullet)

@nvanbenschoten (Member, Author)

To be concrete here, the work item will be updating MVCCIncrementalIterator to collect multiple intents in a single WriteIntentError, instead of just a single intent. This is similar to how the pebbleMVCCScanner works. This is on the border of Bulk-I/O and Storage, so I'd appreciate guidance on which direction to route this.

@lunevalex (Collaborator)

cc: @mwang1026 for consideration for 21.2

@tbg (Member) commented Feb 10, 2021 via email

@mwang1026
I'd selfishly like to fast-track this as well, since as far as I can tell it doesn't have clear mitigation steps and also isn't easy for a customer to self-diagnose. Who could we assign this to, and what would the process be for fast-tracking? Do we consider this a bug?

@aayushshah15 (Contributor)

Spoke to @nvanbenschoten offline, I'll take this on my plate.

@aayushshah15 self-assigned this Feb 10, 2021
@dt (Member) commented Feb 10, 2021

I think this is a dupe of #31762, but since this one now has more discussion on it, maybe we close that one instead.

@nvanbenschoten (Member, Author)

Something I found while exploring #60585 was that on release-20.2, we do not consult the finalizedTxnCache when req.WaitPolicy == lock.WaitPolicy_Error in lockTableWaiterImpl.WaitOn (here). I think we'll want to for two reasons:

  1. it will avoid a transaction record push per abandoned intent.
  2. it will allow us to defer and batch intent resolution like we do here.

So while addressing this issue, we'll want to make this change as well on release-20.2. We won't need it on master, as the finalizedTxnCache was pulled into the lockTable by @sumeerbhola in #57947.
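A toy sketch of that check, with made-up types (the real logic lives in lockTableWaiterImpl.WaitOn, and on master inside the lockTable itself):

```go
package main

import "fmt"

type txnID string

// waiter is a made-up stand-in for the lock-table waiter operating
// under lock.WaitPolicy_Error.
type waiter struct {
	finalizedTxns map[txnID]bool // stand-in for the finalizedTxnCache
	deferred      []string       // intent keys to resolve in one batch later
}

// onConflict models the proposed release-20.2 behavior: if the lock
// holder is already known to be finalized, defer its intent for batched
// resolution (reasons 1 and 2 above) instead of pushing the txn record
// or returning an error immediately.
func (w *waiter) onConflict(holder txnID, key string) error {
	if w.finalizedTxns[holder] {
		w.deferred = append(w.deferred, key)
		return nil
	}
	return fmt.Errorf("conflicting lock on %q held by active txn %s", key, holder)
}

func main() {
	w := &waiter{finalizedTxns: map[txnID]bool{"t1": true}}
	_ = w.onConflict("t1", "a") // deferred, no txn-record push
	_ = w.onConflict("t1", "b") // deferred, no txn-record push
	fmt.Println("batched resolution of", w.deferred)
}
```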

Ideally, the testing we'll add when addressing this issue will be comprehensive enough to have caught this difference between master and release-20.2. For this to be true, we'll likely want at least two new tests:

  1. an integration test that shows that BACKUP will efficiently clean up large swaths of abandoned intents
  2. a new test scenario in pkg/kv/kvserver/concurrency/testdata/concurrency_manager/wait_policy_error, similar to the "request resolves the abandoned lock and proceeds" scenario, that shows that abandoned lock cleanup is deferred and batched.

@joshimhoff (Collaborator) commented Apr 19, 2021

Yay to this issue. Can we backport the fix? What CRDB versions will the fix eventually land in? We SREs are pretty motivated to roll this out fleet-wide as early as possible.

Does this help with BACKUP in addition to bulk IO ops like IMPORT / EXPORT?

@nvanbenschoten (Member, Author)

> Can we backport the fix?

Yes, we intend to backport this fix to v21.1, v20.2, and v20.1.

> Does this help with BACKUP in addition to bulk IO ops like IMPORT / EXPORT?

Yep! This will help with any operation that uses ExportRequest, which includes BACKUP.

@joshimhoff (Collaborator)

Amazing.

@aliher1911 (Contributor)

It turned out that 20.1 is not as straightforward. We use RocksDB as the default storage engine there, and the fix lives in the glue code between the storage engine and the layer above, so it only applies to Pebble. So unless there's a real need for it, we'd rather encourage people to upgrade. As for backporting the change to Pebble storage in 20.1, it is possible and the PR exists, but I'm not sure whether we should proceed.

@nvanbenschoten (Member, Author)

> As for backporting the change to Pebble storage in 20.1, it is possible and the PR exists, but I'm not sure whether we should proceed.

I don't think we should bother with this. From the sound of it, we don't have any more scheduled v20.1 patch releases. We also don't expect many v20.1 clusters to be using Pebble.
