Seeing spanner 20k mutation limit errors for inserts in production #333
What's interesting here is that I hit the Spanner 20k limit with only ~3500 items...
The 20k here translates into 2000 items (1 mutation per column; the bsos table has 8 columns, plus 2 extra mutations for our secondary indexes). It looks like the sync client will print the server's limits in its log; can you search your log for "max_total_records"? Let's confirm your client's seeing a value of "2000". E.g. (from my log, probably non-spanner current sync prod):

1573781473132 Sync.Engine.Tabs TRACE new PostQueue config (after defaults): : {"max_request_bytes":2101248,"max_record_payload_bytes":2097152,"max_post_bytes":2101248,"max_post_records":100,"max_total_bytes":209715200,"max_total_records":10000}
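For illustration, the accounting works out like this (a minimal sketch; the column and index counts come from the comment above, and the constant names are made up):

```rust
// Spanner rejects transactions with more than 20,000 mutations.
const SPANNER_MAX_MUTATIONS: usize = 20_000;

// Figures from the comment above: 1 mutation per column written,
// plus 1 per affected secondary index.
const BSO_COLUMNS: usize = 8;
const SECONDARY_INDEXES: usize = 2;

fn main() {
    let mutations_per_item = BSO_COLUMNS + SECONDARY_INDEXES; // 10
    let max_items = SPANNER_MAX_MUTATIONS / mutations_per_item; // 2000
    println!("{} mutations/item -> at most {} items per commit", mutations_per_item, max_items);
}
```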
Ah, I thought that it was one mutation per row too, not per column. That really changes how I view things. We pretty much want to reduce the number of columns to the bare minimum, but I get the feeling we'd get hit on data transfer then.
@pjenvey confirmed that my client is seeing a max_total_records value of 2000 in the logs.
Batch commit also deletes the batch upon completion, adding an extra mutation or two, so our max_total_records should probably be 1998. We need a db_test ensuring this limit: it should create 2 batches, one of the max size and one of max size + 1, ensuring commit returns what we expect. If it's longer running it could be disabled by default w/ #[ignore], but we have no spanner on CI to ensure it doesn't fail over time :(
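A rough sketch of what such a test could look like (hypothetical helper names; the real db_test harness and its setup would differ):

```rust
// Hypothetical stand-ins for the real db_test helpers; the actual
// harness would create bsos in a batch against a live Spanner instance.
fn create_batch_with_bsos(n: usize) -> Vec<u64> {
    (0..n as u64).collect()
}

fn commit_batch(_batch: Vec<u64>) -> Result<(), String> {
    unimplemented!("would commit the batch against Spanner")
}

#[test]
#[ignore] // long-running and needs a live Spanner instance
fn batch_commit_mutation_limit() {
    let max = 1998; // max_total_records once batch commit's overhead is counted

    // A batch of exactly the max size should commit cleanly...
    assert!(commit_batch(create_batch_with_bsos(max)).is_ok());

    // ...while one more record should push past Spanner's 20k ceiling.
    assert!(commit_batch(create_batch_with_bsos(max + 1)).is_err());
}
```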
Would it be worth also making a change to make this value configurable without a deployment? I.e., maybe set the default here, but have it be able to be overridden with an env var or something like that. That way, if something changes that's outside of our control (since this 20k limit is Spanner internals), we don't need to do a deploy to fix it.
@tublitzed We made it a config value. We also enforce the hard spanner limit (not as a config, because it's a hard limit) |
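Conceptually something like this (a minimal sketch; the env var name and defaults are illustrative, not the repo's actual config plumbing):

```rust
use std::env;

// Spanner's hard ceiling; deliberately not a config value.
const SPANNER_MAX_MUTATIONS: u64 = 20_000;
const MUTATIONS_PER_RECORD: u64 = 10; // 8 columns + 2 secondary indexes

// Read a soft limit from the environment (name is made up), fall back
// to the deployed default, and never exceed what the hard limit allows.
fn max_total_records() -> u64 {
    let configured = env::var("SYNC_MAX_TOTAL_RECORDS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1_998);
    configured.min(SPANNER_MAX_MUTATIONS / MUTATIONS_PER_RECORD)
}

fn main() {
    println!("max_total_records = {}", max_total_records());
}
```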
reduce to 1999 (19990 mutations), accounting for batch commit's extra mutations: touch_collection (+1) and batch delete (+2: for delete + BsoExpiry update)

schema: don't restrict bso_id in the schema (this was already applied to stage and fakeprod) and clean up old NLL restriction gone in newer rust versions

Closes #333
Hey, @pjenvey - I'm gonna re-open this one. I'm still seeing the same issue in production with the 0.1.7 release. Just verified now with the same sync id and collection of bookmarks. I'm seeing the exact same errors in my local sync logs as before, i.e. something along these lines:
@tublitzed if you still have TRACE statements, can you confirm the same error log shows a "max_total_records" of 1999?
@pjenvey I do have TRACE statements in the logs; however, I'm not seeing anything at all this time for max_total_records (which I was seeing before). Are you able to repro this on your end with a similarly large collection of bookmarks?
I'm unable to Restore bookmarks from the anonymized collection you linked. While I try to work that out, could you try restarting your browser? mhammond mentioned that would reset any potential stale state here, if that's the issue.
ah, shoot. @pjenvey I'm seeing the same thing with those anonymized bookmarks (i.e., cannot restore). I perhaps went too far with my anonymizing script and butchered something that actually matters. Let me try adjusting it to kill just the urls/titles, and I'll comment back here once I've confirmed that I can actually use those to restore. (Also, restarted browser, no-go.)
Ok, @pjenvey - try again. I just replaced the file and fixed the issue with the anonymizing logic. Confirmed that I can now import them locally.
I thought I'd keep poking a bit to verify that my anon bookmarks do indeed also break when pointed at spanner. Here are a few interesting bits:
Also just confirmed in Sentry that I'm still triggering this error with my non-anon bookmarks, FWIW.
Here are logs with TRACE enabled:
Sentry issue: SYNCSTORAGE-PROD-1 |
There's another occurrence of this today on prod, 0.1.8.
a pile of hacks which we'll hopefully revert shortly. Issue #333
The sentry info from #373 gave us a lead for the latest iteration of this: https://sentry.prod.mozaws.net/operations/syncstorage-prod/issues/6845127/?query=is:unresolved

A nightly desktop client triggers it with a max-size batch of predominantly bso updates vs inserts for its history collection (which seems like an unlikely but certainly possible scenario: modifying > 1999 items in your history at one time).
UPDATEs touch 1-4 rows, so one would expect an UPDATE to cost somewhere between 6 or less (one for each row plus secondary indices) and possibly upwards of 10, matching the INSERT. Instead I'm seeing UPDATEs potentially cost 12 mutations. Given a batch of updates touching all 4 rows, a batch maxes out at 1666 UPDATEs. It's not clear why there's a +2 mutation tax. A quick but unfortunate fix would be to further lower max_total_records accordingly. Let's seek an explanation from Google, which will probably require preparing an easily reproducible case for them to look at.
@pjenvey let's go ahead and make that quickfix as a separate task, and then we can keep this issue open in the backlog so we can revisit it as a lower priority.
Along w/ the 12 mutations per UPDATE mentioned above, I want to note here that I see touch_collection's UPDATE likely costs 4 mutations vs the 1 we assumed (adca8d6). I've confirmed this (as much as possible) with the following test (w/ the redundant touch_collection and batch delete disabled):
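A self-contained model of the accounting that test exercises (a sketch: `commit` here only simulates Spanner's ceiling rather than talking to a real instance, and the per-operation costs are exactly the figures being confirmed):

```rust
const SPANNER_LIMIT: u32 = 20_000;

// Simulates Spanner's mutation ceiling; the real test commits an
// actual transaction and checks for the RpcFailure.
fn commit(total_mutations: u32) -> Result<(), String> {
    if total_mutations > SPANNER_LIMIT {
        Err("The transaction contains too many mutations.".to_owned())
    } else {
        Ok(())
    }
}

fn main() {
    let touch_collection = 4; // the UPDATE we previously costed at 1
    let bso_updates = 1_666 * 12; // a max-size batch of UPDATEs = 19,992
    let filler = SPANNER_LIMIT - touch_collection - bso_updates; // pad to exactly 20,000

    assert!(commit(touch_collection + bso_updates + filler).is_ok()); // 20,000 commits
    assert!(commit(touch_collection + bso_updates + filler + 1).is_err()); // 20,001 fails
}
```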
That succeeds. Adding an extra INSERT afterwards into a test table with 1 column gets us to 20001 and fails. |
and remove the potential, redundant touch_collection on batch commit

recalculating the mutations (see adca8d6):
- 4 for touch_collection: always an UPDATE on batch commit (due to pretouch_collection). we thought this cost 1 but it's probably 4 (see #333)
- 19992 (1666 max_total_records * 12, for the UPDATE case, see #333) for the mix of post_bsos and batch_commit
- 2 for batch delete (1 for the delete and 1 BsoExpiry update)

total: 19998

Closes #333
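Spelled out as arithmetic (a small sketch; the figures come straight from the commit message above):

```rust
fn main() {
    let touch_collection = 4; // always an UPDATE on batch commit
    let record_updates = 1_666 * 12; // max_total_records * worst-case UPDATE cost = 19,992
    let batch_delete = 2; // 1 for the delete + 1 BsoExpiry update

    let total = touch_collection + record_updates + batch_delete;
    assert_eq!(total, 19_998); // comfortably under Spanner's 20,000 ceiling
}
```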
@tublitzed let's get #377 deployed and confirmed on Mon., then pass the new limit along to mozilla-mobile/firefox-ios#5896.
We can at least confirm the recurring sentry events stop on stage/prod. So far so good (they've stopped), but let's give prod some more time just in case. |
I have a large collection of bookmarks (3078 bookmarks and 546 folders) that all of a sudden is failing to sync in production.
Sync id: 130387415. Via bob, I got the following from the production logs:

A database error occurred: RpcFailure(RpcStatus { status: RpcStatusCode(3), details: Some("The transaction contains too many mutations. Insert and update operations count with the multiplicity of the number of columns they affect. For example, inserting values into one key column and four non-key columns count as five mutations total for the insert. Delete and delete range operations count as one mutation regardless of the number of columns affected. The total mutation count includes any changes to indexes that the transaction generates. Please reduce the number of writes, or use fewer indexes. (Maximum number: 20000)") }), status: 500 }
Here's a link to a sample anonymized collection of this same size.