
Seeing spanner 20k mutation limit errors for inserts in production #333

Closed
tublitzed opened this issue Nov 14, 2019 · 25 comments · Fixed by #339, #353 or #377
@tublitzed
Contributor

tublitzed commented Nov 14, 2019

I have a large collection of bookmarks (3078 bookmarks and 546 folders) that has suddenly started failing to sync in production.

Sync id: 130387415. Via bob, I got the following from the production logs:

A database error occurred: RpcFailure(RpcStatus { status: RpcStatusCode(3), details: Some("The transaction contains too many mutations. Insert and update operations count with the multiplicity of the number of columns they affect. For example, inserting values into one key column and four non-key columns count as five mutations total for the insert. Delete and delete range operations count as one mutation regardless of the number of columns affected. The total mutation count includes any changes to indexes that the transaction generates. Please reduce the number of writes, or use fewer indexes. (Maximum number: 20000)") }), status: 500 }

Here's a link to a sample anonymized collection of this same size.

@tublitzed tublitzed added the p1 Stuff we gotta do before we ship! label Nov 14, 2019
@tublitzed tublitzed added the bug Something isn't working label Nov 14, 2019
@tublitzed
Contributor Author

What's interesting here is that I hit the Spanner 20k limit with only ~3500 items...

@pjenvey
Member

pjenvey commented Nov 15, 2019

The 20k here translates into 2000 items (1 mutation per column: the bsos table has 8 columns, plus 2 extra mutations for our secondary indexes, so 10 mutations per insert).

It looks like the sync client prints the server's limits in its log. Can you search your log for "max_total_records"? Let's confirm your client is seeing a value of "2000".

E.g. (from my log, probably non-spanner current sync prod):
1573781473132 Sync.Engine.Tabs TRACE new PostQueue config (after defaults): : {"max_request_bytes":2101248,"max_record_payload_bytes":2097152,"max_post_bytes":2101248,"max_post_records":100,"max_total_bytes":209715200,"max_total_records":10000}
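For reference, here's the arithmetic as a small sketch (the 8-column / 2-index figures are taken from this comment, not re-verified against the schema):

```rust
// Back-of-the-envelope check of the limit discussed above. Column/index
// counts come from this thread, not from inspecting the live schema.
const SPANNER_MAX_MUTATIONS: u32 = 20_000; // Spanner's hard per-commit limit
const BSO_COLUMNS: u32 = 8;                // columns on the bsos table
const SECONDARY_INDEXES: u32 = 2;          // one extra mutation per index per insert

fn main() {
    let mutations_per_insert = BSO_COLUMNS + SECONDARY_INDEXES; // 10
    let max_total_records = SPANNER_MAX_MUTATIONS / mutations_per_insert; // 2000
    println!(
        "{} mutations per insert -> max_total_records = {}",
        mutations_per_insert, max_total_records
    );
}
```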

@jrconlin
Member

jrconlin commented Nov 15, 2019 via email

@tublitzed
Contributor Author

@pjenvey confirmed that my max_total_records is getting set correctly to 2k:

{"max_post_bytes":2097152,"max_post_records":100,"max_record_payload_bytes":2097152,"max_request_bytes":2101248,"max_total_bytes":209715200,"max_total_records":2000}

@pjenvey
Member

pjenvey commented Nov 15, 2019

Batch commit also deletes the batch upon completion, adding an extra mutation or two. So our max_total_records should probably be 1998.

We need a db_test ensuring this limit: it should create 2 batches, one of the max size and one of max size + 1, and ensure commit returns what we expect.

If it's long-running it could be disabled by default w/ #[ignore], but we have no Spanner on CI to ensure it doesn't start failing over time :(
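A rough shape for that test, as a sketch only (the helpers test_db, make_test_bsos, create_batch, and commit_batch are hypothetical stand-ins for whatever the db_tests module actually provides):

```rust
// Sketch of the proposed db_test; all helper names here are hypothetical.
#[test]
#[ignore] // long-running and needs a real Spanner instance; no Spanner on CI
fn batch_commit_respects_mutation_limit() {
    let db = test_db();
    let max = 1998; // proposed max_total_records after the batch-delete mutations

    // A batch of exactly the maximum size should commit cleanly.
    let ok_batch = create_batch(&db, make_test_bsos(max));
    assert!(commit_batch(&db, ok_batch).is_ok());

    // One record over the maximum should be rejected by us rather than
    // surfacing Spanner's 20k mutation error as a 500.
    let too_big = create_batch(&db, make_test_bsos(max + 1));
    assert!(commit_batch(&db, too_big).is_err());
}
```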

@tublitzed
Contributor Author

Would it also be worth making this value configurable without a deployment (i.e., set the default here, but allow it to be overridden with an env var or something like that)?

That way, if something outside of our control changes (since this 20k limit is Spanner internals), we don't need to do a deploy to fix it.

@jrconlin
Member

@tublitzed We made it a config value.
https://github.com/mozilla-services/syncstorage-rs/pull/319/files
I use a config file here because I couldn't figure out a way to specify the sub value as an environment var. I believe that @pjenvey noted a way that it could be done.

We also enforce the hard spanner limit (not as a config, because it's a hard limit)
https://github.com/mozilla-services/syncstorage-rs/pull/324/files
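For the env-var side of the question, a generic pattern (not the actual syncstorage-rs settings code; the variable name below is illustrative) is to read an override at startup and fall back to a compiled-in default:

```rust
use std::env;

// Illustrative only: SYNC_SPANNER_MAX_TOTAL_RECORDS is a made-up variable
// name, and 1998 is the limit proposed earlier in this thread.
const DEFAULT_MAX_TOTAL_RECORDS: u64 = 1998;

fn max_total_records() -> u64 {
    env::var("SYNC_SPANNER_MAX_TOTAL_RECORDS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_MAX_TOTAL_RECORDS)
}

fn main() {
    println!("max_total_records = {}", max_total_records());
}
```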

pjenvey added a commit that referenced this issue Nov 16, 2019
reduce to 1999 (19990 mutations) accounting for batch commit's extra
mutations: touch_collection (+1) and batch delete (+2: for delete +
BsoExpiry update)

schema: don't restrict bso_id in the schema (this was already applied to
stage and fakeprod)

and cleanup old NLL restriction gone in newer rust versions

Closes #333
@tublitzed
Contributor Author

tublitzed commented Nov 19, 2019

Hey, @pjenvey - I'm gonna re-open this one... I'm still seeing the same issue in production with the 0.1.7 release. Just verified now with the same sync id and collection of bookmarks. I'm seeing the exact same errors in my local sync logs as before, i.e. something along these lines:

1574205202429	Sync.Collection	DEBUG	POST fail 503 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/130387415/storage/bookmarks?batch=90c260b6ef3d434291b4bae1066dda24&commit=true
1574205202429	Sync.Collection	WARN	POST request to https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/130387415/storage/bookmarks?batch=90c260b6ef3d434291b4bae1066dda24&commit=true failed with status 503
1574205202429	Sync.Engine.Bookmarks	DEBUG	Uploading records failed: [object Object]
1574205202445	Sync.Engine.Bookmarks.BookmarkSyncUtils	DEBUG	pushChanges: Processed change records: {"weak":0,"skipped":3645,"updated":0}
1574205202445	Sync.ErrorHandler	DEBUG	Got Retry-After: 3600
1574205202445	Sync.Status	DEBUG	Status.sync: success.sync => error.sync.reason.serverMaintenance
1574205202445	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed
1574205202445	Sync.SyncScheduler	DEBUG	Got backoff notification: 3600000ms

@tublitzed tublitzed reopened this Nov 19, 2019
@pjenvey
Member

pjenvey commented Nov 19, 2019

@tublitzed if you still have TRACE statements, can you confirm the same error log shows an info/configuration for this server: entry with "max_total_records":1999?

@tublitzed
Contributor Author

@pjenvey I do have TRACE statements in the logs; however, I'm not seeing anything at all this time for max_total_records (which I was seeing before). Are you able to repro this on your end with a similarly large collection of bookmarks?

@pjenvey
Member

pjenvey commented Nov 20, 2019

I'm unable to Restore bookmarks from the large_anon_collection.json linked above for some reason.

While I try to work that out, could you try restarting your browser? mhammond mentioned that would reset any potential stale state here, if that's the issue.

@tublitzed
Contributor Author

tublitzed commented Nov 21, 2019

Ah, shoot. @pjenvey I'm seeing the same thing with those anonymized bookmarks (i.e., cannot restore). I perhaps went too far with my anonymizing script and butchered something that actually matters. Let me try adjusting it to kill just the urls/titles, and I'll comment back here once I've confirmed that I can actually use those to restore.

(Also, restarted browser, no-go).

@tublitzed
Contributor Author

Ok, @pjenvey - try again. I just replaced the file and fixed the issue with the anonymizing logic. Confirmed that I now can import them locally:
[screenshot: Screen Shot 2019-11-21 at 2 48 25 PM]

@tublitzed
Contributor Author

I thought I'd keep poking a bit to verify that my anon bookmarks do indeed also break when pointed at Spanner. Here are a few interesting bits:

  1. The first time, I get a successful sync log (even though when you open it, you'll see similar failures) -
    succcess_sync.txt

  2. All subsequent sync attempts yield only sync_error logs.
    error_sync.txt

  3. All subsequent syncs also take the same time as the first sync. Which is weird, since they should in theory only be syncing the diff, and should be faster, y? Curious, based on 1 and 2 above, if this points to something broken in PUT that's not happening in POST.

@tublitzed
Contributor Author

Also just confirmed in Sentry that I'm still triggering this error with my non-anon bookmarks, FWIW.

@tublitzed
Contributor Author

Here's logs with TRACE enabled:

sync_error_with_trace.txt
sync_success_with_trace.txt

pjenvey added a commit that referenced this issue Nov 26, 2019
write the pending bsos immediately when committing a batch instead of
queueing them in the batch_bsos, so we don't pay twice for their
mutations

also

- optimize batch append to insert all items in one write
- interleave BsoExpiry w/ user_collections

Closes #333, #318
@tublitzed
Contributor Author

Sentry issue: SYNCSTORAGE-PROD-1

@pjenvey
Member

pjenvey commented Dec 5, 2019

There's another occurrence of this today on prod (0.1.8).

@pjenvey pjenvey reopened this Dec 5, 2019
pjenvey added a commit that referenced this issue Dec 11, 2019
a pile of hacks which we'll hopefully revert shortly

Issue #333
@pjenvey
Member

pjenvey commented Dec 12, 2019

The sentry info from #373 gave us a lead for the latest iteration of this: https://sentry.prod.mozaws.net/operations/syncstorage-prod/issues/6845127/?query=is:unresolved

A nightly desktop client triggers it with a max size batch of predominantly bso updates vs inserts for its history collection (which seems like an unlikely but certainly possible scenario, modifying > 1999 items in your history at one time).

bsos INSERTs cost 10 mutations per row w/ the table having 8 columns and 2 secondary indices.

UPDATEs touch 1-4 columns, so one would expect an UPDATE to cost somewhere between 6 or less (one per column plus the secondary indices) and possibly upwards of 10, matching the INSERT.

Instead I'm seeing UPDATEs cost potentially 12 mutations. Given a batch of UPDATEs touching all 4 columns, the batch maxes out at 1666 records. It's not clear why there's a +2 mutation tax.

A quick but unfortunate fix would be to further lower max_total_records towards 1666 (with extra padding for the additional operations, see #333).

Let's seek an explanation from Google, which will probably require preparing an easily reproducible case for them to look at.
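Making the arithmetic explicit (the 12-mutations-per-UPDATE figure is the observed value from above, not a documented Spanner cost):

```rust
// Per-operation mutation costs as discussed in this thread; the UPDATE
// cost is empirical, and the +2 over INSERT is still unexplained.
const MAX_MUTATIONS: u32 = 20_000;
const INSERT_COST: u32 = 10;          // 8 columns + 2 secondary indexes
const OBSERVED_UPDATE_COST: u32 = 12; // measured; expected <= 10

fn main() {
    println!("insert-only limit:  {}", MAX_MUTATIONS / INSERT_COST);          // 2000
    println!("update-heavy limit: {}", MAX_MUTATIONS / OBSERVED_UPDATE_COST); // 1666
}
```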

@tublitzed tublitzed added the 8 Estimate - xl - Moderately complex, medium effort, some uncertainty. label Dec 12, 2019
@tublitzed
Contributor Author

@pjenvey let's go ahead and make that quickfix as a separate task and then we can keep this issue open in the backlog so we can revisit as a p2.

@pjenvey
Member

pjenvey commented Dec 13, 2019

Along w/ the 12 mutations per UPDATE mentioned above, I want to note here that touch_collection's UPDATE likely costs 4 mutations vs. the 1 we assumed (adca8d6).

I've confirmed this (as much as possible) from the following test (w/ the redundant touch_collection and batch delete disabled):

  • 4: touch_collection (UPDATE)
  • 4: touch_collection (UPDATE)
  • 19992: post_bsos/batch commit (UPDATE) of 1666 bsos (1666 * 12)

Total: 20000

That succeeds. Adding an extra INSERT afterwards into a test table with 1 column gets us to 20001 and fails.
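Summing that commit against the 20k budget (all figures are the empirical values from this comment):

```rust
// Mutation accounting for the test described above.
fn main() {
    let touch_collections = 4 + 4; // two touch_collection UPDATEs at 4 each
    let batch_commit = 1666 * 12;  // 1666 bso UPDATEs at 12 mutations each = 19992
    let total = touch_collections + batch_commit;
    assert_eq!(total, 20_000);     // exactly at the limit, so the commit succeeds
    // One more single-column INSERT would push this to 20_001 and fail.
    println!("total mutations: {}", total);
}
```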

pjenvey added a commit that referenced this issue Dec 13, 2019
and remove the potential, redundant touch_collection on batch commit

recalculating the mutations (see adca8d6):

- 4 for touch_collection: always an UPDATE on batch commit (due to
  pretouch_collection). we thought this cost 1 but it's probably 4 (see
  #333)
- 19992 (1666 max_total_records * 12, for the UPDATE case, see #333) for
  the mix of post_bsos and batch_commit
- 2 for batch delete (1 for the delete and 1 BsoExpiry update)

total: 19998

Closes #333
@pjenvey
Member

pjenvey commented Dec 14, 2019

@tublitzed let's get #377 deployed and confirmed on Mon., then confirm the new limit over on mozilla-mobile/firefox-ios#5896.

@tublitzed
Contributor Author

tublitzed commented Dec 17, 2019

@pjenvey I was never able to repro the issue again after the original fix went out. Can you think of any way we can test to confirm the new 1666 limit resolved the issue?

@pjenvey
Member

pjenvey commented Dec 17, 2019

@pjenvey I was never able to repro the issue again after the original fix went out. Can you think of any way we can test to confirm the new 1666 limit resolved the issue

We can at least confirm the recurring sentry events stop on stage/prod. So far so good (they've stopped), but let's give prod some more time just in case.
