Seeing spanner 20k mutation limit errors for inserts in production #333
What's interesting here is that I hit the Spanner 20k limit with only ~3500 items...
The 20k here translates into 2000 items (1 mutation per column; the bsos table has 8 columns, plus 2 extra mutations for our secondary indexes). It looks like the sync client will print the server's limits in its log; can you search your log for "max_total_records"? Let's confirm your client's seeing a value of "2000". E.g. (from my log, probably non-spanner current sync prod):

1573781473132 Sync.Engine.Tabs TRACE new PostQueue config (after defaults): : {"max_request_bytes":2101248,"max_record_payload_bytes":2097152,"max_post_bytes":2101248,"max_post_records":100,"max_total_bytes":209715200,"max_total_records":10000}
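For illustration, the accounting works out like this (a minimal sketch; the column and index counts come from the comment above, and the constant names are made up):

```rust
// Spanner rejects transactions with more than 20,000 mutations.
const SPANNER_MAX_MUTATIONS: usize = 20_000;

// Figures from the comment above: 1 mutation per column written,
// plus 1 per affected secondary index.
const BSO_COLUMNS: usize = 8;
const SECONDARY_INDEXES: usize = 2;

fn main() {
    let mutations_per_item = BSO_COLUMNS + SECONDARY_INDEXES; // 10
    let max_items = SPANNER_MAX_MUTATIONS / mutations_per_item; // 2000
    println!("{} mutations/item -> at most {} items per commit", mutations_per_item, max_items);
}
```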
Ah, I thought that it was one mutation per row too, not per column. That really changes how I view things. We pretty much want to reduce the number of columns to the bare minimum, but I get the feeling we'd get hit on data transfer then.
@pjenvey confirmed that my client is seeing a max_total_records value of 2000 in the logs.
Batch commit also deletes the batch upon completion, adding an extra mutation or two, so our max_total_records should probably be 1998. We need a db_test ensuring this limit: it should create 2 batches, one of the max size and one of max size + 1, ensuring commit returns what we expect. If it's longer running it could be disabled by default w/ #[ignore], but we have no spanner on CI to ensure it doesn't fail over time :(
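A rough sketch of what such a test could look like (hypothetical helper names; the real db_test harness and its setup would differ):

```rust
// Hypothetical stand-ins for the real db_test helpers; the actual
// harness would create bsos in a batch against a live Spanner instance.
fn create_batch_with_bsos(n: usize) -> Vec<u64> {
    (0..n as u64).collect()
}

fn commit_batch(_batch: Vec<u64>) -> Result<(), String> {
    unimplemented!("would commit the batch against Spanner")
}

#[test]
#[ignore] // long-running and needs a live Spanner instance
fn batch_commit_mutation_limit() {
    let max = 1998; // max_total_records once batch commit's overhead is counted

    // A batch of exactly the max size should commit cleanly...
    assert!(commit_batch(create_batch_with_bsos(max)).is_ok());

    // ...while one more record should push past Spanner's 20k ceiling.
    assert!(commit_batch(create_batch_with_bsos(max + 1)).is_err());
}
```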
Would it be worth also making a change to make this value configurable without a deployment? I.e., maybe set the default here, but have it be able to be overridden with an env var or something like that. That way, if something changes that's outside of our control (since this 20k limit is Spanner internals), we don't need to do a deploy to fix it.
@tublitzed We made it a config value. We also enforce the hard spanner limit (not as a config, because it's a hard limit) |
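Conceptually something like this (a minimal sketch; the env var name and defaults are illustrative, not the repo's actual config plumbing):

```rust
use std::env;

// Spanner's hard ceiling; deliberately not a config value.
const SPANNER_MAX_MUTATIONS: u64 = 20_000;
const MUTATIONS_PER_RECORD: u64 = 10; // 8 columns + 2 secondary indexes

// Read a soft limit from the environment (name is made up), fall back
// to the deployed default, and never exceed what the hard limit allows.
fn max_total_records() -> u64 {
    let configured = env::var("SYNC_MAX_TOTAL_RECORDS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1_998);
    configured.min(SPANNER_MAX_MUTATIONS / MUTATIONS_PER_RECORD)
}

fn main() {
    println!("max_total_records = {}", max_total_records());
}
```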
reduce to 1999 (19990 mutations), accounting for batch commit's extra mutations: touch_collection (+1) and batch delete (+2: for delete + BsoExpiry update)

schema: don't restrict bso_id in the schema (this was already applied to stage and fakeprod) and clean up old NLL restriction gone in newer rust versions

Closes #333
Hey, @pjenvey - I'm gonna re-open this one. I'm still seeing the same issue in production with the 0.1.7 release. Just verified now with the same sync id and collection of bookmarks. I'm seeing the exact same errors in my local sync logs as before, i.e. something along these lines:
@tublitzed if you still have TRACE statements, can you confirm the same error log shows a "max_total_records" of 1999?
@pjenvey I do have TRACE statements in the logs; however, I'm not seeing anything at all this time for max_total_records (which I was seeing before). Are you able to repro this on your end with a similarly large collection of bookmarks?
I'm unable to Restore bookmarks from the anonymized collection you linked. While I try to work that out, could you try restarting your browser? mhammond mentioned that would reset any potential stale state here, if that's the issue.
ah, shoot. @pjenvey I'm seeing the same thing with those anonymized bookmarks (i.e., cannot restore). I perhaps went too far with my anonymizing script and butchered something that actually matters. Let me try adjusting it to kill just the urls/titles, and I'll comment back here once I've confirmed that I can actually use those to restore. (Also, restarted browser, no-go.)
Ok, @pjenvey - try again. I just replaced the file and fixed the issue with the anonymizing logic. Confirmed that I can now import them locally.
I thought I'd keep poking a bit to verify that my anon bookmarks do indeed also break when pointed at spanner. Here are a few interesting bits:
Also just confirmed in Sentry that I'm still triggering this error with my non-anon bookmarks, FWIW.
Here are logs with TRACE enabled:
Sentry issue: SYNCSTORAGE-PROD-1 |
There's another occurrence of this today on prod, 0.1.8.
a pile of hacks which we'll hopefully revert shortly. Issue #333
The sentry info from #373 gave us a lead for the latest iteration of this: https://sentry.prod.mozaws.net/operations/syncstorage-prod/issues/6845127/?query=is:unresolved

A nightly desktop client triggers it with a max-size batch of predominantly bso updates vs inserts for its history collection (which seems like an unlikely but certainly possible scenario: modifying > 1999 items in your history at one time).
UPDATEs touch 1-4 rows, so one would expect an UPDATE to cost somewhere between 6 or less (one for each row plus secondary indices) and possibly upwards of 10, matching the INSERT. Instead I'm seeing UPDATEs potentially cost 12 mutations. Given a batch of updates touching all 4 rows, a batch maxes out at 1666 UPDATEs. It's not clear why there's a +2 mutation tax. A quick but unfortunate fix would be to further lower max_total_records accordingly. Let's seek an explanation from Google, which will probably require preparing an easily reproducible case for them to look at.
@pjenvey let's go ahead and make that quickfix as a separate task, and then we can keep this issue open in the backlog so we can revisit it as a lower priority.
Along w/ the 12 mutations per UPDATE mentioned above, I want to note here that I see touch_collection's UPDATE likely costs 4 mutations vs the 1 we assumed (adca8d6). I've confirmed this (as much as possible) with the following test (w/ the redundant touch_collection and batch delete disabled):
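A self-contained model of the accounting that test exercises (a sketch: `commit` here only simulates Spanner's ceiling rather than talking to a real instance, and the per-operation costs are exactly the figures being confirmed):

```rust
const SPANNER_LIMIT: u32 = 20_000;

// Simulates Spanner's mutation ceiling; the real test commits an
// actual transaction and checks for the RpcFailure.
fn commit(total_mutations: u32) -> Result<(), String> {
    if total_mutations > SPANNER_LIMIT {
        Err("The transaction contains too many mutations.".to_owned())
    } else {
        Ok(())
    }
}

fn main() {
    let touch_collection = 4; // the UPDATE we previously costed at 1
    let bso_updates = 1_666 * 12; // a max-size batch of UPDATEs = 19,992
    let filler = SPANNER_LIMIT - touch_collection - bso_updates; // pad to exactly 20,000

    assert!(commit(touch_collection + bso_updates + filler).is_ok()); // 20,000 commits
    assert!(commit(touch_collection + bso_updates + filler + 1).is_err()); // 20,001 fails
}
```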
That succeeds. Adding an extra INSERT afterwards into a test table with 1 column gets us to 20001 and fails. |
and remove the potential, redundant touch_collection on batch commit

recalculating the mutations (see adca8d6):
- 4 for touch_collection: always an UPDATE on batch commit (due to pretouch_collection). we thought this cost 1 but it's probably 4 (see #333)
- 19992 (1666 max_total_records * 12, for the UPDATE case, see #333) for the mix of post_bsos and batch_commit
- 2 for batch delete (1 for the delete and 1 BsoExpiry update)

total: 19998

Closes #333
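Spelled out as arithmetic (a small sketch; the figures come straight from the commit message above):

```rust
fn main() {
    let touch_collection = 4; // always an UPDATE on batch commit
    let record_updates = 1_666 * 12; // max_total_records * worst-case UPDATE cost = 19,992
    let batch_delete = 2; // 1 for the delete + 1 BsoExpiry update

    let total = touch_collection + record_updates + batch_delete;
    assert_eq!(total, 19_998); // comfortably under Spanner's 20,000 ceiling
}
```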
@tublitzed let's get #377 deployed and confirmed on Mon., then pass the new limit along to mozilla-mobile/firefox-ios#5896.
We can at least confirm the recurring sentry events stop on stage/prod. So far so good (they've stopped), but let's give prod some more time just in case. |
I have a large collection of bookmarks (3078 bookmarks and 546 folders) that all of a sudden is failing to sync in production.
Sync id: 130387415. Via bob, I got the following from the production logs:

A database error occurred: RpcFailure(RpcStatus { status: RpcStatusCode(3), details: Some("The transaction contains too many mutations. Insert and update operations count with the multiplicity of the number of columns they affect. For example, inserting values into one key column and four non-key columns count as five mutations total for the insert. Delete and delete range operations count as one mutation regardless of the number of columns affected. The total mutation count includes any changes to indexes that the transaction generates. Please reduce the number of writes, or use fewer indexes. (Maximum number: 20000)") }), status: 500 }
Here's a link to a sample anonymized collection of this same size.