Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adopt the new user_id scheme for the Spanner backend #170

Closed
pjenvey opened this issue Jul 12, 2019 · 7 comments
Closed

Adopt the new user_id scheme for the Spanner backend #170

pjenvey opened this issue Jul 12, 2019 · 7 comments
Assignees
Labels
5 Estimate - l - Moderately complex, will require some effort but clearly defined. bug Something isn't working p1 Stuff we gotta do before we ship!

Comments

@pjenvey
Copy link
Member

pjenvey commented Jul 12, 2019

Mysql uses the "legacy" user_id (an autoincremented id from the token server). Spanner requires a non serial user_id primary key: (discussion).

Our HawkPayload hasn't stubbed these values out yet (fxa_uid/fxa_kid and we'll likely utilize hash_fxa_uid for metrics). They need to be added and the tests should be adjusted to include such values (if they don't already).

Finally Spanner should switch from legacy_id over to the new values.

@pjenvey pjenvey added bug Something isn't working p1 Stuff we gotta do before we ship! labels Jul 12, 2019
@pjenvey
Copy link
Member Author

pjenvey commented Aug 14, 2019

Additionally the db_tests should somehow setup non-legacy HawkIdentifiers when testing the spanner backend

@jrconlin
Copy link
Member

Is this value ever exposed? (Wondering if we could use an augmented time value or something since we may not have a central location to draw from.)

@pjenvey
Copy link
Member Author

pjenvey commented Aug 14, 2019

Assuming you mean the legacy_id: various client implementations assume an integer user_id. We need to continue providing one for them and they'd continue using it for e.g. constructing the {uid} parameter in our sync storage urls.

But underneath the Spanner impl. would use the new uuid-style id from the Hawk header for its user_id.

(We currently have a centralized location to draw the legacy ids from -- the tokenserver -- but we want the tokenserver to eventually go away)

@rfk
Copy link
Contributor

rfk commented Sep 20, 2019

But underneath the Spanner impl. would use the new uuid-style id from the Hawk header for its user_id.

So, there's one thing to be careful of which I'm not sure is clear from the discussion in mozilla-services/tokenserver#124, so I'm going to re-iterate it here just in case.

The tokenserver currently changes the numeric uid for a user whenever their FxA encryption key changes. Because the current storage nodes index their storage by this numeric userid, the effect is that the user's storage gets "wiped" automagically when their encryption key changes - they start writing into a different bucket on the storage node that doesn't have any of the old data in it encrypted under the old key.

There is also a background job on the tokenserver which periodically asks the storage nodes to delete data stored under numberic userids that have become obsolete.

If we switch to indexing data storage by the (stable and immutable) fxa_uid instead, we'll need to make sure that we have some other way to:

  1. Separate data encrypted with different encryption keys into distinct storage buckets
  2. Purge old data encrypted with old encryption keys as a background cleanup job

I don't recall if we settled on a precise scheme for how to achieve those two things in a durable-storage world. The tokenserver sends an fxa_kid field that uniquely identifies the user's current encryption key and should be helpful here.

@pjenvey
Copy link
Member Author

pjenvey commented Sep 20, 2019

We've simply combined the 2 values to act as storage's userid:

https://github.com/mozilla-services/server-syncstorage/blob/fee296f/syncstorage/storage/spanner.py#L82

We should consider including fxa_kid as a separate column (but part of the primary key) instead, the Python version's concatenating of the values into userid was mostly for the sake of expediency

@rfk
Copy link
Contributor

rfk commented Sep 21, 2019

Great, thanks you!

We should consider including fxa_kid as a separate column

👍; since the kid contains a leading timestamp, you should be able to sort on this column in order to detect and delete data encrypted with older keys. Issue mozilla-services/tokenserver#124 is in service of ensuring that that timestamp doesn't change if it doesn't have to, I'll cycle back to looking at the outstanding work there this week.

@rfk
Copy link
Contributor

rfk commented Sep 24, 2019

FWIW I left a couple of notes in mozilla-services/tokenserver#124 about how the kid field currently works and how I think it could be improved, to try to make the context there clearer.

@pjenvey pjenvey self-assigned this Sep 26, 2019
@tublitzed tublitzed assigned jrconlin and unassigned pjenvey Sep 26, 2019
@tublitzed tublitzed added 2 Estimate - s - This is a small change with clearly defined parameters. 5 Estimate - l - Moderately complex, will require some effort but clearly defined. and removed 2 Estimate - s - This is a small change with clearly defined parameters. labels Sep 27, 2019
jrconlin added a commit that referenced this issue Sep 27, 2019
* May want to use alternate "sync_kid" db.
  DSN: spanner://projects/sync-spanner-dev-225401/instances/spanner-test/databases/sync_kid
* switched over index to use "fxa_id" (kid), and pass HawkParameter when possible

Issue #170
@jrconlin jrconlin mentioned this issue Oct 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 Estimate - l - Moderately complex, will require some effort but clearly defined. bug Something isn't working p1 Stuff we gotta do before we ship!
Projects
None yet
Development

No branches or pull requests

4 participants