Improve self-hosted ⇒ SaaS conversion alongside EU rollout #153

azaslavsky · 2023-06-26T23:17:00Z

Problem

The current conversion rate for users migrating from self-hosted to SaaS (henceforth “relocating”, to distinguish it from ordinary database migrations) is far from ideal: less than 1% of users who enter the funnel successfully become Sentry SaaS customers.

Relocation also takes a heavy toll on ops support, as each relocation must be carried out manually. The volume of these support efforts is expected to increase greatly with the debut of EU-region support in the second half of this year. Users who are already on SaaS Sentry may need a similar relocation as they leverage the hybrid cloud effort to move regions.

Finally, the current method of relocation, an ad-hoc, manually executed script that acts as a second “backdoor” import path, is untested and difficult to maintain. This has resulted in subtle schema skew bugs that took significant effort to fix, and that could have been much more damaging had they not been caught quickly.

Goals

There are several goals:

- Increase the funnel conversion rate to >=3% by the end of Q4.
- Retire the load-it-up.py script entirely, along with the process/playbook that surrounds it.
- Achieve the above without corrupting or compromising production data on either the exporting or the importing instance.
- Unify and test the implementation: rather than having two poorly tested implementations, we should have one well-tested path for all import/export operations.
- Bonus: Make the relocation process completely user-driven by the end of Q3, thereby removing ops from the loop in the normal case.

Non-Goals

There are a number of potential future improvements we are explicitly not optimizing for in this first pass. This is not to say that we won’t be interested in circling back and implementing them after the relocation pipeline is healthy and running (see Potential Future Work below), just that they are not strictly in scope for the first milestone.

- Increasing the scope of relocatable artifacts: Currently, some items are not in scope for relocation, particularly issues and events. That is, when you relocate, you keep your users/orgs/projects, but lose your issues/events, among others. It would certainly be desirable to implement this eventually and make relocation completely “seamless”, but it is not necessary for a first pass. Every goal listed in this document (a single, well-tested implementation, decreased ops burden, ensuring data correctness, etc.) is a necessary prerequisite to supporting issue and event relocation, so let’s focus on walking before we run.
- One-click and server-to-server relocation: While it is tempting in the long term to make relocations entirely self-serve, up to and including directly connecting the self-hosted server to the SaaS backend and gradually shifting users over without disrupting event flow, these are much bigger projects that would involve a lot more orchestration and robustness. Having users manually move JSON blobs is okay to start.
- Enabling merging or update operations: The purpose here is to get existing self-hosted and inter-region users up and running on a “fresh” SaaS account as quickly and painlessly as reasonably possible. Doing complex merges or in-place updates/overwrites on existing accounts is out of scope, as the vast majority of users in this funnel are trying to set up a new account. Merge and update operations introduce a lot of stateful edge cases that will be difficult to enumerate and test for.

Assumptions

The main assumption is that the current conversion rate in the funnel is primarily blocked on the slowness and difficulty of the relocation. It is possible, though intuitively unlikely, that we make the relocation much easier and conversion rates do not meaningfully increase.

Another assumption is that the organization-merging functionality of the load-it-up.py script is vestigial and not needed by ops, and that we should therefore prefer to just keep the original organizations (modulo changing slugs) when they appear in a backup.

Proposal

We propose to do the following:

- Write a thorough set of test cases for both backup.py and load-it-up.py. In theory, the import_ method on backup.py should have sufficient flexibility to replicate everything that load-it-up.py does (modulo the merging of orgs, see above), so a good end state is to have both scripts pass the same set of tests.
- Modify import_ to use .create() instead of .save(), and to call the serializer’s .validate() method before .create() (we may opt to keep the old functionality behind a self-hosted-only flag). This will make import_ an INSERT-only script, will ensure that data is validated before being ingested, and will prevent any existing data from being modified on the relocation target.
- Once we are confident that load-it-up.py can be retired in favor of the import_ flow on backup.py, and that the backup.py import/export functionality can be used on both SaaS and self-hosted, add an API endpoint to perform imports for new accounts. This would probably involve importing to some siloed or otherwise protected “import database” and validating the data, before relocating all of that database’s data to the main SaaS database.
- Add a screen during onboarding (post email verification) that allows users to upload their exported self-hosted JSON backup (note: these could be quite large, so even with user verification in place, we’ll still need to think a bit about resource limits here). This would hit the endpoint described above, and send the user an email when their relocation succeeds, or otherwise notify them that it failed.
- In the case of failure (that is, the user uploaded a JSON backup that could not be validated), inform the user and automatically open a support ticket on their behalf.
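The INSERT-only behavior described in the second step can be illustrated with a minimal sketch. This is not the backup.py implementation: it uses the standard-library sqlite3 module in place of Django models, and the `organization` table, the `validate` helper, and the record layout are all hypothetical. What it shows is the ordering (validate everything, then write) and the fact that a plain INSERT fails on a collision rather than overwriting an existing row.

```python
import json
import sqlite3


def validate(record: dict) -> None:
    """Hypothetical validation step: reject records missing required fields."""
    for field in ("pk", "slug", "name"):
        if field not in record:
            raise ValueError(f"record is missing required field {field!r}")
    if not isinstance(record["pk"], int):
        raise ValueError("pk must be an integer")


def insert_only_import(conn: sqlite3.Connection, blob: str) -> int:
    """Validate every record, then INSERT them all in one transaction.

    A plain INSERT (never INSERT OR REPLACE) means a primary-key or unique
    collision raises sqlite3.IntegrityError instead of modifying a row.
    """
    records = json.loads(blob)
    for record in records:
        validate(record)  # validate everything before writing anything
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO organization (id, slug, name) "
            "VALUES (:pk, :slug, :name)",
            records,
        )
    return len(records)
```

Because the writes share one transaction, a collision on any record leaves the target exactly as it was before the import started.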

Risks

The major risk is that by changing the process, which currently works in its own brittle way, we introduce production breakages or data corruption. To mitigate this, great care will need to be taken to build an expansive test suite that verifies this process won’t damage data on either the exporting or importing side.

In terms of resources and API design, we are going to be importing and then processing very large JSON blobs, then merging them into production databases. Care will need to be taken to ensure that these operations are all properly secured and throttled, so as not to introduce user-input vulnerabilities, via either malicious intent or simply very large inputs.
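One way to guard against very large inputs is to cap the number of bytes read before parsing. The sketch below is only an illustration: `MAX_UPLOAD_BYTES`, `UploadTooLarge`, and `read_capped_json` are hypothetical names, and the 10 MiB figure is a placeholder, since no limit has been decided here.

```python
import json
from typing import BinaryIO

# Hypothetical cap; the real limit is still an open design decision.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured byte cap."""


def read_capped_json(
    stream: BinaryIO,
    max_bytes: int = MAX_UPLOAD_BYTES,
    chunk_size: int = 64 * 1024,
):
    """Read the stream in chunks, aborting as soon as the cap is exceeded."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
    # Invalid JSON raises json.JSONDecodeError, a ValueError subclass.
    return json.loads(buf.decode("utf-8"))
```

Reading in chunks means an oversized upload is rejected after at most one chunk past the cap, rather than after the whole body has been buffered.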

Because we are uniting two implementations into one, there is always some risk that some property of one of the implementations will be lost. It is a bit difficult to ascertain how likely this is because of the almost complete lack of tests for both implementations, so we will need to rely on some combination of a new but thorough test suite and user reports to guard against this.
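A test suite of this kind could compare an export taken before a relocation with one taken after it and report any differences. The sketch below assumes the export is a JSON list of entries with `model`, `pk`, and `fields` keys (the layout of Django's JSON serializer); `index_export` and `compare_exports` are hypothetical helpers, not existing functions.

```python
import json


def index_export(blob: str) -> dict:
    """Index an export by (model, pk); assumes a Django-fixture-like layout."""
    return {(e["model"], e["pk"]): e["fields"] for e in json.loads(blob)}


def compare_exports(expected: str, actual: str) -> list:
    """Return human-readable findings; an empty list means the exports match."""
    left, right = index_export(expected), index_export(actual)
    findings = []
    for key in sorted(left.keys() | right.keys()):
        if key not in right:
            findings.append(f"missing from actual: {key}")
        elif key not in left:
            findings.append(f"unexpected in actual: {key}")
        elif left[key] != right[key]:
            # Name the differing fields so the finding is actionable.
            changed = sorted(
                f
                for f in left[key].keys() | right[key].keys()
                if left[key].get(f) != right[key].get(f)
            )
            findings.append(f"fields differ for {key}: {changed}")
    return findings
```

Running the same comparison against both implementations would give one shared definition of "nothing was lost".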

Open Questions

There are some important open questions that will need to be resolved during implementation:

- Where exactly will we write data during validation? Will we have a shared “validation” database, or a standalone database spun up for each relocation operation? How will we move the now-validated data into prod - by simply copying rows from the validation database, or by re-running import_ on the validated JSON, but now pointing at prod?
- Should we loosen, or remove, import atomicity during validation? Atomicity has the benefit of allowing an all-or-nothing transaction for the entire import, but the downside of potentially locking up a database for large imports. More research is needed to figure out the best path forward.
- How will we mitigate potential performance issues when importing large blobs? We probably won’t have much contention for the validation database (and can disable atomicity for it anyway), but bulk moving an entire Sentry instance’s worth of now-validated data into the production database will require some finesse.
- How do we import control silo models, which are (generally) globally scoped? There will be collisions here (for example, users that already exist, or org slugs that are already taken), so we’ll need some sort of custom logic to handle this. It’s hard to imagine avoiding writing special import logic for these on a case-by-case basis.
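To make the collision question concrete, here is one possible shape for the custom logic, sketched under assumptions: imported orgs receive fresh primary keys, a taken slug gets a numeric suffix, and dependent rows have their foreign keys rewritten through an old-to-new map. The function names and record layout are hypothetical, and the suffixing policy is only one of several that could be chosen.

```python
def unique_slug(slug: str, taken: set) -> str:
    """Return slug unchanged if free, else append the first free numeric suffix."""
    if slug not in taken:
        return slug
    n = 1
    while f"{slug}-{n}" in taken:
        n += 1
    return f"{slug}-{n}"


def remap_orgs(orgs: list, projects: list, taken_slugs: set, next_pk: int):
    """Assign fresh pks and collision-free slugs to imported orgs, then
    rewrite each project's organization foreign key via the pk map."""
    taken = set(taken_slugs)  # copy so the caller's set is not mutated
    pk_map = {}
    new_orgs = []
    for org in orgs:
        slug = unique_slug(org["slug"], taken)
        taken.add(slug)
        pk_map[org["pk"]] = next_pk
        new_orgs.append({**org, "pk": next_pk, "slug": slug})
        next_pk += 1
    new_projects = [
        {**p, "organization": pk_map[p["organization"]]} for p in projects
    ]
    return new_orgs, new_projects, pk_map
```

Users are harder than slugs, since an existing user may be the same person rather than a naming clash, so a sketch like this covers only the simplest class of collision.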

Potential Future Work

All of the non-goals mentioned above (increasing the scope of relocatable artifacts, one-click server-to-server integration, and more customizable and precise relocation operations) are on the table as we move forward. In particular, it would be very nice to get to an end state where users start a relocation (either self-hosted -> SaaS, or SaaS region-to-region via hybrid cloud), and we seamlessly move 100% of their region-siloed data over in a way that is almost entirely opaque to them. This could include temporarily forwarding events that occur while the relocation is taking place, and carefully handing over control between the source and target of the relocation, so that from the user perspective, the whole operation is “one click and wait for a confirmation email” easy.

Q3 Milestones

Scope for migration tool #154
Enable JSON comparation of Sentry models #155

Add thorough test suite for backup.py #156

Ensure that new backup tests work with load-it-up as well #158
Add toggle for "INSERT-only" vs "INSERT_OR_UPDATE" modes on import #170
Cleanly support PK and FK remapping when importing #171
Implement "backup scopes" to give more import/export granularity #166
Support org slug mapping #182
Add org-based filtering to import/export #167
Add __include_in_export__ support for any models needed for first-pass of relocation #172
MILESTONE 1 DONE: All import/export functionality works locally
User reconciliation logic for importer User export scoped models #181
Change self-hosted backup/restore API #183
Release new backup/restore commands to self-hosted #184
Properly handle unique=True field collisions on import #193
Modify user model to support "unclaimed" SaaS users loaded from an import #192
Create "validation database" docker container #168
Validation pipeline works on cloudbuild #199
MILESTONE 2 DONE: Imports are properly validated using production services
Add "custom" relocation scope for models that, situationally, fit best in one or another #186
Findings should be clearer and more focused #201
Import works across split databases #185
Relocation import and export works across RPC boundaries #196
Properly handle atomicity for import/export #202
Design and create models for storing active relocations #204
Get validation database pipeline working behind API endpoint #203
Support export encryption and corresponding import decryption #207
Release test for import from last two versions before current #197
Implement API endpoint on Sentry SaaS to support migration imports #169
MILESTONE 3 DONE: feature deployed behind API endpoint in limited-availability

chadwhitacre · 2023-07-18T20:01:28Z

@azaslavsky to reconcile the two requirements docs (one🔒, two🔒) and update the description for this ticket.
@gauthamcs and @azaslavsky to identify SE stakeholder.
@chadwhitacre to set up weekly meeting and modify Slack channel.
Open conversation threads:
- validation workflow for prod
- including issues in scope

chadwhitacre · 2023-08-03T20:42:10Z

After talking with PMM, I've added a new stretch goal for this project: to get the self-hosted broadcast system running. It would be great in general to be able to send "What's New" messages to self-hosted users (we could announce new versions, for example, especially out-of-band releases, as well as the Self-hosted Sesh). This could also entail understanding why the beacon data is so off, since the beacon and broadcasts are related.

chadwhitacre · 2023-10-23T20:43:47Z

Design: Relocation-Specific Models

chadwhitacre · 2023-11-13T21:59:54Z

Chatted with @azaslavsky: I'm going to help recruit self-hosted users to help us develop and test this process. I'll start my prospecting on getsentry/sentry#49564, and will likely end up working with SE on this as well.

chadwhitacre · 2024-01-08T19:31:43Z

Discussed at the OSPO team meeting: EU is a big rollout, so if it makes sense, let's ship to US first as a soft launch and take it to EU when that's fully ready.

chadwhitacre · 2024-04-30T17:09:21Z

Relocation is live in US and EU. 👍

azaslavsky added the Type: Epic label Jun 26, 2023

azaslavsky self-assigned this Jun 26, 2023

This was referenced Jun 27, 2023

FY2024 Q2 #109

Closed

FY2024 Q3 #110

Closed

chadwhitacre changed the title ~~Self-hosted => SaaS Migration Tool~~ Self-hosted => SaaS Relocation Tool Jul 18, 2023

chadwhitacre changed the title ~~Self-hosted => SaaS Relocation Tool~~ Improve self-hosted ⇒ SaaS conversion alongside EU rollout Jul 24, 2023

chadwhitacre changed the title ~~Improve self-hosted ⇒ SaaS conversion alongside EU rollout~~ 🛠️ Improve self-hosted ⇒ SaaS conversion alongside EU rollout Jul 24, 2023

chadwhitacre changed the title ~~🛠️ Improve self-hosted ⇒ SaaS conversion alongside EU rollout~~ 🧰 Improve self-hosted ⇒ SaaS conversion alongside EU rollout Jul 24, 2023

azaslavsky mentioned this issue Jul 27, 2023

Scope for migration tool #154

Closed

azaslavsky mentioned this issue Aug 9, 2023

User reconciliation logic for importer User export scoped models #181

Closed

This was referenced Aug 24, 2023

Problem while trying to backup (legacy) getsentry/self-hosted#2353

Closed

Relocation tool cleanup #190

Open

Backup is not restored getsentry/self-hosted#2366

Closed

chadwhitacre mentioned this issue Sep 8, 2023

Include releases in backup getsentry/sentry#23947

Open

azaslavsky mentioned this issue Sep 13, 2023

Restore doesn't properly restore everything getsentry/self-hosted#2394

Closed

chadwhitacre mentioned this issue Oct 13, 2023

FY2024 Q4 #189

Closed

26 tasks

chadwhitacre changed the title ~~🧰 Improve self-hosted ⇒ SaaS conversion alongside EU rollout~~ Improve self-hosted ⇒ SaaS conversion alongside EU rollout Oct 27, 2023

chadwhitacre mentioned this issue Apr 30, 2024

Roll out relocation feature #244

Closed

6 tasks

chadwhitacre closed this as completed Oct 21, 2024
