Improve self-hosted ⇒ SaaS conversion alongside EU rollout #153

Closed
42 of 55 tasks
Tracked by #109
azaslavsky opened this issue Jun 26, 2023 · 6 comments
azaslavsky commented Jun 26, 2023

Problem

The current conversion rate for users migrating (henceforth referred to as “relocating”, to differentiate it from normal database migrations) from self-hosted is far from ideal: fewer than 1% of users who enter the funnel successfully become Sentry SaaS customers.

The relocation also places a heavy burden on ops support, as each relocation must be carried out manually. The volume of these support efforts is expected to increase greatly with the debut of EU-region support in the second half of this year. For users that are already on SaaS Sentry, a similar relocation may need to occur as they leverage the hybrid cloud effort to move regions.

Finally, the current method of relocation, using an ad-hoc, manually executed script as a second “backdoor” method of importing, is untested and difficult to maintain. This has resulted in subtle schema skew bugs🔒 that have taken significant effort to fix in the past, and could have been much more damaging had they not been caught quickly.

Goals

There are several goals:

  • Increase the funnel conversion rate to >=3% by the end of Q4.
  • Retire the entire load-it-up.py script, and the process/playbook that surrounds it.
  • Achieve the above in such a way that production data on both the exporting and importing instances is not corrupted or compromised.
  • Unify and test the implementation: rather than having two poorly tested implementations, we should have one well-tested path for all import/export operations.
  • Bonus: Make the relocation process completely user-driven, thereby totally removing ops from the loop in the normal case, by the end of Q3.

Non-Goals

There are a number of potential future improvements we are explicitly not optimizing for in this first pass. This is not to say that we won’t be interested in circling back and implementing them after the relocation pipeline is healthy and running (see Potential Future Work below), just that they are not strictly in scope for the first milestone.

  • Increasing the scope of relocatable artifacts: Currently, some items are not in scope for relocation, particularly issues and events. That is, when you relocate, you keep your users/orgs/projects, but lose your issues/events, among others. It would certainly be desirable to implement this eventually and make relocation completely “seamless”, but it is not necessary for a first pass. Every goal listed in this document (a single, well-tested implementation, decreased ops burden, ensuring data correctness, etc.) is a necessary prerequisite to supporting issue and event relocation, so let’s focus on walking before we run.

  • One-click and server-to-server relocation: While it is tempting in the long term to make relocations entirely self-serve, up to and including directly connecting the self-hosted server to the SaaS backend and gradually shifting users over without disrupting event flow, these are much bigger projects that would involve a lot more orchestration and robustness. Having users manually move JSON blobs is okay to start.

  • Enabling merging or update operations: The purpose here is to get existing self-hosted and inter-region users up and running on a “fresh” SaaS account as quickly and painlessly as reasonably possible. Doing complex merges or in-place updates/overwrites on existing accounts is out of scope, as the vast majority of users in this funnel are trying to set up a new account. Merge and update operations introduce a lot of stateful edge cases that will be difficult to enumerate and test for.

Assumptions

The main assumption is that the current conversion rate in the funnel is primarily blocked on the slowness and difficulty of the relocation. It is possible, though intuitively unlikely, that we make the relocation much easier and conversion rates do not meaningfully increase.

Another assumption is that the organization-merging functionality of the load-it-up.py script is vestigial and not needed by ops, and that we should therefore prefer to just keep the original organizations (modulo changing slugs) when they appear in a backup.

Proposal

We propose to do the following:

  1. Write a thorough set of test cases for both backup.py and load-it-up.py. In theory, the import_ method on backup.py should have sufficient flexibility to replicate everything that load-it-up.py does (modulo the merging of orgs, see above), so a good end state is to have both scripts pass the same set of tests.

  2. Modify import_ to use .create() instead of .save(), and to call the serializer’s .validate() method before .create() (we may opt to keep the old functionality behind a self-hosted-only flag). This will make import_ an INSERT only script, will ensure that data is validated before being ingested, and will prevent any existing data from being modified on the relocation target.

  3. Once we are confident that load-it-up.py can be retired in favor of the import_ flow on backup.py, and that the backup.py import/export functionality can be used on both SaaS and self-hosted, we will add an API endpoint to perform imports for new accounts. This would probably involve importing to some siloed or otherwise protected “import database” and validating the data, before relocating all of that database’s data to the main SaaS database.

  4. Add a screen during onboarding (post email-verification) that allows users to upload their exported self-hosted JSON backup (note: these could be quite large, so even with user verification in place, we’ll still need to think a bit about resource limits here). This would hit the endpoint described above, and send the user an email when their relocation succeeds.

  5. In the case of failure (that is, the user uploaded a JSON backup that could not be validated), we will inform the user and automatically open a support ticket on their behalf.
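
As a rough illustration of step 2, the sketch below shows what “validate first, then INSERT only” could look like. It is not the actual `backup.py` code: it uses Python’s standard-library `sqlite3` in place of the Django ORM, and the `organization` table, the `validate_entry` helper, and the backup entry shape are all hypothetical stand-ins chosen for the example.

```python
import json
import sqlite3


class ValidationError(Exception):
    """Raised when a backup entry fails validation."""


def validate_entry(entry):
    # Hypothetical checks; a real serializer would enforce the full schema.
    if entry.get("model") != "organization":
        raise ValidationError(f"unsupported model: {entry.get('model')!r}")
    fields = entry.get("fields")
    if not isinstance(fields, dict):
        raise ValidationError("entry is missing its fields")
    slug = fields.get("slug")
    if not isinstance(slug, str) or not slug:
        raise ValidationError("organization requires a non-empty slug")
    return {"slug": slug, "name": str(fields.get("name", slug))}


def import_insert_only(conn, backup_json):
    """Validate every entry, then INSERT all rows in one transaction."""
    # Validation runs over the whole blob before any row is written.
    rows = [validate_entry(e) for e in json.loads(backup_json)]
    # Using the connection as a context manager commits on success and
    # rolls back if any INSERT raises, e.g. on a primary-key collision.
    with conn:
        for row in rows:
            # Plain INSERT (never INSERT OR REPLACE), so an existing row
            # on the relocation target cannot be overwritten.
            conn.execute(
                "INSERT INTO organization (slug, name) VALUES (?, ?)",
                (row["slug"], row["name"]),
            )
    return len(rows)


def make_db():
    # Hypothetical single-table schema used only for this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE organization (slug TEXT PRIMARY KEY, name TEXT)")
    conn.commit()
    return conn
```

In this sketch a backup that collides with an existing row fails as a whole and leaves the target untouched, which is the property step 2 is after. How collisions should actually be resolved is a separate question.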

Risks

The major risk is that by changing the process, which works in its own brittle way at the moment, we introduce production breakages or data corruptions. To mitigate this, great care will need to be taken to ensure that an expansive test suite is provided to guarantee that this process won’t damage data on either the exporting or importing side.

In terms of resources and API design, we are going to be importing and then processing very large JSON blobs, then merging them into production databases. Care will need to be taken to ensure that these operations are all properly secured and throttled, so as not to introduce user-input vulnerabilities, via either malicious intent or simply very large inputs.
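
One simple guard against oversized inputs is to cap how many bytes are read from an upload before any parsing happens. The sketch below is only an illustration of that idea: `read_bounded`, `UploadTooLarge`, and the chunk size are hypothetical names and values, not part of any existing Sentry endpoint.

```python
import io


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured byte limit."""


def read_bounded(stream, max_bytes, chunk_size=64 * 1024):
    """Read a binary stream, aborting once more than max_bytes have arrived."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # Stop as soon as the limit is crossed instead of buffering the
        # whole payload and checking its size afterwards.
        if total > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Example: a 10-byte payload passes a 10-byte limit.
payload = read_bounded(io.BytesIO(b"x" * 10), max_bytes=10)
```

A size cap like this would complement, not replace, request throttling and authentication on the endpoint.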

Because we are uniting two implementations into one, there is always some risk that some property of one of the implementations will be lost. It is a bit difficult to ascertain how likely this is because of the almost complete lack of tests for both implementations, so we will need to rely on some combination of a new but thorough test suite and user reports to guard against this.

Open Questions

There are some important open questions that will need to be resolved during implementation:

  • Where exactly will we write data during validation? Will we have a shared “validation” database, or a standalone database spun up for each relocation operation? How will we move the now-validated data into prod - by simply copying rows from the validation database, or by re-running import_ on the validated JSON, but now pointing at prod?

  • Should we loosen, or remove, import atomicity during validation? Atomicity has the benefit of allowing an all-or-nothing transaction for the entire import, but the downside of potentially locking up a database for large imports. More research is needed to figure out the best path forward.

  • How will we mitigate potential performance issues when importing large blobs? We probably won’t have much contention for the validation database (and can disable atomicity for it anyway), but bulk moving an entire Sentry instance’s worth of now-validated data into the production database will require some finesse.

  • How do we import control silo models, which are (generally) globally scoped? There will be collisions here (for example, users that already exist, or org slugs that are already taken), so we’ll need some sort of custom logic to handle this. It’s hard to imagine avoiding writing special import logic for these on a case-by-case basis.
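
To make the slug-collision question concrete, here is one possible shape for the custom logic, shown purely as a sketch. The numeric-suffix scheme and the `resolve_slug` name are assumptions for illustration; the issue itself only says that slugs may need to change.

```python
def resolve_slug(wanted, taken, max_attempts=1000):
    """Return `wanted` if it is free, else the first free `wanted-N` variant."""
    if wanted not in taken:
        return wanted
    # Hypothetical scheme: try "slug-1", "slug-2", ... until one is free.
    for n in range(1, max_attempts + 1):
        candidate = f"{wanted}-{n}"
        if candidate not in taken:
            return candidate
    raise RuntimeError(f"no free slug found for {wanted!r}")


existing = {"acme", "acme-1"}
new_slug = resolve_slug("acme", existing)
```

Users would likely need different handling (for example, matching on a verified email rather than renaming), which is part of why case-by-case logic seems hard to avoid.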

Potential Future Work

All of the non-goals mentioned above (increasing the scope of relocatable artifacts, one-click server-to-server integration, and more customizable and precise relocation operations) are on the table as we move forward. In particular, it would be very nice to get to an end state where users start a relocation (either self-hosted -> SaaS, or SaaS region-to-region via hybrid cloud), and we seamlessly move 100% of their region-siloed data over in a way that is almost entirely opaque to them. This could include temporarily forwarding events that occur while the relocation is taking place, and carefully handing over control between the source and target of the relocation, so that from the user perspective, the whole operation is “one click and wait for a confirmation email” easy.


Q3 Milestones

  1. 4 of 4
  2. 2 of 2

Q4 Workstreams


Not Yet

@azaslavsky azaslavsky self-assigned this Jun 26, 2023
This was referenced Jun 27, 2023
@chadwhitacre changed the title from “Self-hosted => SaaS Migration Tool” to “Self-hosted => SaaS Relocation Tool” Jul 18, 2023
@chadwhitacre

  • @azaslavsky to reconcile the two requirements docs (one🔒, two🔒) and update the description for this ticket.
  • @gauthamcs and @azaslavsky to identify SE stakeholder.
  • @chadwhitacre to set up weekly meeting and modify Slack channel.
  • Open conversation threads:
    • validation workflow for prod
    • including issues in scope

@chadwhitacre changed the title from “Self-hosted => SaaS Relocation Tool” to “Improve self-hosted ⇒ SaaS conversion alongside EU rollout” Jul 24, 2023
@chadwhitacre changed the title from “Improve self-hosted ⇒ SaaS conversion alongside EU rollout” to “🛠️ Improve self-hosted ⇒ SaaS conversion alongside EU rollout” Jul 24, 2023
@chadwhitacre changed the title from “🛠️ Improve self-hosted ⇒ SaaS conversion alongside EU rollout” to “🧰 Improve self-hosted ⇒ SaaS conversion alongside EU rollout” Jul 24, 2023
@chadwhitacre

chadwhitacre commented Aug 3, 2023

After talking with PMM, I've added a new stretch goal for this project: to get the self-hosted broadcast system running. It would be great in general to be able to send "What's New" messages to self-hosted users (we could announce new versions, for example, especially out-of-band releases, as well as the Self-hosted Sesh). This could also entail understanding why the beacon data is so off, since the beacon and broadcasts are related.

@chadwhitacre

@chadwhitacre changed the title from “🧰 Improve self-hosted ⇒ SaaS conversion alongside EU rollout” to “Improve self-hosted ⇒ SaaS conversion alongside EU rollout” Oct 27, 2023
@chadwhitacre

Chatted with @azaslavsky. I'm going to help with recruiting self-hosted users to help us develop and test out this process, and I'll start my prospecting on getsentry/sentry#49564. We'll likely end up working with SE on this as well.

@chadwhitacre

Talked about this at the OSPO team meeting: EU is a big rollout, so if it makes sense, let's ship to US first as a soft launch and take it to EU when that's fully ready.

@chadwhitacre

Relocation is live in US and EU. 👍


2 participants