Improve self-hosted ⇒ SaaS conversion alongside EU rollout #153
Comments
After talking with PMM, I've added a new stretch goal for this project: to get the self-hosted broadcast system running. It would be great in general to be able to send "What's New" messages to self-hosted users (we could announce new versions, for example, especially out-of-band releases, as well as the Self-hosted Sesh). This could also entail understanding why the beacon data is so off, since the beacon and broadcasts are related.
Chatted with @azaslavsky; I'm going to help with recruiting self-hosted users to help us develop and test out this process. I'll start my prospecting on getsentry/sentry#49564, and will likely end up working with SE on this as well.
Talked on OSPO team meeting ... EU is a big rollout; if it makes sense, let's ship to US first as a soft launch and take it to EU when that's fully ready.
Relocation is live in US and EU. 👍
Problem
The current conversion rate for users migrating (henceforth referred to as “relocating”, to differentiate it from normal database migrations) from self-hosted is not ideal: less than 1% of users who enter the funnel successfully become Sentry SaaS customers.
Relocation also places a heavy burden on ops support, as each relocation must be carried out manually. The volume of these support efforts is expected to increase greatly with the debut of EU-region support in the second half of this year. For users that are already on SaaS Sentry, a similar relocation may need to occur as they leverage the hybrid cloud effort to move regions.
Finally, the current method of relocation, using an ad-hoc, manually executed script as a second “backdoor” method of importing, is untested and difficult to maintain. This has resulted in subtle schema skew bugs that have taken significant effort to fix in the past, and could have been much more damaging had they not been caught quickly.
Goals
There are several goal targets: a single, well-tested implementation of import/export; a decreased ops burden; and ensured data correctness. In particular, we want to retire the load-it-up.py script, and the process/playbook that surrounds it.
Non-Goals
There are a number of potential future improvements we are explicitly not optimizing for in this first pass. This is not to say that we won’t be interested in circling back and implementing them after the relocation pipeline is healthy and running (see Potential Future Work below), just that they are not strictly in scope for the first milestone.
Increasing the scope of relocatable artifacts: Currently, some items are not in scope for relocation, particularly issues and events. That is, when you relocate, you keep your users/orgs/projects, but lose your issues/events, among others. It would certainly be desirable to implement this eventually and make relocation completely “seamless”, but is not necessary for a first pass. Every goal listed in this document (a single, well-tested implementation, decreased ops burden, ensuring data correctness, etc) is a necessary prerequisite to supporting issue and event relocation, so let’s focus on walking before we can run.
One-click and server-to-server relocation: While it is tempting in the long term to make relocations entirely self-serve, up to and including directly connecting the self-hosted server to the SaaS backend and gradually shifting users over without disrupting event flow, these are much bigger projects that would involve a lot more orchestration and robustness. Having users manually move JSON blobs is okay to start.
Enabling merging or update operations: The purpose here is to get existing self-hosted and inter-region users up and running on a “fresh” SaaS account as quickly and painlessly as reasonably possible. Doing complex merges or in-place updates/overwrites on existing accounts is out of scope, as the vast majority of users in this funnel are trying to set up a new account. Merge and update operations introduce a lot of stateful edge cases that will be difficult to enumerate and test for.
Assumptions
The main assumption is that the current conversion rate in the funnel is primarily blocked on the slowness and difficulty of the relocation. It is possible, though intuitively unlikely, that we make the relocation much easier and conversion rates do not meaningfully increase.
Another assumption is that the organization-merging functionality of the load-it-up.py script is vestigial and not needed by ops, and that we should therefore prefer to just keep the original organizations (modulo changing slugs) when they appear in a backup.
Proposal
We propose to do the following:
Write a thorough set of test cases for both backup.py and load-it-up.py. In theory, the import_ method on backup.py should have sufficient flexibility to replicate everything that load-it-up.py does (modulo the merging of orgs, see above), so a good end state is to have both scripts pass the same set of tests.
Modify import_ to use .create() instead of .save(), and to call the serializer’s .validate() method before .create() (we may opt to keep the old functionality behind a self-hosted-only flag). This will make import_ an INSERT-only script, will ensure that data is validated before being ingested, and will prevent any existing data from being modified on the relocation target. A sketch of this flow follows the proposal below.
Once we are confident that load-it-up.py can be retired in favor of the import_ flow on backup.py, and that the backup.py import/export functionality can be used on both SaaS and self-hosted, we will add an API endpoint to perform imports for new accounts. This would probably involve importing to some siloed or otherwise protected “import database” and validating the data, before relocating all of that database’s data to the main SaaS database.
Add a screen during on-boarding (post email-verification) that allows users to upload their exported self-hosted JSON backup (note: these could be quite large, so even with user verification in place, we’ll still need to think a bit about resource limits here). This would hit the endpoint described above, and send the user an email when their relocation succeeds, or otherwise notify them that it failed and open a ticket.
In the case of failure (that is, the user uploaded a JSON backup that could not be validated), we will inform the user and automatically open a support ticket on their behalf.
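As a rough illustration of the import_ change proposed above, here is a minimal sketch of an insert-only import loop that validates each record before writing it. The fixture layout ({"model": ..., "fields": ...}) and the get_serializer_for_model() helper are hypothetical stand-ins for however backup.py actually resolves serializers, not its real API.

```python
# A minimal sketch of the proposed INSERT-only import flow, assuming a
# DRF-style serializer per model. get_serializer_for_model() and the
# {"model": ..., "fields": ...} fixture layout are assumptions, not the
# real backup.py interface.
import json


def get_serializer_for_model(label):
    """Hypothetical lookup from a model label to its import serializer."""
    raise NotImplementedError


def import_(src):
    imported = []
    for record in json.load(src):
        serializer_cls = get_serializer_for_model(record["model"])
        serializer = serializer_cls(data=record["fields"])
        # is_valid() runs the serializer's validate() hooks, so malformed
        # data is rejected before anything touches the target database.
        serializer.is_valid(raise_exception=True)
        # With no existing instance bound, save() dispatches to create(),
        # so the import can only INSERT new rows, never UPDATE existing ones.
        imported.append(serializer.save())
    return imported
```

Keeping the old .save()-based behavior behind a self-hosted-only flag would then amount to a single branch around the create step.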
Risks
The major risk is that by changing the process, which works in its own brittle way at the moment, we introduce production breakages or data corruption. To mitigate this, we will need an extensive test suite that guarantees this process won’t damage data on either the exporting or importing side.
In terms of resources and API design, we are going to be importing and then processing very large JSON blobs, then merging them into production databases. Care will need to be taken to ensure that these operations are all properly secured and throttled, so that we don’t introduce user-input vulnerabilities, whether from malicious intent or simply from very large inputs.
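One possible shape for that throttling and size-limiting, sketched with standard Django REST Framework primitives; the endpoint class, the "relocation-import" throttle scope, the size cap, and enqueue_relocation() are placeholders rather than decisions made in this document.

```python
# A hedged sketch of securing and throttling the import endpoint with
# standard DRF tools. The class name, throttle scope, size cap, and
# enqueue_relocation() are all placeholders.
from rest_framework.parsers import FileUploadParser
from rest_framework.response import Response
from rest_framework.throttling import ScopedRateThrottle
from rest_framework.views import APIView

MAX_BACKUP_BYTES = 512 * 1024 * 1024  # placeholder cap on uploaded backups


def enqueue_relocation(backup):
    """Hypothetical hook that hands the blob to an async validation/import job."""
    raise NotImplementedError


class RelocationImportEndpoint(APIView):
    parser_classes = [FileUploadParser]
    throttle_classes = [ScopedRateThrottle]
    # Rate configured in REST_FRAMEWORK["DEFAULT_THROTTLE_RATES"], e.g. "1/day".
    throttle_scope = "relocation-import"

    def post(self, request, filename=None):
        backup = request.data["file"]
        if backup.size > MAX_BACKUP_BYTES:
            return Response(status=413)  # payload too large
        # Never process the blob inline on the web worker; validate and
        # import it asynchronously, then email the user with the outcome.
        enqueue_relocation(backup)
        return Response(status=202)
```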
Because we are uniting two implementations into one, there is always some risk that some property of one of the implementations will be lost. It is a bit difficult to ascertain how likely this is because of the almost complete lack of tests for both implementations, so we will need to rely on some combination of a new but thorough test suite and user reports to guard against this.
Open Questions
There are some important open questions that will need to be resolved during implementation:
Where exactly will we write data during validation? Will we have a shared “validation” database, or a standalone database spun up for each relocation operation? How will we move the now-validated data into prod: by simply copying rows from the validation database, or by re-running import_ on the validated JSON, but now pointing at prod?
Should we loosen, or remove, import atomicity during validation? Atomicity has the benefit of allowing an all-or-nothing transaction for the entire import, but the downside of potentially locking up a database for large imports. More research is needed to figure out the best path forward. One way these two knobs could be expressed is sketched after the questions below.
How will we mitigate potential performance issues when importing large blobs? We probably won’t have much contention for the validation database (and can disable atomicity for it anyway), but bulk moving an entire Sentry instance’s worth of now-validated data into the production database will require some finesse.
How do we import control silo models, which are (generally) globally scoped? There will be collisions here (for example, users that already exist, or org slugs that are already taken), so we’ll need some sort of custom logic to handle this. It’s hard to imagine avoiding writing special import logic for these on a case-by-case basis.
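To make the first two questions concrete, here is one way the “which database” and “how atomic” knobs could be expressed, assuming a separate DATABASES alias named "validation". The run_import() wrapper and its parameters are illustrative only, not a proposed interface.

```python
# A sketch of the two knobs discussed above: which database alias the
# import writes to, and whether it runs as one all-or-nothing transaction.
# The "validation" alias and this wrapper are assumptions for illustration.
import contextlib

from django.db import transaction


def run_import(import_fn, src, *, using="validation", atomic=True):
    """Run an import callable against a chosen database alias.

    atomic=True gives all-or-nothing semantics for the whole import;
    atomic=False avoids holding a long transaction open on huge backups.
    """
    guard = transaction.atomic(using=using) if atomic else contextlib.nullcontext()
    with guard:
        # The import callable must route its own writes to the same alias,
        # e.g. Model.objects.using(using).create(...).
        return import_fn(src)
```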
Potential Future Work
All of the non-goals mentioned above (increasing the scope of relocatable artifacts, one-click server-to-server integration, and more customizable and precise relocation operations) are on the table as we move forward. In particular, it would be very nice to get to an end state where users start a relocation (either self-hosted -> SaaS, or SaaS region-to-region via hybrid cloud), and we seamlessly move 100% of their region-siloed data over in a way that is almost entirely opaque to them. This could include temporarily forwarding events that occur while the relocation is taking place, and carefully handing over control between the source and target of the relocation, so that from the user perspective, the whole operation is “one click and wait for a confirmation email” easy.
Q3 Milestones
User export scoped models #181
unique=True field collisions on import #193
Q4 Workstreams
Not Yet