-
Notifications
You must be signed in to change notification settings - Fork 50
Add zero-downtime deployments & data transformations guide #1082
Conversation
API Developer Docs Preview: Ready https://wordpress.github.io/openverse-api/_preview/1082 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite eye-opening. Thank you for starting to write this, Sara. The way of how zero-downtime deployments work makes so much sense after reading it! (I was a bit worried at the begging of the API ECS-ification project for this part TBH). I wonder if you are reflecting on it from pure previous experience or if you know more documentation and resources on this topic. I'd love to read more.
api/docs/guides/data-migrations.md
Outdated
|
||
## Django management command data migrations | ||
|
||
### Why use management commands for data migrations instead of SQL? | ||
|
||
Django comes with a data migration feature built in that allows executing data transformations using SQL. If you want to move data between two columns, it is trivial to do so with SQL and Django makes it just as easy. [Documentation for this Django feature is available here](https://docs.djangoproject.com/en/4.1/topics/migrations/#data-migrations). | ||
|
||
When considering the potential issues with using SQL data migrations with our current deployment strategy, keep in mind the following details: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a doubt about this section. The title and subtitle make it sound like Django management commands are the solution to the problems described here, but I'm not sure how that is the case. After all, DB-related commands are mostly plain SQL under the hood.
I can see the Django commands as an easier way to handle migrations automatically for our future improved CI/CD pipeline but how are they supposed to solve the problems of time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to find a way to clarify the reasons why to use management commands at the start, in a clearer and more concise way. I'll summarise it here though, and you can let me know if there are still doubts that need to be clarified and addressed.
The reason why management commands solve the issue at hand (long-running data migrations), isn't because they make data migrations shorter or not use SQL. Of course, as you said, they still use SQL as they'll still be manipulating the database via the same tools that a migration file would. The difference is that a management command allows us to break up the SQL commands to iterate over the set of data that needs to be transformed and critically not at application startup. Additionally, they have the following features:
- They can be stopped and started to allow for other more critical operations to take precedence or to recover from unexpected errors
- They can be metered to prevent creating significant read/write lag that would hurt the overall application performance
Both of these might be doable in pure SQL (maybe through a stored procedure, for example), but it's more effort, harder to test and have standardised tools, and I don't think we have great tools at the moment for creating or maintaining stored procedures in the API anyway.
Management commands solve these problems in a familiar setting, with testing tools that we already know how to use and for which we have invested time into building utilities like data fixtures and helpers.
The main benefit, though, is being able to break up the transformation into smaller, iterable chunks, that can be executed outside of application startup time (which we can't do with Django migrations without making significant complications to our deployment workflow).
This is based on previous experience. I actually haven't read any additional documents describing this process for Django, but the overall difficulty with database schema changes is a well known issue with automated zero-downtime deployments. There are lots of resources online discussing the issue and describing scenarios similar to the column name change example I share in the document. Doing a quick search, I can't find anyone describing the process here using management commands. The only tool that Django has for dealing with part of this issue (data loss due to an unexpected error during a long-running data transformation) is to set migrations to be non-atomic: https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations Here's a guide about zero-downtime migrations in Django, but focused exclusively on adding/removing tables and columns, rather than things like long-running data transformations that you can put into Django SQL migrations: https://gist.github.com/majackson/493c3d6d4476914ca9da63f84247407b It's a useful resource still though, as it gives a good list of steps for each of some very common situations, so I'll include a link to it as well. |
@krysal In response to your questions, I realised that it might make the document make more sense (motivations wise) to reframe it as a general document about zero-downtime deployments with data transformations as a special case. I changed the language to more clearly distinguish between a "Django migration based data transformation" and a "management command based data transformation", primarily by switching to use "data transformation" rather than "data migration" as the generic term. Hopefully this helps clarify what was already there, but I am undrafting now and am eager to hear further thoughts on whether the suggested guidelines for data transformations make more sense and what I need to further clarify (or scrap entirely 😅) |
775bed7
to
c358e2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is seriously excellent! Thanks for making it a bit more generic too 🙂 I think one other piece that could be helpful here is an example or template Django management command which performs a data transformation as described in the document. I don't think it necessarily needs to be part of this PR, but I know for myself it'd be a useful reference when making the data transformations (especially if we end up deleting the code once the transformation is run). Thanks also for your example regarding column renaming, it feels appropriate particularly for #719 😅
That's a great idea and something I'd love to do. I think there are some clever ways to make some generic tools or a base class that gives the outline. But also, yes, I think it's sufficiently complex work to have as a separate issue, if that is okay with other reviewers. |
Co-authored-by: Madison Swain-Bowden <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read it carefully, and I must reiterate, this is an excellent piece of documentation! ⭐️ Thanks a lot for writing it. It makes clear the need for Django management commands for data migrations and an outline of how to do it.
I think it may include these two benefits you mentioned in the issue description too:
- Can be throttled to prevent overwhelming database load;
- The transformation can be unit tested including more easily testing data edge cases that might be easy to forget about (and even harder to handle) in a regular SQL data migration;
Also, +1 for an example or template with throttling, but this may wait even for when the need arises.
1. Once the data transformation is complete, deploy a new version of the | ||
application that removes the old column and the fallback reads to it and only | ||
uses the new column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's pertinent to add this note here or in a new step.
- Once the data transformation is complete, deploy a new version of the
application that removes the old column and the fallback reads to it and only
uses the new column. Also, add the corresponding constraints for the said column if required, e.g. non-nullable, default value, etc.
- Migrations are run _at the time of deployment_ by the first instance of the | ||
new version of the application that runs in the pool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maybe link the code here? I know it can drift in the future, but it will help to know that this can be configured as well.
- Migrations are run _at the time of deployment_ by the first instance of the | |
new version of the application that runs in the pool. | |
- Migrations are run _at the time of deployment_ by the first instance of the | |
new version of the application that runs in the pool (when is configured for that with the [`DJANGO_MIGRATE_DB_ON_STARTUP`](https://github.com/WordPress/openverse-api/blob/main/api/env.docker#L15) variable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be the case once ECS is deployed, so maybe I'll make a note about that instead. The variable would be set in our infrastructure anyway, which we can't link to from this document, so just linking to the variable doesn't give a ton of new information to the reader, but it's probably easier to figure out if we've transitioned to the ECS deployment yet.
Thanks @krysal. I added the two additional benefits to the document, not sure why I forgot to include those, but they are important ones. I also added a clarification about the "automatic" migration running that should be more timeless than either of our initial suggestions. At some point in the future we can remove it, but the information will never be inaccurate (unless we stop zero-downtime deployments 😅) |
Fixes
Fixes #1030 by @sarayourfriend
Description
I am still working on this and there are significant sections and details still missing that I want to add before undrafting this. I'll update the PR description when I undraft the PR.
Testing Instructions
Checklist
Update index.md
).main
) ora parent feature branch.
errors.
Developer Certificate of Origin
Developer Certificate of Origin