Add zero-downtime deployments & data transformations guide #1082

sarayourfriend · 2023-01-13T06:09:00Z

Fixes

Description

I am still working on this and there are significant sections and details still missing that I want to add before undrafting this. I'll update the PR description when I undraft the PR.

Testing Instructions

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

github-actions · 2023-01-13T06:10:07Z

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/1082

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub Pages deployment action list to see the current status of the deployments.

krysal

This is quite eye-opening. Thank you for starting to write this, Sara. The way zero-downtime deployments work makes so much sense after reading it! (I was a bit worried about this part at the beginning of the API ECS-ification project, TBH.) I wonder whether you are drawing purely on previous experience or if you know of other documentation and resources on this topic. I'd love to read more.

krysal · 2023-01-27T15:01:08Z

api/docs/guides/data-migrations.md

+
+## Django management command data migrations
+
+### Why use management commands for data migrations instead of SQL?
+
+Django comes with a data migration feature built in that allows executing data transformations using SQL. If you want to move data between two columns, it is trivial to do so with SQL and Django makes it just as easy. [Documentation for this Django feature is available here](https://docs.djangoproject.com/en/4.1/topics/migrations/#data-migrations).
+
+When considering the potential issues with using SQL data migrations with our current deployment strategy, keep in mind the following details:
+


I have a question about this section. The title and subtitle make it sound like Django management commands are the solution to the problems described here, but I'm not sure how that is the case. After all, DB-related commands are mostly plain SQL under the hood.
I can see the Django commands as an easier way to handle migrations automatically in our future, improved CI/CD pipeline, but how are they supposed to solve the timing problems?

I will try to find a clearer and more concise way to explain, at the start of the section, why we use management commands. I'll summarise it here though, and you can let me know if there are still doubts that need to be clarified and addressed.

The reason management commands solve the issue at hand (long-running data migrations) isn't that they make data migrations shorter or avoid SQL. As you said, they still use SQL, because they manipulate the database via the same tools that a migration file would. The difference is that a management command lets us break up the SQL so that it iterates over the set of data that needs to be transformed in pieces and, critically, runs outside of application startup. Additionally, they have the following features:

They can be stopped and started to allow for other more critical operations to take precedence or to recover from unexpected errors

They can be metered to prevent creating significant read/write lag that would hurt the overall application performance

Both of these might be doable in pure SQL (maybe through a stored procedure, for example), but that is more effort, harder to test, and harder to build standardised tools for, and I don't think we have great tools at the moment for creating or maintaining stored procedures in the API anyway.

Management commands solve these problems in a familiar setting, with testing tools that we already know how to use and for which we have invested time into building utilities like data fixtures and helpers.

The main benefit, though, is being able to break up the transformation into smaller, iterable chunks that can be executed outside of application startup (which we can't do with Django migrations without significantly complicating our deployment workflow).
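To make "smaller, iterable chunks" that can be stopped, restarted, and metered a bit more concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for the API database, and the `media` table with `old_title`/`new_title` columns is hypothetical. In an actual Django management command, the same loop would presumably live in the command's `handle()` method and go through the ORM or Django's database connection instead.

```python
import sqlite3
import time
from typing import Optional


def copy_column_in_batches(
    conn: sqlite3.Connection,
    batch_size: int = 1000,
    pause_seconds: float = 0.0,
    max_batches: Optional[int] = None,
) -> int:
    """Copy the hypothetical `old_title` column into `new_title`, one batch at a time.

    Only rows whose `new_title` is still NULL are selected, so running the
    function again after it was stopped (or crashed) resumes where it left off.
    Rows whose `old_title` is NULL are skipped because there is nothing to copy.
    Returns the number of rows copied by this run.
    """
    copied = 0
    batches = 0
    last_id = 0
    while max_batches is None or batches < max_batches:
        # Keyset pagination on the primary key keeps each query small and cheap.
        rows = conn.execute(
            "SELECT id, old_title FROM media"
            " WHERE id > ? AND new_title IS NULL AND old_title IS NOT NULL"
            " ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "UPDATE media SET new_title = ? WHERE id = ?",
            [(title, row_id) for row_id, title in rows],
        )
        conn.commit()  # each batch is its own short transaction
        copied += len(rows)
        batches += 1
        last_id = rows[-1][0]
        time.sleep(pause_seconds)  # metering: leave room for other queries between batches
    return copied
```

Stopping the process at any point loses at most the batch that had not been committed yet, and calling `copy_column_in_batches(conn, batch_size=500, pause_seconds=0.5)` again simply picks up the remaining rows.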

sarayourfriend · 2023-01-29T21:38:33Z

This is based on previous experience. I actually haven't read any additional documents describing this process for Django, but the overall difficulty with database schema changes is a well known issue with automated zero-downtime deployments. There are lots of resources online discussing the issue and describing scenarios similar to the column name change example I share in the document.

Doing a quick search, I can't find anyone describing the process here using management commands. The only tool that Django has for dealing with part of this issue (data loss due to an unexpected error during a long-running data transformation) is to set migrations to be non-atomic: https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations
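Setting `atomic = False` on a Django `Migration` class stops Django from wrapping the whole migration in a single transaction, so work can be committed in smaller pieces. The sketch below is not Django code: it uses the standard library's `sqlite3` and a hypothetical `items` table to show what that difference means when an unexpected error interrupts a long transformation partway through.

```python
import sqlite3
from typing import Optional


def make_items_db(count: int) -> sqlite3.Connection:
    """Create an in-memory table of `count` rows that all still need processing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)")
    conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(1, count + 1)])
    conn.commit()
    return conn


def mark_done(
    conn: sqlite3.Connection,
    batch_size: int,
    atomic: bool,
    fail_on_id: Optional[int] = None,
) -> None:
    """Set `done = 1` on pending rows; raise on `fail_on_id` to simulate a crash."""
    pending = [row[0] for row in conn.execute("SELECT id FROM items WHERE done = 0 ORDER BY id")]
    try:
        for start in range(0, len(pending), batch_size):
            for row_id in pending[start:start + batch_size]:
                if row_id == fail_on_id:
                    raise RuntimeError(f"unexpected error on row {row_id}")
                conn.execute("UPDATE items SET done = 1 WHERE id = ?", (row_id,))
            if not atomic:
                conn.commit()  # persist this batch before starting the next one
        conn.commit()
    except RuntimeError:
        conn.rollback()  # discards only the work done since the last commit
        raise


def count_done(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM items WHERE done = 1").fetchone()[0]
```

With `atomic=True`, an error at row 7 (batch size 3) throws away the six rows already processed; with per-batch commits those six rows survive, and a second run only has to handle the remaining four. A non-atomic migration still runs at deployment time, though, which is why it only addresses part of the problem.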

Here's a guide about zero-downtime migrations in Django, but focused exclusively on adding/removing tables and columns, rather than things like long-running data transformations that you can put into Django SQL migrations: https://gist.github.com/majackson/493c3d6d4476914ca9da63f84247407b

It's still a useful resource, though, as it gives a good list of steps for several very common situations, so I'll include a link to it as well.

sarayourfriend · 2023-01-30T03:46:00Z

@krysal In response to your questions, I realised that the document's motivation might be clearer if I reframed it as a general document about zero-downtime deployments, with data transformations as a special case. I changed the language to more clearly distinguish between a "Django migration based data transformation" and a "management command based data transformation", primarily by using "data transformation" rather than "data migration" as the generic term. Hopefully this helps clarify what was already there. I am undrafting now and am eager to hear further thoughts on whether the suggested guidelines for data transformations make more sense and what I need to clarify further (or scrap entirely 😅)

AetherUnbound

This is seriously excellent! Thanks for making it a bit more generic too 🙂 I think one other piece that could be helpful here is an example or template Django management command that performs a data transformation as described in the document. I don't think it necessarily needs to be part of this PR, but I know I'd find it a useful reference when writing data transformations (especially if we end up deleting the code once the transformation is run). Thanks also for your example regarding column renaming; it feels particularly appropriate for #719 😅

api/docs/guides/zero-downtime-database-management.md

sarayourfriend · 2023-01-31T23:26:50Z

I think one other piece that could be helpful here is an example or template Django management command which performs a data transformation as described in the document.

That's a great idea and something I'd love to do. I think there are some clever ways to make some generic tools or a base class that gives the outline.

But also, yes, I think it's sufficiently complex work to have as a separate issue, if that is okay with other reviewers.
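As a rough sketch of what a "base class that gives the outline" could look like, here is one possible shape in plain Python. Every name is hypothetical and nothing here exists in the API codebase: the base class owns the loop, batch size, and throttling, and a subclass only supplies `fetch_batch` and `apply_batch`. A Django management command could then call `run()` from its `handle()` method.

```python
import time
from abc import ABC, abstractmethod
from typing import Generic, List, Optional, Sequence, Set, TypeVar

RowT = TypeVar("RowT")


class BatchedTransformation(ABC, Generic[RowT]):
    """Hypothetical base class: subclasses define what to transform, the base class how."""

    def __init__(self, batch_size: int = 500, pause_seconds: float = 0.0) -> None:
        self.batch_size = batch_size
        self.pause_seconds = pause_seconds

    @abstractmethod
    def fetch_batch(self, limit: int) -> Sequence[RowT]:
        """Return up to `limit` rows that still need transforming; empty when finished."""

    @abstractmethod
    def apply_batch(self, rows: Sequence[RowT]) -> None:
        """Transform and persist `rows`, ideally in one short transaction."""

    def run(self, max_batches: Optional[int] = None) -> int:
        """Process batches until none remain (or `max_batches` is hit); return rows processed."""
        processed = 0
        batches = 0
        while max_batches is None or batches < max_batches:
            rows = self.fetch_batch(self.batch_size)
            if not rows:
                break
            self.apply_batch(rows)
            processed += len(rows)
            batches += 1
            time.sleep(self.pause_seconds)  # throttle between batches
        return processed


class UppercaseTags(BatchedTransformation[int]):
    """Toy in-memory subclass: uppercase every tag, tracking progress by index."""

    def __init__(self, tags: List[str], batch_size: int = 500, pause_seconds: float = 0.0) -> None:
        super().__init__(batch_size, pause_seconds)
        self.tags = tags
        self.done: Set[int] = set()

    def fetch_batch(self, limit: int) -> Sequence[int]:
        return [i for i in range(len(self.tags)) if i not in self.done][:limit]

    def apply_batch(self, rows: Sequence[int]) -> None:
        for i in rows:
            self.tags[i] = self.tags[i].upper()
            self.done.add(i)
```

Keeping `fetch_batch` responsible for "what is left to do" is what makes `run()` safe to stop and call again later.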

Co-authored-by: Madison Swain-Bowden <[email protected]>

krysal

I read it carefully, and I must reiterate: this is an excellent piece of documentation! ⭐️ Thanks a lot for writing it. It makes clear why we need Django management commands for data migrations and outlines how to write them.

I think it could also include these two benefits you mentioned in the issue description:

Can be throttled to prevent overwhelming database load;

The transformation can be unit tested, which makes it easier to test data edge cases that might be easy to forget about (and even harder to handle) in a regular SQL data migration;
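To illustrate the unit-testing benefit: if the per-row logic lives in a plain function, its edge cases can be tested without a database at all. Both the function and its rules (trim whitespace, store blank values as NULL) are hypothetical examples, not an existing transformation in the API.

```python
import unittest
from typing import Optional


def normalize_title(value: Optional[str]) -> Optional[str]:
    """Hypothetical per-row transformation: trim whitespace and map blank values to NULL."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


class NormalizeTitleTests(unittest.TestCase):
    def test_regular_value_is_trimmed(self):
        self.assertEqual(normalize_title("  Sunset  "), "Sunset")

    def test_none_stays_none(self):
        self.assertIsNone(normalize_title(None))

    def test_blank_string_becomes_none(self):
        self.assertIsNone(normalize_title("   "))

    def test_non_ascii_is_preserved(self):
        self.assertEqual(normalize_title(" Café "), "Café")
```

These cases run under `python -m unittest` or pytest (which also collects `unittest.TestCase` classes), so a management command that calls `normalize_title` for each row inherits that coverage.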

Also, +1 for an example or template with throttling, but this could even wait until the need arises.

krysal · 2023-02-07T23:35:10Z

api/docs/guides/zero-downtime-database-management.md

+1. Once the data transformation is complete, deploy a new version of the
+   application that removes the old column and the fallback reads to it and only
+   uses the new column.


I wonder if it's pertinent to add this note here or in a new step.

Once the data transformation is complete, deploy a new version of the
application that removes the old column and the fallback reads to it and only
uses the new column. Also, add any constraints the new column requires, e.g. non-nullable, default value, etc.
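The intermediate state that this final step cleans up, where the application reads the new column and falls back to the old one, might look roughly like the sketch below. The `MediaRow` class and the `old_name`/`new_name` fields are hypothetical. Writing to both columns is an assumption (a common choice so that instances still running the previous version keep seeing fresh values), not necessarily the guide's exact step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaRow:
    """Hypothetical row shape while both the old and the new column exist."""

    old_name: Optional[str] = None
    new_name: Optional[str] = None


def read_name(row: MediaRow) -> Optional[str]:
    # Prefer the new column; fall back to the old one for rows the
    # data transformation has not reached yet.
    return row.new_name if row.new_name is not None else row.old_name


def write_name(row: MediaRow, value: Optional[str]) -> None:
    # Assumed dual write: keep both columns in sync until the old one is dropped.
    row.old_name = value
    row.new_name = value
```

Once the transformation has filled `new_name` everywhere, the follow-up release deletes the fallback branch in `read_name`, stops writing `old_name`, drops the column, and adds whatever constraints `new_name` needs.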

krysal · 2023-02-07T23:49:21Z

api/docs/guides/zero-downtime-database-management.md

+- Migrations are run _at the time of deployment_ by the first instance of the
+  new version of the application that runs in the pool.


Can we maybe link the code here? I know it can drift in the future, but it will help to know that this can be configured as well.

Suggested change

-- Migrations are run _at the time of deployment_ by the first instance of the
-  new version of the application that runs in the pool.
+- Migrations are run _at the time of deployment_ by the first instance of the
+  new version of the application that runs in the pool (when it is configured to do so with the [`DJANGO_MIGRATE_DB_ON_STARTUP`](https://github.com/WordPress/openverse-api/blob/main/api/env.docker#L15) variable).

sarayourfriend · 2023-02-08

This will only be the case once ECS is deployed, so maybe I'll make a note about that instead. The variable would be set in our infrastructure anyway, which we can't link to from this document, so linking to the variable alone doesn't give the reader much new information. A note would probably also make it easier to tell whether we've transitioned to the ECS deployment yet.

sarayourfriend · 2023-02-08T03:35:15Z

Thanks @krysal. I added the two additional benefits to the document, not sure why I forgot to include those, but they are important ones. I also added a clarification about the "automatic" migration running that should be more timeless than either of our initial suggestions. At some point in the future we can remove it, but the information will never be inaccurate (unless we stop zero-downtime deployments 😅)

openverse-bot added 🌟 goal: addition, 📄 aspect: text, and 🟩 priority: low labels Jan 13, 2023

krysal reviewed Jan 27, 2023


sarayourfriend changed the title ~~Add data migration guide~~ Add zero-downtime deployments & data transformations guide Jan 30, 2023

sarayourfriend marked this pull request as ready for review January 30, 2023 03:46

sarayourfriend requested a review from a team as a code owner January 30, 2023 03:46

sarayourfriend requested review from krysal and AetherUnbound January 30, 2023 03:46

sarayourfriend added 3 commits January 30, 2023 14:46

Add incomplete draft of data migration guidelines

720d96d

Add more details about data migrations

8751cea

Refocus document on zero-downtime deployments generally

c358e2d

sarayourfriend force-pushed the add/data-migration-documentation branch from 775bed7 to c358e2d January 30, 2023 03:46

AetherUnbound approved these changes Jan 31, 2023



AetherUnbound mentioned this pull request Jan 31, 2023

Rename nsfw_reports to nsfw_reports_image #719

Closed


sarayourfriend and others added 2 commits February 1, 2023 10:27

Update api/docs/guides/zero-downtime-database-management.md

fe92af9

Co-authored-by: Madison Swain-Bowden <[email protected]>

Remove outstanding comment

2ab50cc

krysal approved these changes Feb 8, 2023


Add additional clarifications from @krysal

7269ed8

sarayourfriend merged commit 2450c38 into main Feb 8, 2023

sarayourfriend deleted the add/data-migration-documentation branch February 8, 2023 19:33
