diff --git a/doc/md/versioned/down.mdx b/doc/md/versioned/down.mdx index 65f3cb2f51a..d22bf26e4c0 100644 --- a/doc/md/versioned/down.mdx +++ b/doc/md/versioned/down.mdx @@ -154,7 +154,7 @@ atlas migrate down \ 3\. After downgrading your database to the desired version, you can safely delete the migration file `20240305171146.sql` -from the migration directory and then run `atlas migrate hash` to update the `atlas.sum` file. +from the migration directory by running `atlas migrate rm 20240305171146`. 4\. After the file was deleted and the database downgraded, you can generate a new migration using the `atlas migrate diff` command with the optional `--edit` flag to open the generated file in your default editor. diff --git a/doc/website/blog/2024-04-01-migrate-down.mdx b/doc/website/blog/2024-04-01-migrate-down.mdx new file mode 100644 index 00000000000..27771b65ffd --- /dev/null +++ b/doc/website/blog/2024-04-01-migrate-down.mdx @@ -0,0 +1,394 @@ +--- +title: The Myth of Down Migrations; Introducing Atlas Migrate Down +authors: a8m +tags: [down migrations, rollback, undo migrations, revert migrations, migrate down] +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +### TL;DR + +Ever since my first job as a junior engineer, the seniors on my team told me that whenever I make a schema change +I must write the corresponding "down migration", so it can be reverted at a later time if needed. But what if that advice, +while well-intentioned, deserves a second look? + +Today, I want to argue that contrary to popular belief, down migration files are actually a bad idea and should be actively avoided. + +In the final section, I'll introduce an alternative that may sound completely contradictory: the new `migrate down` command. I will explain the +thought process behind its creation and show examples of how to use it. 
+ +### Background + +Since the beginning of my career, I have worked in teams where, whenever it came to database migrations, +we were writing "down files" (ending with the `.down.sql` file extension). This was considered good practice and +an example of how a "well-organized project should be." + +Over the years, as my career shifted to focus mainly on infrastructure and database tooling +in large software projects (at companies like [Meta](https://meta.com)), I had the opportunity to question +this practice and the reasoning behind it. + +Down migrations were an odd thing. In my entire career, working on projects with thousands of down files, I never applied +them in a real environment. As simple as that: not even once. + +Furthermore, since we started Atlas and to this very day, we have interviewed countless software +engineers from virtually every industry. In all of these interviews, we have only met with _a single team_ that routinely applied down +files in production (and even they were not happy with how it worked). + +Why is that? Why is it that down files are so popular, yet so rarely used? Let's dive in. + +#### Down migrations are the naively optimistic plan for a grim and unexpected world + +Down migrations are supposed to be the "undo" counterpart of the "up" migration. Why do "undo" buttons exist? +Because mistakes happen, things fail, and then we want a way to quickly and safely revert them. +Database migrations are considered something we should do with caution; they are super risky! So, it makes sense +to have a plan for reverting them, right? + +But consider this: when we write a down file, we are essentially writing a script that will be executed in the future +to revert the changes we are about to make. This script is written _before_ the changes are applied, and it is based on +the assumption that the changes will be applied correctly. But what if they are not? + +When do we need to revert a migration? When it fails. 
But if it fails, it means that the database might be in an unknown state. +It is quite likely that the database is not in the state that the down file expects. For example, if the "up" migration was supposed to add +two columns, the down file would be written to remove these two columns. But what if the migration was partially applied and only one +column was added? Running the down file would fail, and we would be stuck in an unknown state. + +#### Rolling back additive changes is a destructive operation + +When you are working on a local database, without real traffic, having the up/down mechanism for migrations +might feel like hitting Undo and Redo in your favorite text editor. But in a real environment, it is not the case. + +If you successfully rolled out a migration that added a column to a table, and then decided to revert it, its inverse +operation (`DROP COLUMN`) does not merely remove the column. It deletes all the data in that column. Re-applying the +migration would not bring back the data, as it was lost when the column was dropped. + +For this reason, teams that want to temporarily deploy a previous version of the application usually do not revert the +database changes, because doing so will result in data loss for their users. Instead, they need to assess the situation +on the ground and figure out some other way to handle it. + +#### Down migrations are incompatible with modern deployment practices + +Many modern deployment practices like Continuous Delivery (CD) and GitOps advocate for the +software delivery process to be automated and repeatable. This means that the deployment process should be +deterministic and should not require manual intervention. A common way of doing this is to have a pipeline that +receives a commit, and then automatically deploys the build artifacts from that commit to the target environment. 
+ +As it is very rare to encounter a project with a 0% change failure rate, rolling back a deployment is a common scenario. + +In theory, rolling back a deployment should be as simple as deploying the previous version of the application. When +it comes to versions of our application code, this works perfectly. We pull the container image that corresponds to +the previous version, and we deploy it. + +But what about the database? When we pull artifacts from a previous version, they do not contain the down files that +are needed to revert the database changes back to the necessary schema - they were only created in a future commit! + +For this reason, rollbacks to versions that require reverting database changes are usually done manually, going against +the efforts to automate the deployment process by modern deployment practices. + +### How do teams work around this? + +In previous companies I worked for, we faced the same challenges. The tools we used to manage our database migrations +advocated for down migrations, but we never used them. Instead, we had to develop some practices to support +a safe and automated way of deploying database changes. Here are some of the practices we used: + +#### Migration Rollbacks + +When we worked with PostgreSQL, we always tried to make migrations transactional and made sure to isolate the DDLs that prevent it, +like `CREATE INDEX CONCURRENTLY`, to separate migrations. In case the deployment failed, for instance, due to a +[data-dependent](/lint/analyzers#data-dependent-changes) change, the entire migration was rolled back, and the application +was not promoted to the next version. By doing this, we avoided the need to run down migrations, as the database was left in +the same state as it was before the deployment. + +#### Non-transactional DDLs + +When we worked with MySQL, which I really like as a database but hate when it comes to migrations, it was challenging. 
+Since MySQL [does not support](https://dev.mysql.com/doc/refman/8.3/en/implicit-commit.html) transactional DDLs, failures +were more complex to handle. When a migration contained more than one DDL statement and unexpectedly failed in the middle, +because of a constraint violation or another error, we were stuck in an intermediate state that couldn't be +automatically reverted by applying a "revert file". + +Most of the time, it required special handling and expertise in the data and product. We mainly preferred fixing +the data and moving forward rather than dropping or altering the changes that were applied - which was also impossible if the +migration introduced [destructive changes](/lint/analyzers#destructive-changes) (e.g., `DROP` commands). + +#### Making changes Backwards Compatible + +A common practice in schema migrations is to make them backwards compatible (BC). We stuck to this approach, and also made +it the default behavior in [Ent](https://github.com/ent/ent). When schema changes are BC, applying them before starting a +deployment should not affect older instances of the app, and they should continue to work without any issues (in rolling deployments, +there is a period where two versions of the app are running at the same time). + +When there is a need to revert a deployment, the previous version of the app remains fully functional without any issues - +if you are an Ent user, this is one of the reasons we avoid `SELECT *` in Ent. Using `SELECT *` can break BC even +for additive changes, like adding a new column, as the application expects to retrieve N columns but unexpectedly receives N+1. + +### Deciding Atlas would not support down migrations + +When we started Atlas, we had the opportunity to design a new tool from scratch. 
Seeing as "down files" never helped us +solve failures in production, from the very beginning of Atlas, Rotem and I agreed that down files should not be generated - except for +cases where users use Atlas to generate migrations for other tools that expect these files, such as Flyway or golang-migrate. + +#### Listening to community feedback + +Immediately after Atlas' initial release some two years ago, we started receiving feedback from the community +that called this decision into question. The main questions were: _"Why doesn't Atlas support down migrations?"_ and _"How do I +revert local changes?"_. + +Whenever the opportunity came to engage in such discussions, we eagerly participated and even pursued verbal discussions +to better understand the use cases. The feedback and the motivation behind these questions were mainly: + +1. It is challenging to experiment with local changes without some way to revert them. +2. There is a need to reset dev, staging or test-like environments to a specific schema version. + +#### Declarative Roll-forward + +Considering this feedback and the use cases, we went back to the drawing board. We came up with an approach that was +primarily about improving developer ergonomics and was in line with the declarative approach that we were advocating +for with Atlas. We named this approach "declarative roll-forward". + +Although it was not a "down migration" in the traditional sense, it helped to revert applied migrations in an +automated way. The concept is based on a three-step process: + +1. 
Use `atlas schema apply` to plan a declarative migration, using a target revision as the desired state: + + ```shell + atlas schema apply \ + --url "mysql://root:pass@localhost:3306/example" \ + --to "file://migrations?version=20220925094437" \ + --dev-url "docker://mysql/8/example" \ + --exclude "atlas_schema_revisions" + ``` + This step requires excluding the `atlas_schema_revisions` table, which tracks the applied migrations, to avoid + deleting it when reverting the schema. + +2. Review the generated plan and apply it to the database. + +3. Use the `atlas migrate set` command to update the revisions table to the desired version: + + ```shell + atlas migrate set 20220925094437 \ + --url "mysql://root:pass@localhost:3306/example" \ + --dir "file://migrations" + ``` + +This worked for the defined use cases. However, we felt that our workaround was a bit clunky as it required a +three-step process to achieve the result. We agreed to revisit this decision in the future. + +### Revisiting the down migrations + +In recent months, the question of down migrations was raised again by a few of our customers, and we dove into it again with them. +I always try to approach these discussions with an open mind, and listen to the different points of view and use cases that I personally +haven't encountered before. + +Our discussions highlighted the need for a more elegant and automated way to perform deployment rollbacks in remote environments. The solution +should address situations where applied migrations need to be reverted, regardless of their success, failure, or partial application, +which could leave the database in an unknown state. + +The solution needs to be **automated**, **correct**, and **reviewable**, as it could involve data deletion. The solution can't be +the "down files", because although their generation can be automated by Atlas and reviewed in the PR stage, they cannot guarantee +correctness when applied to the database at runtime. 
+ +After weeks of design and experimentation, we introduced a new command to Atlas named `migrate down`. + +### Introducing: `migrate down` + +The `atlas migrate down` command allows reverting applied migrations. Unlike the traditional approach, where down files +are "pre-planned", Atlas computes a migration plan based on the current state of the database. Atlas reverts previously +applied migrations and executes them until the desired version is reached, regardless of the state of the latest applied +migration — whether it succeeded, failed, or was partially applied and left the database in an unknown version. + +By default, Atlas generates and executes a set of [pre-migration checks](/versioned/checks) to ensure the computed plan +does not introduce data deletion. Users can review the plan and execute the checks before the plan is applied to the +database by using the `--dry-run` flag or the Cloud as described below. Let's see it in action on local databases: + +#### Reverting locally applied migrations + +
+ +Assume a migration file named `20240305171146.sql` was last applied to the database and needs to be +reverted. Before deleting it, run the `atlas migrate down` command to revert the last applied migration: + +
+ +