Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sequential numbering with postgres or redshift schemas #63

Closed
rkulla opened this issue Jul 21, 2017 · 28 comments
Closed

sequential numbering with postgres or redshift schemas #63

rkulla opened this issue Jul 21, 2017 · 28 comments

Comments

@rkulla
Copy link

rkulla commented Jul 21, 2017

I'm just started using this fork because it has support for Redshift. However, I use schemas to organize tables with databases like redshift and postgres but the new sequential numbering of files, instead of timestamps, means that I have to put all my migration files in the same directory, otherwise the counter will start again at 00001 for each directory.

Say I have a single database in redshift or postgres, and it has multiple schemas CREATE SCHEMA foo; and CREATE SCHEMA bar;. If I wanted to organize my goose migrations by folders, I might have:

foo/
bar/

If i then ran cd foo && goose create add_cars sql then cd ../bar && goose_create add_trains sql, goose will create foo/00001_add_cars.sql and bar/00001_add_trains.sql. So running:

$ cd foo
$ goose postgres "user=rkulla dbname=postgres sslmode=disable" up
goose: no migrations to run. current version: 1
$ cd ../bar
$ goose postgres "user=rkulla dbname=postgres sslmode=disable" up
goose: no migrations to run. current version: 1

would only apply the migration under foo/ but not the one under bar/, because both migrations start with 00001_ and postgres or redshift only get ONE goose_db_version table to share amongst the different schema names. This makes them harder to organize unless I do goose create add_cars_foo sql' and goose create add_trains_bar sql, then run commands like ls *_foo.sqlandls *_bar.sql`.

It's not that big of a deal right now I guess, but I'm wondering if there's a better way and if it's really worth not using timestamps. IIRC, Ruby on Rails's migration feature used to use sequential numbers but developers complained because it caused a lot of conflicts when 2 different developers working on the same project made separate changes but both generated a migration with the same number. So Rails switched to UTC timestamps.

@duffn
Copy link
Contributor

duffn commented Jul 21, 2017

I also prefer the timestamp naming convention. I'd suggest that both are offered in goose. If you prefer sequential, then default to that, but offer UTC timestamps with create NAME sql ts perhaps.

@VojtechVitek
Copy link
Collaborator

VojtechVitek commented Jul 21, 2017

@rkulla sounds like a bug. Goose binary should see the subdirectories and give you latest_version_number++ properly. Anyone, who's using sub-dirs too and wants to solve this?


I'm strongly against Timestamp-based versioning, since it causes much worse unpredictable schema conflicts than sequential versioning. See discussion in #27.

In short:

Given two developers work on different git branches and they both create migrations that change the schema in "account" table. Let's say they both rename a column, change a default value and change an index constraint, for example.

Sequential versions

  1. First dev merges his PR to master. Travis passes.
  2. Second dev's Pull Request's Travis job will start blowing up on a goose migration name conflict (given the Travis builds off the PR branch merged into master using the "Build branch updates" Travis feature). The job will fail on conflicting version 0005:

0005_first_developer_account_changes.sql
0006_first_developer_some_more_changes.sql

0005_second_developer_different_account_changes.sql

Thus, the second dev will need to resolve the conflict (rename the migration) and will probably rebase against master to update his PR.

Timestamp-based versions

2017...08_first_developer_account_changes.sql
2017...15_first_developer_some_more_changes.sql
^ gets merged and deployed first

2017...03_second_developer_different_account_changes.sql
^ gets merged second ............. and will be ignored (because its version name is not higher than the latest applied version)

** Note that this is how goose work by design from its very beginnings in https://bitbucket.org/liamstask/goose/ repo.


As you can see, the sequential versions provide more safety during the development cycle. (Even though they might seem annoying and unnecessary from the beginning.)

But more importantly, they provide predictable order of schema migrations.

The predictable order has more benefits, though:

  1. You can start storing the initial (pre-migration) and final (after-migration) schemas as golden files in your git repository. Very handy for PR reviews! With timestamp-based versions, you'd have to resolve git conflicts in these files anyway.
  2. Also, you can configure your CI (Travis) to test goose up && goose reset && goose up as part of your test suite. From that point, you'll always make sure that your database schema is in a good shape, even when you need to roll-back for some reason. With timestamp-based versions, some migrations could be silently ignored .. and this test would not blow up.

Hope this makes sense.

I'm not saying sequential versioning is perfect - but it has some real benefits over timestamp-based versions.

I'd prefer if we thought about some tooling around conflicts (ie. bump version of all files that were not applied yet to resolve conflicts or something), rather than falling back to timestamps.

@rkulla
Copy link
Author

rkulla commented Jul 21, 2017

@VojtechVitek I see. Thanks for the quick response.

I'm not particularly adamant about timestamps if the sequential versioning would work with sub-dirs. I do have some older timestamp based migrations that were made with liamstask/goose that would now be a mix of timestamps and sequential numbers. Do you think that would pose a problem if i don't rework them as sequential numbers, for people "migrating" from the old goose to this fork?

@VojtechVitek
Copy link
Collaborator

@rkulla it will still work. But goose create from this repo won't give you timestamp by default, though.

If you want to switch to sequential versions, you can always bump the latest migration to something like 10000000001 (something higher than latest timestamp) and go from there.

@pawelma
Copy link

pawelma commented Aug 22, 2017

@VojtechVitek IMO timestamp based migrations are a lot better than sequential ones. Major issue with sequential ones are collisions after merge. Why can't you check all timestamp versions against goose_db_version? In #27 someone provided argument with initial migration as example but in real world this case is minor and conflicts after merge are real problem. Logic and storage schemat should be discussed between developers, not migration numbers.

@VojtechVitek
Copy link
Collaborator

@pawelma please read my comment at #63 (comment).

Timestamp based approach doesn't guarantee ordering, in which the migrations get run. I think of DB schema migrations as of git tree with well defined history. Git history is based on commit hashes rather than on timestamps for the very same reason.

Problem is not when you merge two branches together, but in which order you run the DB migrations from these branches. Version-based naming prevents any conflicts, as it errors out on version collisions. I totally understand how annoying is to rename files manually. I don't like it either. But we don't have anything better right now. I wish we could just pick up the order from git history one day. Or automate the renaming somehow.

I'll give you one more example with timestamp-based approach:
Imagine two developers working on their branches with 5 migrations each and merging their work to master branch at the same time. After the merge, the master branch will have 10 new timestamp-based migrations that don't have order defined. If these developers changed schema of some related DB tables, you'll get schema errors (ie. first developer renamed column or added new index but the other one just wanted to enforce NOT NULL or remove a foreign key). This is something that version-based approach avoids.

@pawelma
Copy link

pawelma commented Aug 22, 2017

@VojtechVitek I read it earlier and it didn't convince me :)

We can use any prefix which provides uniqueness and some sort of order. The order should matter but should not take over implementation. The order should matter in a longer period but in short time it shouldn't be forced. As you mentioned git model is a tree, history there can be linear but by default, it isn't and developers take care to resolve conflicts or make history linear.

I agree with the problem described in your example but sequential versioning doesn't solve it. Imagin same issue with sequential versioning. Developer tries to merge his changes into master, CI tests are failing because there is another migration with the same number - so developer fixes the number and I can bet that in 90% (or even more) cases he doesn't check the code. If there is no restriction for 'merge only when tests pass' he may merge broken code. Let's assume that there is such restriction or he waits for another test execution - the result is red again because the original error was hidden by migration framework (instead of 1 debugging step he had to go through 2). What I want to say: You are not able to protect developers from mistakes but you can make their lives easier. Good practices like well-written tests, working CI, etc... should solve such problems.

@VojtechVitek
Copy link
Collaborator

but by default, it isn't and developers take care to resolve conflicts or make history linear.

Yea, it would be nice if we could get the version number automatically somehow, like in git. Technically, goose binary could look into git repository and recognize number of commits since the initial commit for each migration merged into master. git rev-list --count $(git rev-list --max-parents=0 HEAD | tail -n 1).. or something like that. But I don't really have time to explore this further at this moment, it's not that easy.

Sequential versioning enforces consistency of the DB schema on each environment. It's a stability decision. You probably want to have the same DB state on your localhost, in your CI tests, and then on staging/production.

You are not able to protect developers from mistakes but you can make their lives easier.

I can rely on CI tests that enforce consistent and predictable DB schema, thus protecting developers and production devops from issues on production. That's imho more important than forcing them to rename bunch of files.

Yes, it is annoying. But how can we do it better?

Unfortunately, timestamps are not the answer.


However, if you still prefer timestamp-based approach, you can simply edit this line: https://github.com/pressly/goose/blob/master/create.go#L22 and build your custom goose binary.
Custom binaries are easy to set up and allow you to write Go migrations: See https://github.com/pressly/goose/tree/master/examples/go-migrations

@toudi
Copy link

toudi commented Feb 18, 2018

could a hinting solution be concidered? Django uses sequential number migrations, however it also allows one to specify dependencies. Therefore by adding one line:

-- +goose depends-on 0004_foo

-- +goose Up
....
-- +goose Down
....

goose could know how to arange the history.

@croaker8
Copy link

I am running into serious problems when we have development going on on multiple branches. Scripts get created with the same sequence number on multiple branches. Then there is a conflict after a merge and deployment fails. And even worse if a different branch gets deployed with different numbering the state is hard to decipher. Not sure we will continue using goose.

@mvrhov
Copy link

mvrhov commented Jul 25, 2018

@croaker8 : May I suggest that you add a timestamp into the sql/go file automatically so you can resolve the sequence problem. Even though you would have the time based sequences there is no guarantee that the migrations would be in order. And you would still run into the same problems e.g manually solving the order of them except there would be no merge conflicts.

Now on the other hand what @toudi suggested would solve this problem partially as the migrations would still be numbered out of the order and there would still be conflicts with merge however they might not be run out of order.

@eaglemoor
Copy link

eaglemoor commented Sep 27, 2018

We can use full name of file as version. It's give more information witch revision you apply. And the risk to create the same filename very low. And we don't need use timestamp in filename.

About prefix for sort. It's not a problem, we will have key '0001_create_user', '0001_create_fiends', some people can't create depending migration without parent migration.
If we have filename /migration/mike/0001_create_user and /migration/alex/0001_create_news all migrations will be apply, because they have different key

@eaglemoor
Copy link

And don't forget. All migration and branch, before release, merge in release branch and apply on staging step by step. If developers catch error on this step, they will roll back and try resolve problem localy.

And goose have well api goose status who show sorted migration list.

I think that goose need give to people simple tools for create migration and show clear logic how it work. And don't need resolved 0.0001% cases, which people need resolve by hand, because they wrong use goose and git

@eaglemoor
Copy link

eaglemoor commented Sep 27, 2018

Or, if you don't want destroy logic this number prefix, you need panic (exit 1), if you see 2 version 0003_.... with different filename. Then people will see this and will change filename.

In this moment, you just ignore different migration this different people with different filename, but with 0003_. prefix

@toudi
Copy link

toudi commented Sep 29, 2018

Hi guys. I created a proof of concept with migration dependencies support and revisions (where the revision can be anything) and two types of inputs - SQL and fizz. Could this be a possible solution to this problem? I'm quite new to golang so I'd love your insight. There's one hardcoded path in there, but I just wanted to know what you'd think about the API

https://github.com/toudi/gorpheus

@eaglemoor eaglemoor mentioned this issue Oct 10, 2018
@VojtechVitek
Copy link
Collaborator

We came up with an interesting approach internally in Pressly with @1vn & @diogogmt :

  1. Use timestamps in development branch, to avoid annoying file renaming & conflicts on rebase
  2. Rename those timestamps-based filenames to serial numbers on git merge via simple merge bot to enforce order of the migrations and thus protect the state of DB schema on all environments

We're working on a proof on concept right now, and we might merge it back here, if there's enough interest. It's a tool that can be either run manually on master branch, or automatically in the CI/CD pipeline or via a "merge bot".

@eaglemoor
Copy link

And what about CVS, Subversion, Bazaar and Mercurial ?)

@eaglemoor
Copy link

I think timestamp (or version number) + filename will easy good solution problem. And u don't need (and don't) use some dependencies of other tools. No ?

@VojtechVitek
Copy link
Collaborator

@eaglemoor The solution we're thinking of doesn't depend on git exclusively. You can run the same thing against mercurial/CVS repo in your central "production" branch.

@VojtechVitek
Copy link
Collaborator

The timestamp migration order problem explained:

  1. Let's assume we have two developers working on fairly big development branches:

joe branch:

migrations/2018-09-15-12:00:00_joe_1.sql
migrations/2018-10-10-12:00:00_joe_2.sql

alice branch:

migrations/2018-10-01-12:00:00_alice_1.sql
migrations/2018-10-18-12:00:00_alice_2.sql
  1. Joe's branch get merged into master branch first:

master branch:

+migrations/2018-09-15-12:00:00_joe_1.sql
+migrations/2018-10-10-12:00:00_joe_2.sql
  1. The master branch gets deployed to STAGING environment:
$ goose up
$ goose: migrating db environment 'staging', current version: 0, target: 2
$ OK    2018-09-15-12:00:00_joe_1.sql
$ OK    2018-10-10-12:00:00_joe_2.sql
  1. Alice's branch gets merged into master branch right after:

master branch:

 migrations/2018-09-15-12:00:00_joe_1.sql
+migrations/2018-10-01-12:00:00_alice_1.sql
 migrations/2018-10-10-12:00:00_joe_2.sql
+migrations/2018-10-18-12:00:00_alice_2.sql

^ Note the order of these migrations. Goose orders migration files alphabetically.

  1. The QA reports about a regression on STAGING, a hotfix must be provided before going to PRODUCTION

  2. A hotfix is provided and merged into master

  3. The master branch gets deployed to STAGING environment:

$ goose up
$ goose: migrating db environment 'staging', current version: 0, target: 0
$ OK    2018-10-02-12:00:00_alice_2.sql
  1. QA team reports that STAGING looks good, we can go to production!

  2. Hah, no one noticed that 2018-10-01-12:00:00_alice_1.sql migration was not applied on STAGING environment!!

  3. The master branch gets deployed to PRODUCTION environment:

$ goose up
$ goose: migrating db environment 'staging', current version: 0, target: 2
$ OK    2018-09-15-12:00:00_joe_1.sql
$ OK    2018-10-01-12:00:00_alice_1.sql
$ OK    2018-10-02-12:00:00_joe_1.sql
$ OK    2018-10-18-12:00:00_alice_2.sql
  1. The release engineering reports a NOW bug. The application is down after the PRODUCTION deployment.

  2. After hours of debugging, someone notices a serious bug in 2018-10-01-12:00:00_alice_1.sql that was never executed and tested on STAGING. Oooops.


Now, the above story can be even worse, if you have multiple production environments, as we do in Pressly. We have a separate databases in Germany, Canada, US etc. because of data residency regulations. It's impossible to keep all the environments in sync, if we don't use sequential ordering.

The migration order mismatch can cause both

  • schema mismatches (ie. unexpected missing column, index, or a table .. you name it)
  • data inconsistencies (ie. UPDATE users SET deleted_at=NOW() WHERE id IN (23, 93, 442) that was never run)

@VojtechVitek
Copy link
Collaborator

Fixed in #120.

@VojtechVitek
Copy link
Collaborator

VojtechVitek commented Nov 6, 2018

Released as v2.4.0.

@mvrhov
Copy link

mvrhov commented Nov 7, 2018

IMO this would be better solved with migrations depending on another migration.

@VojtechVitek
Copy link
Collaborator

VojtechVitek commented Nov 7, 2018

@mvrhov I disagree. Dependencies would be much more complex to implement & and also harder to maintain from the user perspective.

Dependencies between migrations would also lock you in, since you wouldn't be able to remove any older migration files that are referenced from the newer migrations. You'd have to think about all of that when maintaining the code.

In Pressly, we remove old migrations from our codebase every couple months and leave only the latest ~5 migrations. Regular clean-up keeps our codebase small and clean.

@mvrhov
Copy link

mvrhov commented Nov 7, 2018

Well removing older migrations might not be an option. e.g. It depend on how many copies of your product do you have. Also you need older migrations for someone to set up the dev environment. Unless you are scrubbing sensitive data from database an your devs are then using almost a production base.. But If you run large saas, then this is not an option.

@VojtechVitek
Copy link
Collaborator

Well removing older migrations might not be an option. e.g. It depend on how many copies of your product do you have.
But If you run large saas, then this is not an option.

We run a large SAAS. We have five productions in five different AWS regions and millions of users in each copy of the product. And yes, we still do remove old migrations. Out of ~600 migrations, we have about 50 in our codebase right now.

Also you need older migrations for someone to set up the dev environment.

Let me explain what we do in order to remove the old migrations in Pressly:

We maintain two DB schema files in our codebase that are the "source" of truth for the initial state of the DB:

  1. schema.sql
    The latest schema of the application, which is used in all application tests & in our CI/CD pipeline. This file gets updated regularly in Pull Requests that have migrations changing the schema.

  2. schema-pre-migration.sql
    An older schema, which represents the state of the DB before running any migrations currently present in the codebase. If you import schema-pre-migration.sql into a DB and run goose up, you should end up with the identical state to schema.sql file. We actually run this check as a "DB migration test" in our CI/CD pipeline and run a diff between the current schema and the schema.sql file to ensure consistency and to catch any discrepancies.

    When we remove old migration files, we have to update the schema-pre-migration.sql, of course. We pretty much "freeze" the schema.sql to the certain point of time.

Let me know if you have any questions. This could be probably worth a blog post :)

@eeshugerman
Copy link

eeshugerman commented Apr 2, 2019

@VojtechVitek

  1. The QA reports about a regression on STAGING, a hotfix must be provided before going to PRODUCTION
  2. A hotfix is provided and merged into master

What is the significance of points 5 and 6? Wouldn't the "versioning problem" still be present without this bit? Just want to make sure I understand this correctly as we're finally getting around to updating our CI/CD to use hybrid versioning.

@eliaperantoni
Copy link

@VojtechVitek Is the fix subcommand intended to be used in a development environment? Say me and a cowerker are sharing a develop branch in which we merge our changes .Would running goose fix after each merge in this branch be bad? So far I can only imagine it beeing beneficial for keeping a consistent and predictable order right from develop, even before mergin in master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants