Implementation Plan: Investigate the use of alembic for openledger migrations #1836
Comments
Related, I noticed that a few DAGs aren't meant to be scheduled, but rather run manually after SQL changes have been made to related tables. Might it make sense for these to be archived and their functionality replaced with some helper code that could be run inside/alongside the migrations?
The idea of a "migration DAG" is feeling more and more reasonable... 🤔
Alembic is good. It's also part of the ingestion server packages on the API side, so the added consistency is another plus.
We might be able to leverage this project for our functions/views: https://github.com/olirice/alembic_utils
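For illustration, here is a minimal sketch of how that library registers an entity with Alembic's autogenerate support, based on its README; the view name and definition below are hypothetical stand-ins for our actual functions/views:

```python
from alembic_utils.pg_view import PGView
from alembic_utils.replaceable_entity import register_entities

# Hypothetical example view; any real openledger view/function would work.
image_popularity = PGView(
    schema="public",
    signature="image_popularity",
    definition="SELECT identifier, standardized_popularity FROM image",
)

# Called from Alembic's env.py so that `alembic revision --autogenerate`
# detects changes to the view's definition.
register_entities([image_popularity])
```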
Nice!
Can you clarify what a "migration DAG" would entail? Would it just be a DAG to run Alembic migrations when new ones exist? If so, that sounds good. If the idea is to implement a migration system based on SQL DDL files, I would strongly suggest against it for a number of reasons.
If the DAG merely runs Alembic periodically or on command, that sounds fine and seems like a great way to solidify the production implementation and shield it from needing direct user input. However, the local environment also requires the same schema for applications that can currently run independently of the catalog. Unless we're willing to break that independence (which feels like an important conversation we should have elsewhere before making that decision), we shouldn't rely exclusively on a new DAG or set of DAGs to manage migrations. I'd also like to suggest a higher prioritisation of this issue based on the reasons given in #2364. Especially as we embark on the data normalisation and other projects that may change the schema in any way, we need a more secure and consistent way to apply those changes, one that doesn't involve running potentially destructive queries by hand in production (even copy/pasted ones) and that solves the local environment issue.
Yes, I was thinking of a DAG which would run whichever migration system we end up using. I similarly want to avoid a home-brewed migration system! Similar to Airflow, we could also add the migration step to the entrypoint for the catalog service. That would require deploying the catalog in order to apply changes, but it would ensure that the local environment is always up to date. Paired with a DAG for executing the same migrations in production without requiring a restart, I think that would cover all our needs.
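To make that concrete, here is a minimal sketch of what such a DAG could look like, assuming we settle on Alembic; the DAG id, schedule settings, and the assumption that an `alembic.ini` pointing at the openledger database is available in the container are all hypothetical:

```python
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="run_openledger_migrations",  # hypothetical name
    schedule=None,  # run only when triggered manually
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Apply any pending migrations; assumes alembic.ini is in the
    # working directory and points at the openledger database.
    BashOperator(
        task_id="alembic_upgrade",
        bash_command="alembic upgrade head",
    )
```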
That seems like a fine way to manage the migrations anyway. However, it would force us to follow zero-downtime principles more closely, which we don't currently apply to the catalog database. But that sounds good. Do you think this issue should be an implementation plan, @AetherUnbound? I think so, but if you think it's straightforward enough, maybe it would suffice to update the issue with specific instructions to use Alembic, codify the existing DDL, and configure the local environment to run the migrations in the entrypoint of the catalog service.
You're right, we're a ways away from a zero-downtime deployment on the catalog. However, the catalog database is one that Airflow interacts with, not necessarily one that it depends on the same way that the API depends on its database. I don't think it should be necessary to deploy the service when the dag-sync script we have deploys updates to the DAG files (which might use that database) every minute. Certainly migrating on deploy would also be necessary, but we can trigger migrations without having to stop all running DAGs.
I think an implementation plan would be a great idea for this, especially to get feedback from folks and make sure we haven't missed anything! I'll convert the issue 🙂
Current Situation
We have the DDL files for the openledger database defined in a series of SQL files in the `local_postgres` Docker image folder.

Suggested Improvement
For future migrations, it might be useful to have the database managed by a system like `alembic`¹ for tracking migrations.

We will want to alter the entrypoint for the catalog to run these migrations alongside the migrations for Airflow. We will also want a DAG which can run the migrations, so the production database can be migrated without needing to stop all running DAGs and deploy the catalog.
Benefit
A migration manager could help us determine which migrations need to be applied to the database and what its current state is. It could also help us roll back should we need to revert a change.
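For example, with Alembic these operations are all one-liners via its Python API (equivalent `alembic` CLI commands exist); `cfg` here is the same placeholder config as in the sketch above:

```python
from alembic import command
from alembic.config import Config

cfg = Config("alembic.ini")   # placeholder path

command.current(cfg)          # report which revision the database is at
command.history(cfg)          # list all known revisions
command.downgrade(cfg, "-1")  # roll back the most recent migration
```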
Alternatives
Alembic is the only tool I've used in the past (besides Django's migration tooling, which isn't really applicable here). There also appears to be another tool called `yoyo-migrations`.
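For comparison, a yoyo-migrations migration file is a list of raw SQL steps with explicit rollbacks, per its documentation; the table and column below are hypothetical:

```python
from yoyo import step

steps = [
    step(
        "ALTER TABLE image ADD COLUMN example_col text",  # apply
        "ALTER TABLE image DROP COLUMN example_col",      # rollback
    )
]
```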
Additional context
See #2346 for further information.
Footnotes
¹ https://alembic.sqlalchemy.org/en/latest/