Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation Plan: Investigate the use of alembic for openledger migrations #1836

Open
AetherUnbound opened this issue Oct 8, 2021 · 9 comments
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@AetherUnbound
Copy link
Collaborator

AetherUnbound commented Oct 8, 2021

Current Situation

We have the DDL files for the openledger database defined in a series of SQL files in the local_postgres Docker image folder.

Suggested Improvement

For future migrations, it might be useful to have the database managed by a system like alembic1 for tracking migrations.

We will want to alter the entrypoint for the catalog to run these migrations alongside the migrations for Airflow. We will also want a DAG which can run the migrations, so the production database can be migrated without needing to stop all running DAGs and deploy the catalog.

Benefit

A migration manager could help us determine which migrations need to be applied to the database and what its current state is. It can also help us roll back should we need to revert to a previous change.

Alternatives

Alembic is the only tool I've used in the past (besides Django's migration tooling, which isn't really applicable here). There also appears to be another tool called yoyo-migrations

Additional context

See #2346 for further information

Footnotes

  1. https://alembic.sqlalchemy.org/en/latest/

@AetherUnbound AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Oct 8, 2021
@zackkrida
Copy link
Member

zackkrida commented Nov 2, 2021

Related, I noticed that a few DAGs, like:

  • openverse_catalog/dags/database/recreate_image_popularity_calculation.py
  • openverse_catalog/dags/database/recreate_audio_popularity_calculation.py

aren't meant to be scheduled, but rather run manually after sql changes have been made to related tables. Might it make sense for these to be archived and their functionality replaced with some helper code that could be run inside/alongside of migrations?

@AetherUnbound
Copy link
Collaborator Author

The idea of a "migration DAG" is feeling more and more reasonable... 🤔

@dhruvkb
Copy link
Member

dhruvkb commented Nov 2, 2021

Alembic is good. It's also a part of the ingestion server packages on the API side so the added consistency is another plus.

@AetherUnbound
Copy link
Collaborator Author

We might be able to leverage this project for our functions/views: https://github.com/olirice/alembic_utils

@zackkrida
Copy link
Member

Nice!

@krysal krysal mentioned this issue Oct 7, 2022
29 tasks
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 19, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 19, 2023
@sarayourfriend
Copy link
Collaborator

The idea of a "migration DAG" is feeling more and more reasonable...

Can you clarify what "migration DAG" would entail? Would it just be a DAG to run alembic migrations when new ones exist? If so, that sounds good. If the idea would be to implement a migration system based on SQL DDL files, I would strongly suggest against it for the following reasons:

  1. We would need to create and manage our own "migrations table" to maintain migrations state
  2. It would make the local API and ingestion server environments dependent on Airflow for migrations, rather than being able to run Alembic independently to apply the migrations
  3. We would need to implement our own conventions and approach to up/down migrations

If the DAG merely runs Alembic periodically or on command, that sounds fine and seems like a great way to solidify the production implementation and shield it from needing direct user input. However, the local environment also requires the same schema for applications that can currently run independent of the catalog. Unless we're willing to break that independence (feels like an important conversation we should have elsewhere before making that decision) we shouldn't rely exclusively on a new DAG or set of DAGs to manage migrations.

I'd also like to suggest a higher prioritisation of this issue based on the reasons given in #2364. Especially we embark on the data normalisation and other projects that may change the schema in any way, we need a more secure and consistent way to apply this that simultaneously doesn't involve running potentially destructive queries by hand in production (even copy/pasted ones) and that solves the local environment issue.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents and removed 🟨 priority: medium Not blocking but should be addressed soon labels Jun 12, 2023
@AetherUnbound
Copy link
Collaborator Author

Yes, I was thinking of a DAG which would run whichever migration system we end up using. I similarly want to avoid a home-brewed migration system! Similar to Airflow, we may also add the migration step to the entrypoint for the catalog service. That would require deployments of the catalog in order to apply those changes, but it would make certain that the local environment is always up-to-date. I think paired with a DAG for executing the same migrations in production without requiring a restart, that would cover all our needs.

@sarayourfriend
Copy link
Collaborator

That would require deployments of the catalog in order to apply those changes

That seems like a fine way to manage the migrations anyway. However, it would force us to follow zero-downtime principles more closely, which I don't think we think of for the catalog db at the moment.

But that sounds good.

Do you think this issue should be an implementation plan, @AetherUnbound? I think so, but if you think it's straightforward enough maybe just updating the issue with specific instructions to use Alembic, codify the existing DDL, configure the local environment to run them in the entrypoint of the upstream_db, and create the DAG? Anything else that would warrant a more thoughtful plan?

@AetherUnbound
Copy link
Collaborator Author

That seems like a fine way to manage the migrations anyway. However, it would force us to follow zero-downtime principles more closely, which I don't think we think of for the catalog db at the moment.

You're right, we're a ways away from a zero-downtime deployment on the catalog. However, the catalog database is one that Airflow interacts with, not necessarily one that it depends on the same way that the API depends on its database. I don't think it should be necessary to deploy the service when the dag-sync script we have deploys updates to the DAG files (which might use that database) every minute. Certainly migrating on deploy would also be necessary, but we can trigger migrations without having to stop all running DAGs.

Do you think this issue should be an implementation plan, @AetherUnbound? I think so, but if you think it's straightforward enough maybe just updating the issue with specific instructions to use Alembic, codify the existing DDL, configure the local environment to run them in the entrypoint of the upstream_db, and create the DAG? Anything else that would warrant a more thoughtful plan?

I think an implementation plan would be a great idea for this, especially to get feedback from folks and make sure we haven't missed anything! I'll convert the issue 🙂

@AetherUnbound AetherUnbound changed the title [Infrastructure] Investigate the use of alembic for openledger migrations Implementation Plan: Investigate the use of alembic for openledger migrations Jun 15, 2023
@AetherUnbound AetherUnbound added the 🧭 project: implementation plan An implementation plan for a project label Jun 15, 2023
@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Status: 📋 Backlog
Development

No branches or pull requests

6 participants