Implementation Plan: Investigate the use of alembic for openledger migrations #1836

AetherUnbound · 2021-10-08T16:24:25Z

Current Situation

We have the DDL files for the openledger database defined in a series of SQL files in the local_postgres Docker image folder.

Suggested Improvement

For future migrations, it might be useful to have the database managed by a system like alembic¹ for tracking migrations.

We will want to alter the entrypoint for the catalog to run these migrations alongside the migrations for Airflow. We will also want a DAG which can run the migrations, so the production database can be migrated without needing to stop all running DAGs and deploy the catalog.

Benefit

A migration manager could help us determine which migrations need to be applied to the database and what its current state is. It can also help us roll back should we need to revert to a previous change.

Alternatives

Alembic is the only tool I've used in the past (besides Django's migration tooling, which isn't really applicable here). There also appears to be another tool called yoyo-migrations

Additional context

See #2346 for further information

https://alembic.sqlalchemy.org/en/latest/ ↩

The text was updated successfully, but these errors were encountered:

zackkrida · 2021-11-02T00:04:24Z

Related, I noticed that a few DAGs, like:

openverse_catalog/dags/database/recreate_image_popularity_calculation.py
openverse_catalog/dags/database/recreate_audio_popularity_calculation.py

aren't meant to be scheduled, but rather run manually after sql changes have been made to related tables. Might it make sense for these to be archived and their functionality replaced with some helper code that could be run inside/alongside of migrations?

AetherUnbound · 2021-11-02T00:29:14Z

The idea of a "migration DAG" is feeling more and more reasonable... 🤔

dhruvkb · 2021-11-02T07:21:22Z

Alembic is good. It's also a part of the ingestion server packages on the API side so the added consistency is another plus.

AetherUnbound · 2021-11-11T00:50:34Z

We might be able to leverage this project for our functions/views: https://github.com/olirice/alembic_utils

zackkrida · 2021-11-11T01:48:48Z

Nice!

sarayourfriend · 2023-06-12T23:57:00Z

The idea of a "migration DAG" is feeling more and more reasonable...

Can you clarify what "migration DAG" would entail? Would it just be a DAG to run alembic migrations when new ones exist? If so, that sounds good. If the idea would be to implement a migration system based on SQL DDL files, I would strongly suggest against it for the following reasons:

We would need to create and manage our own "migrations table" to maintain migrations state
It would make the local API and ingestion server environments dependent on Airflow for migrations, rather than being able to run Alembic independently to apply the migrations
We would need to implement our own conventions and approach to up/down migrations

If the DAG merely runs Alembic periodically or on command, that sounds fine and seems like a great way to solidify the production implementation and shield it from needing direct user input. However, the local environment also requires the same schema for applications that can currently run independent of the catalog. Unless we're willing to break that independence (feels like an important conversation we should have elsewhere before making that decision) we shouldn't rely exclusively on a new DAG or set of DAGs to manage migrations.

I'd also like to suggest a higher prioritisation of this issue based on the reasons given in #2364. Especially we embark on the data normalisation and other projects that may change the schema in any way, we need a more secure and consistent way to apply this that simultaneously doesn't involve running potentially destructive queries by hand in production (even copy/pasted ones) and that solves the local environment issue.

AetherUnbound · 2023-06-14T21:16:50Z

Yes, I was thinking of a DAG which would run whichever migration system we end up using. I similarly want to avoid a home-brewed migration system! Similar to Airflow, we may also add the migration step to the entrypoint for the catalog service. That would require deployments of the catalog in order to apply those changes, but it would make certain that the local environment is always up-to-date. I think paired with a DAG for executing the same migrations in production without requiring a restart, that would cover all our needs.

sarayourfriend · 2023-06-15T00:11:23Z

That would require deployments of the catalog in order to apply those changes

That seems like a fine way to manage the migrations anyway. However, it would force us to follow zero-downtime principles more closely, which I don't think we think of for the catalog db at the moment.

But that sounds good.

Do you think this issue should be an implementation plan, @AetherUnbound? I think so, but if you think it's straightforward enough maybe just updating the issue with specific instructions to use Alembic, codify the existing DDL, configure the local environment to run them in the entrypoint of the upstream_db, and create the DAG? Anything else that would warrant a more thoughtful plan?

AetherUnbound · 2023-06-15T03:04:55Z

That seems like a fine way to manage the migrations anyway. However, it would force us to follow zero-downtime principles more closely, which I don't think we think of for the catalog db at the moment.

You're right, we're a ways away from a zero-downtime deployment on the catalog. However, the catalog database is one that Airflow interacts with, not necessarily one that it depends on the same way that the API depends on its database. I don't think it should be necessary to deploy the service when the dag-sync script we have deploys updates to the DAG files (which might use that database) every minute. Certainly migrating on deploy would also be necessary, but we can trigger migrations without having to stop all running DAGs.

Do you think this issue should be an implementation plan, @AetherUnbound? I think so, but if you think it's straightforward enough maybe just updating the issue with specific instructions to use Alembic, codify the existing DDL, configure the local environment to run them in the entrypoint of the upstream_db, and create the DAG? Anything else that would warrant a more thoughtful plan?

I think an implementation plan would be a great idea for this, especially to get feedback from folks and make sure we haven't missed anything! I'll convert the issue 🙂

AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Oct 8, 2021

AetherUnbound mentioned this issue Apr 17, 2023

Set airflow user password in DDL from environment #1695

Open

1 task

AetherUnbound mentioned this issue Apr 17, 2023

[RFC] Catalog & API 3D Model Support #1651

Closed

15 tasks

krysal mentioned this issue Oct 7, 2022

Normalize data models #244

Open

29 tasks

obulat mentioned this issue Feb 17, 2023

Catalog Database Migration Plan #416

Open

3 tasks

obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023

github-project-automation bot added this to Openverse Backlog Apr 19, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 19, 2023

obulat transferred this issue from WordPress/openverse-catalog Apr 19, 2023

AetherUnbound mentioned this issue Jun 9, 2023

Use a migrations management tool for upstream_db and thus for production database #2364

Closed

sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents and removed 🟨 priority: medium Not blocking but should be addressed soon labels Jun 12, 2023

AetherUnbound changed the title ~~[Infrastructure] Investigate the use of alembic for openledger migrations~~ Implementation Plan: Investigate the use of alembic for openledger migrations Jun 15, 2023

AetherUnbound added the 🧭 project: implementation plan An implementation plan for a project label Jun 15, 2023

sarayourfriend mentioned this issue Aug 28, 2023

Remove popularity constants view #2883

Merged

8 tasks

krysal added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Dec 5, 2023

sarayourfriend mentioned this issue May 21, 2024

Convert longer media varchar fields to text in the catalog db #4357

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation Plan: Investigate the use of alembic for openledger migrations #1836

Implementation Plan: Investigate the use of alembic for openledger migrations #1836

AetherUnbound commented Oct 8, 2021 •

edited

Loading

zackkrida commented Nov 2, 2021 •

edited

Loading

AetherUnbound commented Nov 2, 2021

dhruvkb commented Nov 2, 2021

AetherUnbound commented Nov 11, 2021

zackkrida commented Nov 11, 2021

sarayourfriend commented Jun 12, 2023

AetherUnbound commented Jun 14, 2023

sarayourfriend commented Jun 15, 2023

AetherUnbound commented Jun 15, 2023

Implementation Plan: Investigate the use of alembic for openledger migrations #1836

Implementation Plan: Investigate the use of alembic for openledger migrations #1836

Comments

AetherUnbound commented Oct 8, 2021 • edited Loading

Current Situation

Suggested Improvement

Benefit

Alternatives

Additional context

Footnotes

zackkrida commented Nov 2, 2021 • edited Loading

AetherUnbound commented Nov 2, 2021

dhruvkb commented Nov 2, 2021

AetherUnbound commented Nov 11, 2021

zackkrida commented Nov 11, 2021

sarayourfriend commented Jun 12, 2023

AetherUnbound commented Jun 14, 2023

sarayourfriend commented Jun 15, 2023

AetherUnbound commented Jun 15, 2023

AetherUnbound commented Oct 8, 2021 •

edited

Loading

zackkrida commented Nov 2, 2021 •

edited

Loading