
Staging database restore DAG: base restore #2099

Merged
AetherUnbound merged 24 commits into main from feature/staging-database-recreation-dag on May 23, 2023

Conversation

AetherUnbound (Collaborator)

Fixes

This is a step towards #1989 but does not address it completely (steps 12+ from the implementation plan will be added in follow-up PRs).

Description

This PR adds the initial restore-from-snapshot steps of the staging database restore DAG. See the implementation plan linked above for the full definition.

I recommend reviewing in the following order:

  • constants.py
  • utils.py
  • staging_database_restore.py
  • staging_database_restore_dag.py

[Screenshot: Screenshot_2023-05-12_15-47-27]

Since this doesn't have the last required steps for the DAG, we won't enable it once this is merged. I felt that splitting this up into several PRs was a better route to take though, otherwise the full DAG would have been huge!

Testing Instructions

Tests should pass locally. You can also set the SKIP_STAGING_DATABASE_RESTORE Airflow Variable to true and kick off the DAG to see that it should skip everything.
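
One way to set that Variable locally, equivalent to running airflow variables set SKIP_STAGING_DATABASE_RESTORE true from the Airflow CLI:

    from airflow.models import Variable

    Variable.set("SKIP_STAGING_DATABASE_RESTORE", "true")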

Warning
If you want to test the actual restore operations, you can do so by setting the following environment variables:
AWS_RDS_CONN_ID=aws_rds
AIRFLOW_CONN_AWS_RDS=aws://<access-key>:<access-secret>@?region_name=us-east-1

This will run the restore & delete operations, so be advised!

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).


@AetherUnbound AetherUnbound requested a review from a team as a code owner May 13, 2023 00:37
@AetherUnbound AetherUnbound requested review from obulat and stacimc May 13, 2023 00:37
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label May 13, 2023
@AetherUnbound AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels May 15, 2023
stacimc (Collaborator) left a comment:
🥳 This looks great! No blocking feedback. Code lgtm and I did test locally with SKIP_STAGING_DATABASE_RESTORE. Splitting this into another PR with the next steps makes total sense to me as well.

I learned so much about TaskFlow reviewing this 😄 Very, very cool.

should_skip >> staging_details

restore_snapshot = restore_staging_from_snapshot(latest_snapshot, staging_details)
ensure_snapshot_ready >> restore_snapshot
stacimc (Collaborator):

This all works as is, but I'm curious why we didn't need to explicitly set either:

latest_snapshot >> ensure_snapshot_ready

or

staging_details >> restore_snapshot

AetherUnbound (Collaborator, Author):

This is a little sneaky, behind-the-scenes stuff from the TaskFlow API 😅 The restore_snapshot = restore_staging_from_snapshot(latest_snapshot, staging_details) line creates an implicit dependency from latest_snapshot and staging_details to restore_snapshot, without us having to define that relationship explicitly with the >> operator. Same for ensure_snapshot_ready: it takes latest_snapshot as one of its params, so Airflow knows implicitly that it has to run that step first!
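
Here's a minimal, self-contained sketch of that behavior (the DAG and task names are illustrative, not the ones in this PR):

    from datetime import datetime

    from airflow.decorators import dag, task

    @task
    def get_latest_snapshot() -> str:
        return "snapshot-123"

    @task
    def restore(snapshot_id: str):
        print(f"restoring from {snapshot_id}")

    @dag(schedule=None, start_date=datetime(2023, 1, 1))
    def taskflow_dependency_example():
        snapshot = get_latest_snapshot()
        # Passing one task's output into another registers the dependency
        # automatically; no explicit >> is needed for this edge.
        restore(snapshot)

    taskflow_dependency_example()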

raise ValueError(f"No staging DB found for {constants.STAGING_IDENTIFIER}")
staging_db = instances[0]
# While it might be tempting to log this information, it contains sensitive
# values. Instead, we'll select only the information we need, then log that.
stacimc (Collaborator):

Is filtering out sensitive values only important for logging? I.e., could we restrict only the logging to REQUIRED_DB_INFO, but use everything to create the new DB? Would that provide any extra resilience to configuration drift, or can we expect that REQUIRED_DB_INFO will never need to change?

AetherUnbound (Collaborator, Author), May 23, 2023:

I think there are a few reasons why the current approach might be best:

  • Easy prevention of logging secrets accidentally
  • REQUIRED_DB_INFO isn't likely to change on the features that we care about at least
  • Some values returned from the details API have to have their shape changed before they can be fed into the create-from-snapshot API (see the lines below on subnet group name and VPC security group IDs, and the sketch after this list). There are dozens of values returned from the former API, so trying to manage the appropriate transformations on all of them seems unnecessary at this point IMO.
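
For illustration, here's a hedged sketch of the kind of reshaping involved, using boto3's RDS API directly (the instance identifier is a placeholder, and the PR itself goes through Airflow's RdsHook rather than a raw client):

    import boto3

    rds = boto3.client("rds")
    details = rds.describe_db_instances(
        DBInstanceIdentifier="dev-openverse-db"  # placeholder identifier
    )["DBInstances"][0]

    # describe_db_instances returns nested structures that the
    # restore-from-snapshot call expects in flattened form:
    restore_kwargs = {
        "DBInstanceClass": details["DBInstanceClass"],
        # Nested dict -> plain name string
        "DBSubnetGroupName": details["DBSubnetGroup"]["DBSubnetGroupName"],
        # List of dicts -> list of plain ID strings
        "VpcSecurityGroupIds": [
            sg["VpcSecurityGroupId"] for sg in details["VpcSecurityGroups"]
        ],
    }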



@task.short_circuit
def skip_restore(should_skip: bool = False) -> bool:
stacimc (Collaborator):

Why does this take should_skip as an argument? Will a value ever be passed this way?


log = logging.getLogger(__name__)


@task.short_circuit
stacimc (Collaborator):

Very cool, TIL!

AetherUnbound (Collaborator, Author):

Stumbled across this one by accident while looking for something else, couldn't help but use it!
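
For anyone else who hasn't seen it: @task.short_circuit wraps the callable in a ShortCircuitOperator, so a falsy return value skips everything downstream. A simplified sketch of skip_restore, leaving out the Variable.get check and the Slack notification the real task has:

    from airflow.decorators import task

    @task.short_circuit
    def skip_restore(should_skip: bool = False) -> bool:
        # Truthy return -> downstream tasks run;
        # falsy return -> they are all skipped.
        return not should_skip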

)
)
if not should_continue:
notify_slack.function(
stacimc (Collaborator):

Neat!
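
(For context: notify_slack here is a TaskFlow task, and its .function attribute is the original undecorated callable, which is what lets it be invoked inline rather than scheduled as its own task. A small sketch:)

    from airflow.decorators import task

    @task
    def notify_slack(text: str):
        print(f"(would send to Slack) {text}")

    # The decorated object still exposes the plain Python function, so it
    # can be called directly, e.g. from inside another running task:
    notify_slack.function("Skipping staging database restore")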


@functools.wraps(func)
def wrapped(*args, **kwargs):
rds_hook = kwargs.pop("rds_hook", None) or RdsHook(
stacimc (Collaborator):

💯
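
(A hedged reconstruction of the full decorator this hunk comes from; the decorator name and the connection-ID lookup are assumptions based on the visible lines and the env vars in the PR description:)

    import functools

    from airflow.providers.amazon.aws.hooks.rds import RdsHook

    def setup_rds_hook(func):  # hypothetical name
        """Inject an RdsHook unless the caller supplied one (e.g. a test mock)."""

        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            rds_hook = kwargs.pop("rds_hook", None) or RdsHook(
                aws_conn_id="aws_rds"  # assumed; see AWS_RDS_CONN_ID above
            )
            return func(*args, rds_hook=rds_hook, **kwargs)

        return wrapped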

if db_identifier not in constants.SAFE_TO_MUTATE:
raise ValueError(
f"The target function must be called with the staging database "
f"identifier ({constants.STAGING_IDENTIFIER}), not {db_identifier}"
stacimc (Collaborator):

This might be a little confusing, since the TEMP identifiers also work, right? Maybe "The target function must be called with a non-production database identifier"?

AetherUnbound (Collaborator, Author):

Oh shoot, that's a great point. This function went through a few iterations (initially it was only staging, then SAFE_TO_MUTATE came about). I'll update this!
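
With that wording applied, the guard might read:

    if db_identifier not in constants.SAFE_TO_MUTATE:
        raise ValueError(
            "The target function must be called with a non-production "
            f"database identifier, not {db_identifier}"
        )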

staging_database_restore.slack, "send_message"
) as mock_send_message:
actual = staging_database_restore.skip_restore.function(should_skip)
assert actual == (not should_skip)
stacimc (Collaborator):

Should we add tests that mock the SKIP_STAGING_DATABASE_RESTORE variable?

AetherUnbound (Collaborator, Author):

I think since that's a Variable.get with a simple or between it and should_skip, testing the conditions there is sufficient. This seemed easier than mocking the Variable.get step, but I can change it to that kind of test and remove the should_skip param if you think that's best!
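
If we did go that route, here's a hedged sketch of what the alternative test might look like (the patch target is assumed, and it keeps the send_message mock from the existing test):

    from unittest import mock

    import pytest

    # assumes the test module's existing import of staging_database_restore

    @pytest.mark.parametrize("variable_value, expected", [(True, False), (False, True)])
    def test_skip_restore_reads_variable(variable_value, expected):
        with mock.patch.object(
            staging_database_restore, "Variable"  # assumed lookup location
        ) as mock_variable, mock.patch.object(
            staging_database_restore.slack, "send_message"
        ):
            mock_variable.get.return_value = variable_value
            actual = staging_database_restore.skip_restore.function()
            assert actual == expected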

rule = TriggerRule.NONE_FAILED
with DAG(dag_id="test_make_rename_task_group", start_date=datetime(1970, 1, 1)):
group = staging_database_restore.make_rename_task_group("dibble", "crim", rule)
assert group.group_id == "rename_dibble_to_crim"
stacimc (Collaborator):

😄
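
(For posterity, a hedged sketch of the naming behavior this test pins down; the group's internal tasks are elided, and it has to be called inside a DAG context, as the test's with DAG(...) does:)

    from airflow.utils.task_group import TaskGroup
    from airflow.utils.trigger_rule import TriggerRule

    def make_rename_task_group(
        source: str, target: str, trigger_rule: TriggerRule
    ) -> TaskGroup:
        # Group ID is derived from the two database identifiers,
        # hence rename_dibble_to_crim in the test above.
        with TaskGroup(group_id=f"rename_{source}_to_{target}") as group:
            # The actual rename/await tasks (using trigger_rule) are
            # defined in the PR; elided here.
            ...
        return group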


obulat (Contributor) left a comment:

Tested this locally, and it skipped all of the steps :)
Oh, and I got scared at first by all the errors when I set the variable value to True (capitalized as in Python), so everything worked as expected.

AetherUnbound and others added 2 commits May 23, 2023 15:34
@AetherUnbound AetherUnbound merged commit 988381d into main May 23, 2023
@AetherUnbound AetherUnbound deleted the feature/staging-database-recreation-dag branch May 23, 2023 23:35