Use infrequent access when uploading provider TSVs #3810

Merged
merged 4 commits into main from fix/change-s3-storage-class on Feb 27, 2024

Conversation

@sarayourfriend (Collaborator) commented Feb 20, 2024

Fixes

Related to #1787 by @obulat

This does not close the issue, as we still need to run the s3 command noted in the issue to backfill the storage class onto all existing objects.

Description

Updates the provider ingestion workflow DAG to use the standard infrequent access storage class for provider TSVs.

I've documented in the code the rationale for setting the storage class eagerly, despite the fact that we access the object immediately after upload. I don't think it makes sense to add a new task that lowers the storage class to IA after the DAG finishes: even with the roughly one-tenth of a US cent spent on the retrieval fee, we will still save >99.99% of the money we would otherwise spend on storage. The difference is so small it isn't worth the extra code complexity. The note about the increased energy usage of an additional task (and all the requests it would need to send) further justifies the approach, even though that saving doesn't show up on our AWS bill.
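
To illustrate the approach, here is a minimal sketch using plain boto3 (not the repository's actual helper; the bucket, key, and file names are placeholders) of uploading a TSV with the storage class set eagerly at upload time:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key/file names; the real values come from the DAG configuration.
s3.upload_file(
    Filename="/tmp/provider_images.tsv",
    Bucket="openverse-catalog",
    Key="provider_tsvs/provider_images.tsv",
    # Setting the storage class at upload time means no follow-up request is
    # needed to transition the object to infrequent access later.
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)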

The configuration is stored on the provider workflow, to allow for theoretical flexibility in setting different storage classes for different providers. For example, we could use single-zone storage for non-dated DAGs: because those DAGs recreate the data in full on every run, losing the objects if the AZ holding them failed or went temporarily offline would not be a problem, as we could simply reload that provider's data.
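
As a rough sketch, a per-provider setting along these lines is what the paragraph above has in mind (the class and field names here are illustrative, not necessarily those used in the repository):

from dataclasses import dataclass

@dataclass
class ProviderWorkflow:
    dag_id: str
    # Default to infrequent access; individual providers could override this,
    # e.g. with "ONEZONE_IA" for non-dated DAGs whose data is fully recreated
    # on every run.
    tsv_storage_class: str = "STANDARD_IA"

wikimedia = ProviderWorkflow(dag_id="wikimedia_commons_workflow")
print(wikimedia.tsv_storage_class)  # STANDARD_IA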

I also did not bother changing the storage class for staging objects because (a) the loader workflow DAG is retired and (b) those objects are deleted immediately when the DAG completes successfully anyway.

As noted in env.template and the provider ingestion configuration data class, MinIO does not support the AWS storage classes other than STANDARD (the default) and REDUCED_REDUNDANCY. Therefore, our local environments need to use a different, non-default value than the one used in production. The testing instructions below describe how to verify the behaviour.
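
Concretely, the local override is a single line in the catalog .env file, along these lines (REDUCED_REDUNDANCY is used here because it is a non-default class that MinIO accepts; check env.template for the exact value shipped with this PR):

# Override the production default (STANDARD_IA), which MinIO rejects, with a
# non-default class that MinIO accepts, so the storage class handling can still
# be exercised locally.
AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS=REDUCED_REDUNDANCY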

Testing Instructions

There are two things to verify in this PR:

  1. That the non-default storage class is used by the DAG
  2. That the local environment is able to run the DAG successfully by using a non-default storage class that is compatible with MinIO.

For both of these, we'll run the Wikimedia Commons ingestion workflow, as it's the easiest to work with. For all these steps, we will follow this basic pattern:

  1. Run the Wikimedia DAG
  2. Allow the load_mixed task to pull data until it logs x records retrieved (usually 160)
  3. Mark load_mixed as successful
  4. Observe the behaviour of copy_to_s3 (either success or failure, depending on the cases outlined below)
  5. Observe the final result of the DAG (success or failure, as above)
  6. In cases where we expect a success of the local run, confirm the value in Minio

Log in to your local minio at http://0.0.0.0:5011/ using test_key and test_secret as the username and password respectively.

First, test the behaviour on main, so that you understand the existing behaviour. In short, because the storage class is not specified in our API call to S3, the default is used; for both MinIO and AWS, this is the "STANDARD" storage class. Run the Wikimedia DAG on main and confirm in MinIO that the uploaded TSV has the "STANDARD" storage class. Note that MinIO does not explicitly list the storage class of an object when it is the default; this is reliable behaviour, and as you will see later, the storage class is shown when a non-default one is used.

If you absolutely want to confirm the storage class of the object, download the "inspect" zip (disable encryption), open the xl.meta file for the TSV inside the zip, and confirm that the string x-amz-storage-classšSTANDARD is present in the file.

Screenshot showing the standard storage class in the xl.meta file for the uploaded TSV
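
If you would rather script that check than eyeball the binary file, a small throwaway snippet like this (hypothetical, not part of the repo) works on the extracted xl.meta:

from pathlib import Path

# Adjust the path to point at the xl.meta file extracted from the "inspect" zip.
meta = Path("xl.meta").read_bytes()
print(b"x-amz-storage-class" in meta)  # True if a storage class is recorded
print(b"STANDARD" in meta)  # True when the default class was used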

Now that you understand the behaviour on main, check out the branch and either copy env.template to .env OR (if you don't want to override your .env entirely) copy the new AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS variable into your .env. Run just dup catalog to pick up the new environment variable.

Follow the steps above to run the Wikimedia DAG to completion, and expect a successful run. Navigate to the TSV in minio, click on it, and then scroll to the bottom of the sidebar to view the "Metadata" section, and confirm you see REDUCED_REDUNDANCY as the storage class:

Minio dashboard showing non-default storage class
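
If you prefer to verify outside the MinIO console, a quick boto3 check along these lines should also work; the endpoint port, bucket, and key below are assumptions about the local setup and will need adjusting to match your docker-compose configuration and the TSV path the DAG wrote:

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:5010",  # assumption: the local MinIO S3 API port
    aws_access_key_id="test_key",
    aws_secret_access_key="test_secret",
)
# Placeholder bucket and key; use the ones shown in the MinIO console.
head = s3.head_object(Bucket="openverse-storage", Key="path/to/wikimedia.tsv")
# S3 (and MinIO) omit StorageClass from the response when it is STANDARD.
print(head.get("StorageClass", "STANDARD"))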

This confirms that the DAG sends the storage class with the load_file operation as expected. Next, we must confirm that STANDARD_IA, the default set in the code, is used when the Airflow variable isn't present. The variable is currently only meant for testing, though it could be used in the future to temporarily set the storage class to something else without a code change; the DAG must still use STANDARD_IA by default.
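
The fallback being tested is roughly the following (a sketch of the idea, not necessarily the exact code in the DAG; the variable name corresponds to the AIRFLOW_VAR_-prefixed environment variable mentioned above):

from airflow.models import Variable

# Airflow exposes environment variables prefixed with AIRFLOW_VAR_ as Variables,
# so AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS is read as DEFAULT_S3_TSV_STORAGE_CLASS.
# When the variable isn't set at all, fall back to STANDARD_IA.
storage_class = Variable.get(
    "DEFAULT_S3_TSV_STORAGE_CLASS", default_var="STANDARD_IA"
)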

To do this, comment out AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS in your catalog .env file and run just dup catalog to refresh the environment. At this point, if you run the Wikimedia DAG, copy_to_s3 will fail with an invalid storage class error, because MinIO does not support STANDARD_IA and rejects the upload. However, the error does not say which storage class was actually sent. You can confirm it by adding the following code to the top of copy_file_to_s3:

import json
# Temporarily log the upload extra args so the storage class is visible in the task logs.
logger.error(json.dumps(extra_args))

That way, when you run the DAG, the logs for copy_to_s3 will show the storage class passed to the hook so you can confirm it is STANDARD_IA instead of some other value.

In this case, there will be nothing to confirm in Minio, as it will have rejected the upload due to the invalid storage class.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested review from a team as code owners February 20, 2024 00:40
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 20, 2024
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Feb 20, 2024
@dhruvkb (Member) left a comment

LGTM, I have some non-blocking comments.

Comment on lines +181 to +185
# incurring standard storage costs. If it absolutely needs to be rationalised,
# consider the amount of energy spent on the extra request to S3 to update the
# storage cost to try to get around a retrieval fee (which, again, will not
# actually cost more, all things considered). Saving that energy could melt
# the glaciers all that much more slowly.
Member

Chuckled at this! While I feel odd about having such a large, informal comment in the code, thanks for thinking about the glaciers.

Collaborator Author

I wasn't sure whether to leave this in the code or just in the PR description, for basically the reason of it being informal and more about the decision rather than the implementation.

What do you think? I'm comfortable removing it if it isn't helpful to have as in-code documentation.

Collaborator

I think it's okay to leave in! The context for this is important and will be useful to have right alongside the value itself.

@@ -200,6 +200,10 @@ init:
down *flags:
just dc down {{ flags }}

# Take all services down then call the specified app's up recipe. ex.: `just dup catalog` is useful for restarting the catalog with new environment variables
Member

A shorter comment would help as this is printed as the help output. The example doesn't add much and can be left out, if that helps.

Member

Also, what does "dup" mean? Is it an abbreviation for something? I don't get it.

Collaborator Author

It's an abbreviation for down/up, though it also looks like a commonly used abbreviation for duplicate.

I'll shorten the comment for sure. If you have a suggestion for an alternative name, I'm happy to change it!

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: infra Related to the Terraform config and other infrastructure and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Feb 22, 2024
@sarayourfriend (Collaborator Author)

Updating the tests for the new parameter now

@AetherUnbound (Collaborator) left a comment

This looks great to me, thanks for the explanations on everything! One nit but take it or leave it. Local testing worked as expected 🚀

catalog/env.template: outdated review comment (resolved)
Co-authored-by: Madison Swain-Bowden <[email protected]>
@sarayourfriend sarayourfriend merged commit 2cffcb9 into main Feb 27, 2024
39 checks passed
@sarayourfriend sarayourfriend deleted the fix/change-s3-storage-class branch February 27, 2024 03:31