Use infrequent access when uploading provider TSVs #3810
Conversation
LGTM, I have some non-blocking comments.
# incurring standard storage costs. If it absolutely needs to be rationalised,
# consider the amount of energy spent on the extra request to S3 to update the
# storage cost to try to get around a retrieval fee (which, again, will not
# actually cost more, all things considered). Saving that energy could melt
# the glaciers all that much more slowly.
Chuckled at this! While I feel odd about having such a large, informal comment in the code, thanks for thinking about the glaciers.
I wasn't sure whether to leave this in the code or just in the PR description, mostly because it's informal and more about the decision than the implementation.
What do you think? I'm comfortable removing it if it isn't helpful to have as in-code documentation.
I think it's okay to leave in! The context for this is important and will be useful to have right alongside the value itself.
@@ -200,6 +200,10 @@ init:
down *flags:
    just dc down {{ flags }}

# Take all services down then call the specified app's up recipe. ex.: `just dup catalog` is useful for restarting the catalog with new environment variables
A shorter comment would help as this is printed as the help output. The example doesn't add much and can be left out, if that helps.
Also, what does "dup" mean? Is it an abbreviation for something? I don't get it.
It's an abbreviation for down/up, but it does look like a commonly used abbreviation for duplicate.
I'll shorten the comment for sure, and if you have a suggestion for an alternative name, I'm happy to change it!
Updating the tests for the new parameter now
Force-pushed from e9e16d5 to 7307784
This looks great to me, thanks for the explanations on everything! One nit but take it or leave it. Local testing worked as expected 🚀
Co-authored-by: Madison Swain-Bowden <[email protected]>
Fixes
Related to #1787 by @obulat
This does not close the issue, as we still need to run the s3 command noted in the issue to backfill the storage class onto all existing objects.
Description
Updates the provider ingestion workflow DAG to use the standard infrequent access storage class for provider TSVs.
I've documented the rationale for setting the storage class eagerly, despite immediately accessing the object, in the code. I really do not think it makes sense to add a new task that sets the storage class down to IA after the DAG is finished: despite the tenth of a US penny spent on the retrieval fee, we will still save >99.99% of the money we would otherwise spend. The difference is vanishingly small and not worth the code complexity. The note about the increased energy usage of an additional task (and all the requests it would need to send) further justifies the decision, even though that energy doesn't show up on our AWS bill.
The configuration is stored on the provider workflow, to allow for theoretical flexibility in setting different storage classes for different providers. For example, we could use single-zone storage for non-dated DAGs: those recreate their data from scratch on every run, so it wouldn't be a problem to fully reload a provider's data if the AZ holding it failed or went temporarily offline.
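To make the idea concrete, here is a rough, hypothetical sketch (not the actual catalog code) of what a per-workflow storage class could look like: a field on the provider workflow configuration with an infrequent-access default, applied at upload time via boto3's `ExtraArgs`. The class, field, and function names below are illustrative assumptions.

```python
from dataclasses import dataclass

import boto3


@dataclass
class ProviderWorkflow:
    # Hypothetical configuration; the real dataclass has many more fields.
    dag_id: str
    # Default to infrequent access for provider TSVs; individual providers
    # could override this (e.g. ONEZONE_IA for non-dated DAGs).
    tsv_storage_class: str = "STANDARD_IA"


def upload_tsv(workflow: ProviderWorkflow, tsv_path: str, bucket: str, key: str) -> None:
    # boto3 lets the storage class be set on the initial upload via ExtraArgs,
    # so no follow-up request is needed to change it after the fact.
    s3 = boto3.client("s3")
    s3.upload_file(
        tsv_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": workflow.tsv_storage_class},
    )
```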
I also did not bother changing the storage class for staging objects because (a) the loader workflow DAG is retired and (b) those objects are deleted as soon as the DAG succeeds anyway.
As noted in `env.template` and the provider ingestion configuration data class, minio does not support the AWS storage classes aside from `STANDARD` (the default). Therefore, our local environments need to use a different non-default value. The testing instructions below describe how to verify the behaviour.
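For illustration, the default handling described above might be read in the DAG roughly like this (a sketch only; the exact lookup in the PR may differ). It relies on the standard Airflow behaviour that an `AIRFLOW_VAR_<NAME>` environment variable is exposed as the Airflow variable `<NAME>`.

```python
from airflow.models import Variable

# AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS in the environment is exposed as the
# Airflow variable DEFAULT_S3_TSV_STORAGE_CLASS. Fall back to STANDARD_IA when
# the variable isn't set, which is the expected production behaviour.
storage_class = Variable.get("DEFAULT_S3_TSV_STORAGE_CLASS", default_var="STANDARD_IA")
```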
Testing Instructions

There are two things to verify in this PR:

1. The storage class configured via the Airflow variable is actually sent with the upload.
2. `STANDARD_IA` is used by default when the variable isn't present.
For both of these, we'll run the Wikimedia Commons ingestion workflow, as it's the easiest to work with. For all of these steps, we will follow this basic pattern:

1. Run the DAG and let the `load_mixed` task pull data until it logs `x records retrieved` (usually 160)
2. Mark `load_mixed` as successful
3. Wait for `copy_to_s3` to finish (either success or failure, depending on the cases outlined below)

Log in to your local minio at http://0.0.0.0:5011/ using `test_key` and `test_secret` as the username and password respectively.

First, test the behaviour on main, so that you understand what the existing behaviour here is. In short, because the storage class is not specified in our API call to S3, the default is used. For both minio and AWS, this is the `STANDARD` storage class. Run the Wikimedia DAG on `main` and confirm in minio that the uploaded TSV has the `STANDARD` storage class. Minio does not explicitly list the storage class of an object if it is the default; this is reliable behaviour, as you will see later. If you absolutely want to confirm the storage class of the object, download the "inspect" zip (disable encryption), open the `xl.meta` file for the TSV inside the zip, and confirm that the string `x-amz-storage-classšSTANDARD` is present in the file.

Now that you understand the behaviour on main, check out this branch and copy `env.template` to `.env`, OR (if you don't want to override your `.env` entirely) copy the new `AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS` variable into your `.env`. Run `just dup catalog` to get the new environment variable in.

Follow the steps above to run the Wikimedia DAG to completion, and expect a successful run. Navigate to the TSV in minio, click on it, then scroll to the bottom of the sidebar to view the "Metadata" section and confirm you see `REDUCED_REDUNDANCY` as the storage class.

This confirms that the DAG is sending the storage class with the `load_file` operation as expected. However, we must also confirm that `STANDARD_IA`, set as the default in the code, is used when the Airflow variable isn't present. The variable is currently only meant for testing, though it could be used in the future to avoid a code change if we need to temporarily set the storage class to something else. Either way, we must ensure that the DAG uses `STANDARD_IA` by default.

To do this, comment out `AIRFLOW_VAR_DEFAULT_S3_TSV_STORAGE_CLASS` in your catalog `.env` file and run `just dup catalog` to refresh the environment variables. At this point, if you ran the Wikimedia DAG you would see `copy_to_s3` fail with an invalid storage class error. This is because minio does not support `STANDARD_IA` and rejects the upload. However, the error does not specify which storage class was used. You can confirm it by adding the following code to the top of `copy_file_to_s3`:
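(The exact snippet from the PR isn't reproduced here; the following is a minimal sketch of that kind of logging. It assumes, hypothetically, that `copy_file_to_s3` receives the storage class through a `storage_class` parameter; the real signature may differ.)

```python
import logging

logger = logging.getLogger(__name__)


def copy_file_to_s3(tsv_file_path, s3_bucket, storage_class, **kwargs):
    # Temporary debugging only: surface the storage class in the copy_to_s3
    # task logs so you can see which value is passed to the S3 hook.
    logger.info("Copying %s to S3 with storage class %s", tsv_file_path, storage_class)
    # ... the existing upload logic continues below ...
```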
That way, when you run the DAG, the logs for `copy_to_s3` will show the storage class passed to the hook, so you can confirm it is `STANDARD_IA` instead of some other value.

In this case, there will be nothing to confirm in minio, as it will have rejected the upload due to the invalid storage class.
Checklist
- My pull request has a descriptive title (e.g. not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
Developer Certificate of Origin