Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add HEAD requests support to DelayedRequester #865

Merged
merged 8 commits into from
Nov 18, 2022

Conversation

twstokes
Copy link
Contributor

@twstokes twstokes commented Nov 10, 2022

Fixes

Description

This PR adds a head method to the DelayedRequester class and updates file size calls in the StockSnap and WordPress ingesters to use it.

Testing Instructions

See steps @AetherUnbound provided here.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Nov 10, 2022
@twstokes twstokes force-pushed the feature/delayed-requester-head branch from 180d6d2 to 367118f Compare November 10, 2022 17:44
@twstokes twstokes changed the title Add HEAD requests support to DelayedRequester Add HEAD requests support to DelayedRequester Nov 10, 2022
@obulat obulat added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Nov 12, 2022
@@ -50,23 +49,24 @@ def __init__(self, delay=0, headers=None):
self._last_request = 0
self.session = requests.Session()

def get(self, url, params=None, **kwargs):
def _make_request(self, method, url, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _make_request(self, method, url, **kwargs):
def _make_request(self, method: Literal['get', 'head'], url: str, **kwargs):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Hi @obulat - thanks for the suggestion. Could you help me understand this literal type constraint better? I was under the impression that the calls to _make_request were passing methods from the Requests module and not string values, but I'm probably missing something.

To experiment I changed the Literal params to something nonsensical (instead of get and head) and just lint and just test passed when I was expecting them to fail.

Also, if you're suggesting that we make the Requests module calls only within _make_request based on the string passed then that makes sense, but I wanted to get confirmation. Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah, that's right! This is actually of type callable, rather than Literal. I think the way it is presently is fine, but it might be worth adding the generic callable type annotation (trying to match the call signatures of both .head and .get will be a head-ache 😄)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for stepping in, @AetherUnbound, and sorry I couldn't reply sooner, @twstokes.


Required Arguments:

url: URL to make the request as a string.
params: Dictionary of query string params
**kwargs: Optional arguments that will be passed to `requests.get`
**kwargs: Optional arguments that will be passed to the `requests`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstrings params need to be updated, too: add method and remove params (or keep it if it's possible to document a **kwargs parameter).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 8c5d0a2.

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really love this approach! So nice to see you in Openverse, @twstokes!

@AetherUnbound
Copy link
Contributor

Looks like we could fold one more fix into this! WordPress/openverse#1578

@twstokes
Copy link
Contributor Author

Thank you @obulat and @AetherUnbound for the suggestions! I've been AFK but will have a chance to address these soon. 🙇

@twstokes twstokes marked this pull request as ready for review November 16, 2022 18:46
@twstokes twstokes requested a review from a team as a code owner November 16, 2022 18:46
@twstokes
Copy link
Contributor Author

Looks like we could fold one more fix into this! WordPress/openverse#1578

Hey @AetherUnbound! 👋

I attempted to:

  • use self.delayed_requester.head
  • remove the decorator on _get_audio_file_size

here, but ran into some exceptions that may be related to what you've dealt with in the past. At that point I didn't identify a quick fix and felt like I was going outside of the original mission of this PR.

If you have any guidance I'll be glad to wrap it into these changes. Thanks!

@twstokes
Copy link
Contributor Author

I wanted to highlight this comment by @krysal and that I'm not totally confident the responses are the same because I wasn't sure of how to test the HEAD requests directly.

@twstokes twstokes requested review from obulat and removed request for krysal, AetherUnbound and obulat November 16, 2022 22:21
@AetherUnbound
Copy link
Contributor

Thank you @twstokes! And ah, Freesound is indeed a special case. I think we can move forward with this PR without that as a fix, since we'll work to address it in WordPress/openverse-catalog#754 anyway. If you're interested in including it still, I think we'd want to keep the decorator but also add the common.requester.RetriesExceeded exception to the flaky_exceptions list.

Probably the best way to ensure that we're getting the same content back from the requests is to test running each of the altered provider DAGs locally and making sure we get the appropriate filesizes where expected! I will try to run each of those today and share the results 🙂

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran both the Stocksnap and WordPress providers locally (after the suggested changes below) and they looked good! We'll definitely need to address the return issue before this is merged.

If you'd like to try and run this yourself @twstokes, do the following:

  1. Add AIRFLOW_VAR_INGESTION_LIMIT=200 to your .env file
  2. Run just up
  3. Login to Airflow using airflow/airflow as user/pass respectively.
  4. Flip the switch on the stocksnap_workflow DAG and make sure it runs successfully (you can click on the DAG name itself to get to more useful views).

openverse_catalog/dags/common/requester.py Outdated Show resolved Hide resolved
@twstokes
Copy link
Contributor Author

Thanks for all your help @AetherUnbound 😄. I ran both the stocksnap_workflow and wordpress_workflow from Airflow and they succeeded. 🤞

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I have one more suggestion, but after that this is good to go. Thanks for all the iteration on this! 🚀

openverse_catalog/dags/common/requester.py Show resolved Hide resolved
@AetherUnbound AetherUnbound requested a review from obulat November 18, 2022 01:12
Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great documentation and typing!

@obulat obulat merged commit b9389af into WordPress:main Nov 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed
Projects
None yet
4 participants