Refactor Freesound to use ProviderDataIngester #746
Conversation
Looking very good! Just a few minor things, mostly the `license` param -- although I did get data locally, so that's kind of interesting!
This is really exciting! 😍
if main_audio is None:

def get_batch_data(self, response_json):
    if response_json:
        # Freesound sometimes returns results that are just "None", filter these out
Nice, I like handling this here!
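For context, a minimal sketch of what that filtering might look like inside `get_batch_data` (the "results" key and the exact return shape here are assumptions for illustration, not lifted from the PR):

def get_batch_data(self, response_json):
    # Hypothetical sketch: pull the batch out of the search response and drop
    # entries that Freesound returns as literal nulls.
    if response_json:
        results = response_json.get("results")  # "results" key assumed
        if results:
            return [item for item in results if item is not None]
    return None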
main_file = self._get_preview_filedata(
    "preview-hq-mp3", previews["preview-hq-mp3"]
)
main_file["audio_url"] = main_file.pop("url")
I can see this was in the old code, but I'm not sure why `_get_preview_filedata` doesn't just name `url` as `audio_url`?
I'm not 100% sure, but I think it's because the URL field is called `url` for `alt_files`, and `audio_url` for the main audio in the AudioStore. And `_get_preview_filedata` is mainly used for populating the `alt_files`, not for the main file.
It's definitely `audio_url` in the `AudioStore` for the main file:

audio_url: str,
I checked the database and it does look like we're using `url` in the `alt_files` dict:
deploy@localhost:openledger> select url, alt_files from audio where provider='freesound' limit 1;
-[ RECORD 1 ]-------------------------
url | https://freesound.org/data/previews/100/100041_1578278-hq.mp3
alt_files | [{"url": "https://freesound.org/apiv2/sounds/100041/download/", "bit_rate": "0", "filesize": "5056680", "filetype": "wav", "sample_rate": "44100"}]
SELECT 1
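To make the naming difference concrete, here is roughly how that row's fields would line up on the way into the AudioStore (the surrounding dict is illustrative only, not the exact call the script makes):

# Illustrative only: the main file is keyed as "audio_url" when handed to the
# AudioStore, while each alternate file inside alt_files keeps a plain "url" key.
record = {
    "audio_url": "https://freesound.org/data/previews/100/100041_1578278-hq.mp3",
    "alt_files": [
        {
            "url": "https://freesound.org/apiv2/sounds/100041/download/",
            "bit_rate": "0",
            "filesize": "5056680",
            "filetype": "wav",
            "sample_rate": "44100",
        }
    ],
}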
I can see that the AudioStore expects `audio_url`, and that `alt_files` uses `url` (notably the `audio_url` also gets converted to `url` for the main file in the actual DB, which can be a little confusing; I opened #784 for it).

But it looks like `_get_preview_filedata` is only ever used once, to get the url and other data for the main file? I don't see why it couldn't just return the key as `audio_url` so it doesn't have to get renamed here. Unless I'm missing another place where it's being used.
OH, I see what you're saying now, apologies! I'm not sure it's necessary either, I'll rework this a bit so it's clearer!
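One possible shape for that rework, sketched as a hypothetical helper (the method name and the filetype default are illustrative, not from the PR):

def _get_main_filedata(self, previews):
    # Hypothetical rework: return the main file's data keyed as "audio_url"
    # directly, so the caller no longer has to pop()/rename a "url" key.
    main_url = previews.get("preview-hq-mp3")
    if not main_url:
        return None
    return {
        "audio_url": main_url,
        "filetype": "mp3",  # assumed: the hq preview is an mp3
    }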
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was updated 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s). @AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
I let the DAG run for 10+ minutes and got nothing, not sure why. I have an API key and confirmed it worked by doing a few requests with curl.
def ingest_records(self, **kwargs):
    for license_name in [
        "Attribution",
        "Attribution Noncommercial",
        "Creative Commons 0",
    ]:
        logger.info(f"Obtaining audio records under license '{license_name}'")
        super().ingest_records(license_name=license_name, **kwargs)
AFAIK Freesound provides only CC-licensed tracks, so is this custom code needed at all? They also mention they used to include the Sampling+ license (now retired); we have ingested a few of those as well, but they will now be omitted here.
What do you think?
I'm glad you pointed this out! I looked more into their API's `filter` notes, and apparently we're using this entirely wrong! The `license` param is completely ignored in the request; I tried `Attribution` locally and saw that I got back a ton of different license types within the first 25 results:
In [14]: {x["license"] for x in r[0]}
Out[14]:
{'http://creativecommons.org/licenses/by-nc/3.0/',
'http://creativecommons.org/licenses/by/3.0/',
'http://creativecommons.org/publicdomain/zero/1.0/',
'https://creativecommons.org/licenses/by-nc/4.0/',
'https://creativecommons.org/licenses/by/4.0/'}
As you mention, I don't think this logic is needed at all (and we obviously aren't even using it presently). I think it'd be best to remove it, but we could also create an issue for the stability milestone to have this filter by license type correctly. Given the nature of the script I don't think there's any advantage to doing so. What do you think @WordPress/openverse-catalog?
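For reference, if we ever do want server-side license filtering, Freesound's search API takes a separate `filter` query parameter rather than a bare `license` param; something along these lines should work (a sketch based on the public API docs -- I haven't verified this request end to end, and the token value is a placeholder):

import requests

params = {
    "query": "",
    "filter": 'license:"Attribution"',
    "token": "YOUR_API_KEY",  # placeholder
}
response = requests.get("https://freesound.org/apiv2/search/text/", params=params)
licenses = {result["license"] for result in response.json()["results"]}
print(licenses)  # should only contain CC BY URLs if the filter is honored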
It makes sense that it wasn't doing anything, especially considering we were previously using `license_name` as the param 😮 I guess the intention must have been to add it to the filter? But even so, you're right, it would be unnecessary!
I think removing this as part of this PR is completely fine, because as you point out it doesn't change the behavior of the script at all 😄 Well spotted!
Huh, I was able to run this just fine locally 😮 Can anyone else with an API key confirm that they're receiving records when they run this @WordPress/openverse-catalog?
I tried rerunning it for some 20 minutes and got the results! I was too impatient 😆 So it looks great! As I mentioned earlier, I just think we can avoid the whole license loop given all the tracks are CC licensed.
Thanks for checking @krysal! I've gone ahead and removed that logic - if folks have any concerns though, feel free to voice them 🙂
Co-authored-by: Staci Cooper <[email protected]>
Force-pushed from b094f4f to b04cbe3
Excellent!
🎉 Looks great! 🎉
Fixes
Fixes WordPress/openverse#1521 by @stacimc
Description
This PR refactors the Freesound provider script to use the `ProviderDataIngester` class. It's a pretty standard refactor, aside from some more unique functions that have additional retries built in.
One thing I noticed is that the original provider script is set up to allow processing a subset of data, but it has been run in such a way that it consumes the entire dataset. I think this is fine for now, but as soon as we have a full, successful run, this should be turned into a daily dated DAG. This will be a much more efficient way to run the DAG! Getting a successful run will require some additional error handling for failures that have occurred in the past, which I'm hoping to enumerate in another ticket.
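As a rough illustration of what a dated run could mean for Freesound specifically, the search API's `filter` syntax appears to support a `created` date range, so a per-day ingestion might constrain its query like this (hypothetical, not part of this PR; the field name and range format are assumptions from the API docs):

from datetime import datetime, timedelta

def build_dated_query_params(date_str: str) -> dict:
    # Hypothetical sketch: restrict a Freesound search to sounds created on a
    # single day using the Solr-style "filter" syntax from the API docs.
    start = datetime.strptime(date_str, "%Y-%m-%d")
    end = start + timedelta(days=1)
    return {
        "filter": f"created:[{start:%Y-%m-%dT%H:%M:%SZ} TO {end:%Y-%m-%dT%H:%M:%SZ}]",
        "page_size": 150,
    }

print(build_dated_query_params("2022-09-01"))
# {'filter': 'created:[2022-09-01T00:00:00Z TO 2022-09-02T00:00:00Z]', 'page_size': 150}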
Testing Instructions
just recreate && just test