This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Freesound to use ProviderDataIngester #746

Merged
AetherUnbound merged 11 commits into main from feature/freesound-refactor#586 on Oct 12, 2022

Conversation

AetherUnbound
Contributor

Fixes

Fixes WordPress/openverse#1521 by @stacimc

Description

This PR refactors the Freesound provider script to use the ProviderDataIngester class.

It's a pretty standard refactor, aside from some more unique functions that have additional retries built in.

One thing I noticed is that the original provider script is set up to allow processing a subset of data, but it has been run in such a way that it consumes the entire dataset. I think this is fine for now, but as soon as we have a full, successful run, this should be turned into a daily dated DAG. This will be a much more efficient way to run the DAG! Getting a successful run will require some additional error handling for failures that have occurred in the past, which I'm hoping to enumerate in another ticket.

Testing Instructions

  1. just recreate && just test
  2. Run the DAG locally and verify that it respects whatever ingestion limit is set (note that there are 3 licenses it iterates through, so that ingestion limit will be hit 3 times)
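To illustrate why the limit is reached once per license, here is a minimal sketch. `ToyIngester` and `ToyFreesoundIngester` are hypothetical stand-ins, not the real ProviderDataIngester, and the sketch assumes the record count is tracked per call to `ingest_records`.

```python
class ToyIngester:
    """Hypothetical stand-in for a base ingester with a record limit."""

    def __init__(self, limit):
        self.limit = limit
        self.total = 0  # records ingested across all calls

    def fetch(self, **kwargs):
        # Pretend the API has an endless supply of records.
        while True:
            yield {"license_name": kwargs.get("license_name")}

    def ingest_records(self, **kwargs):
        count = 0  # assumption: the limit is counted per call
        for _record in self.fetch(**kwargs):
            if count >= self.limit:
                break
            count += 1
            self.total += 1


class ToyFreesoundIngester(ToyIngester):
    LICENSES = ["Attribution", "Attribution Noncommercial", "Creative Commons 0"]

    def ingest_records(self, **kwargs):
        # One full pass per license, so the limit applies to each pass.
        for license_name in self.LICENSES:
            super().ingest_records(license_name=license_name, **kwargs)


ingester = ToyFreesoundIngester(limit=100)
ingester.ingest_records()
print(ingester.total)  # 300: the limit of 100 is reached once per license
```

With a limit of 100 and three licenses, the toy run stops at 300 records rather than 100.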

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner September 29, 2022 01:21
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Sep 29, 2022
Contributor

@stacimc stacimc left a comment

Looking very good! Just a few minor things, mostly the license param -- although I did get data locally, so that's kind of interesting!

> I think this is fine for now, but as soon as we have a full, successful run, this should be turned into a daily dated DAG.

This is really exciting! 😍

if main_audio is None:
def get_batch_data(self, response_json):
    if response_json:
        # Freesound sometimes returns results that are just "None", filter these out
Contributor

Nice, I like handling this here!
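For reference, a minimal standalone sketch of this kind of filtering. It is not the code from this PR: it is written as a plain function, and the assumption that the records sit under a "results" key of the response JSON is mine.

```python
def get_batch_data(response_json):
    """Return the usable records from one response page.

    Returns None when there is no response at all.
    """
    if not response_json:
        return None
    # Assumption: the records are listed under a "results" key.
    results = response_json.get("results") or []
    # Drop entries that are literally None before further processing.
    return [item for item in results if item is not None]


page = {"results": [{"id": 1}, None, {"id": 2}]}
batch = get_batch_data(page)
```

Filtering at this step means later per-record code never has to guard against a None record.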

main_file = self._get_preview_filedata(
    "preview-hq-mp3", previews["preview-hq-mp3"]
)
main_file["audio_url"] = main_file.pop("url")
Contributor

I can see this was in the old code, but I'm not sure why _get_preview_filedata doesn't just name url audio_url?

Contributor

I'm not 100% sure, but I think it's because the URL field is called url for alt_files, and audio_url for the main audio in the AudioStore. And _get_preview_filedata is mainly used for populating the alt_files, not for the main file.

Contributor Author

It's definitely audio_url in the AudioStore for the main file:

I checked the database and it does look like we're using url in the alt_files dict:

deploy@localhost:openledger> select url, alt_files from audio where provider='freesound' limit 1;
-[ RECORD 1 ]-------------------------
url       | https://freesound.org/data/previews/100/100041_1578278-hq.mp3
alt_files | [{"url": "https://freesound.org/apiv2/sounds/100041/download/", "bit_rate": "0", "filesize": "5056680", "filetype": "wav", "sample_rate": "44100"}]
SELECT 1

Contributor

I can see that the AudioStore expects audio_url, and that alt_files uses url (notably, the audio_url also gets converted to url for the main file in the actual DB, which can be a little confusing; I opened #784 for it).

But it looks like _get_preview_filedata is only ever used once, to get the url and other data for the main file? I don't see why it couldn't just return the key as audio_url so it doesn't have to get renamed here. Unless I'm missing another place where it's being used.

Contributor Author

OH, I see what you're saying now, apologies! I'm not sure it's necessary either, I'll rework this a bit so it's clearer!
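One possible shape for that rework, as a rough sketch only. `get_preview_filedata` below is a hypothetical standalone helper with a made-up `url_key` parameter, not the method in the provider script, and it assumes preview types look like "preview-hq-mp3".

```python
def get_preview_filedata(preview_type, preview_url, url_key="url"):
    """Build file metadata for a preview URL (hypothetical helper)."""
    # Assumption: the file type is the last dash-separated part of
    # the preview type, e.g. "mp3" for "preview-hq-mp3".
    return {
        url_key: preview_url,
        "filetype": preview_type.split("-")[-1],
    }


previews = {
    "preview-hq-mp3": "https://freesound.org/data/previews/100/100041_1578278-hq.mp3"
}

# Rename after the fact, as in the snippet under review.
renamed = get_preview_filedata("preview-hq-mp3", previews["preview-hq-mp3"])
renamed["audio_url"] = renamed.pop("url")

# Ask for the final key name up front instead.
direct = get_preview_filedata(
    "preview-hq-mp3", previews["preview-hq-mp3"], url_key="audio_url"
)
```

Both paths produce the same dict; the second just avoids the extra pop at the call site.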

@openverse-bot
Contributor

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was updated 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[2].

@AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the process that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@AetherUnbound AetherUnbound requested a review from stacimc October 7, 2022 00:20
@zackkrida zackkrida requested a review from obulat October 7, 2022 15:02
Member

@krysal krysal left a comment

I let the DAG run for 10+ minutes and got nothing, and I'm not sure why. I have an API key and confirmed it worked by making a few requests with curl.

Comment on lines 55 to 62
def ingest_records(self, **kwargs):
    for license_name in [
        "Attribution",
        "Attribution Noncommercial",
        "Creative Commons 0",
    ]:
        logger.info(f"Obtaining audio records under license '{license_name}'")
        super().ingest_records(license_name=license_name, **kwargs)
Member

AFAIK Freesound provides only CC-licensed tracks, so is this custom code needed at all? They also mention they used to include the Sampling+ license (now retired). We have ingested a few of those as well, but they will now be omitted here.

What do you think?

Contributor Author

I'm glad you pointed this out! I looked more into their API's filter notes, and apparently we're using this entirely wrong! The license param is completely ignored in the request (I tried Attribution locally and got back a ton of different license types within the first 25 results):

In [14]: {x["license"] for x in r[0]}
Out[14]: 
{'http://creativecommons.org/licenses/by-nc/3.0/',
 'http://creativecommons.org/licenses/by/3.0/',
 'http://creativecommons.org/publicdomain/zero/1.0/',
 'https://creativecommons.org/licenses/by-nc/4.0/',
 'https://creativecommons.org/licenses/by/4.0/'}
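
The same check can be sketched offline. The `results` list here is a hypothetical sample (only the license URLs come from the output above), and treating "more than one distinct license" as evidence that the param was ignored is my own heuristic.

```python
from collections import Counter

# Hypothetical sample of results; real responses carry many more fields.
results = [
    {"id": 1, "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"id": 2, "license": "http://creativecommons.org/publicdomain/zero/1.0/"},
    {"id": 3, "license": "https://creativecommons.org/licenses/by-nc/4.0/"},
    {"id": 4, "license": "https://creativecommons.org/licenses/by/4.0/"},
]

# Tally how many records came back under each license.
license_counts = Counter(item["license"] for item in results)
distinct_licenses = set(license_counts)

# If a single license had been requested and honored, only one
# distinct license URL would be expected here.
filter_was_applied = len(distinct_licenses) == 1
```

A tally like this also shows how skewed a batch is toward one license, which the plain set does not.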

As you mention, I don't think this logic is needed at all (and we obviously aren't even using it presently). I think it'd be best to remove it, but we could also create an issue for the stability milestone to have this filter by license type correctly. Given the nature of the script I don't think there's any advantage to doing so. What do you think @WordPress/openverse-catalog?

Contributor

It makes sense that it wasn't doing anything, especially considering we were previously using license_name as the param 😮 I guess the intention must have been to add it to the filter? But even so, you're right, it would be unnecessary!

I think removing this as part of this PR is completely fine, because as you point out it doesn't change the behavior of the script at all 😄 Well spotted!

@AetherUnbound
Contributor Author

Huh, I was able to run this just fine locally 😮 Can anyone else with an API key confirm that they're receiving records when they run this @WordPress/openverse-catalog?

@krysal
Member

krysal commented Oct 10, 2022

I tried rerunning it for about 20 minutes and got results! I was too impatient 😆 So it looks great! As I mentioned earlier, I just think we can avoid the whole license loop given that all the tracks are CC licensed.

@AetherUnbound
Contributor Author

Thanks for checking @krysal! I've gone ahead and removed that logic - if folks have any concerns though, feel free to voice them 🙂

@AetherUnbound AetherUnbound force-pushed the feature/freesound-refactor#586 branch from b094f4f to b04cbe3 Compare October 11, 2022 19:07
Member

@krysal krysal left a comment

Excellent!

Contributor

@stacimc stacimc left a comment

🎉 Looks great! 🎉

@AetherUnbound AetherUnbound merged commit 8f92318 into main Oct 12, 2022
@AetherUnbound AetherUnbound deleted the feature/freesound-refactor#586 branch October 12, 2022 18:48