Refactor Freesound to use ProviderDataIngester #746
Conversation
Looking very good! Just a few minor things, mostly the `license` param -- although I did get data locally, so that's kind of interesting!
This is really exciting! 😍
if main_audio is None:

def get_batch_data(self, response_json):
    if response_json:
        # Freesound sometimes returns results that are just "None", filter these out
Nice, I like handling this here!
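For context, a minimal sketch of what that filtering might look like inside `get_batch_data` (the "results" key and the exact return shape here are assumptions for illustration, not lifted from the PR):

def get_batch_data(self, response_json):
    # Hypothetical sketch: pull the batch out of the search response and drop
    # entries that Freesound returns as literal nulls.
    if response_json:
        results = response_json.get("results")  # "results" key assumed
        if results:
            return [item for item in results if item is not None]
    return None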
main_file = self._get_preview_filedata(
    "preview-hq-mp3", previews["preview-hq-mp3"]
)
main_file["audio_url"] = main_file.pop("url")
I can see this was in the old code, but I'm not sure why `_get_preview_filedata` doesn't just name `url` as `audio_url`?
I'm not 100% sure, but I think it's because the URL field is called `url` for `alt_files`, and `audio_url` for the main audio in the AudioStore. And `_get_preview_filedata` is mainly used for populating the `alt_files`, not for the main file.
It's definitely `audio_url` in the `AudioStore` for the main file:

audio_url: str,
I checked the database and it does look like we're using `url` in the `alt_files` dict:
deploy@localhost:openledger> select url, alt_files from audio where provider='freesound' limit 1;
-[ RECORD 1 ]-------------------------
url | https://freesound.org/data/previews/100/100041_1578278-hq.mp3
alt_files | [{"url": "https://freesound.org/apiv2/sounds/100041/download/", "bit_rate": "0", "filesize": "5056680", "filetype": "wav", "sample_rate": "44100"}]
SELECT 1
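To make the naming difference concrete, here is roughly how that row's fields would line up on the way into the AudioStore (the surrounding dict is illustrative only, not the exact call the script makes):

# Illustrative only: the main file is keyed as "audio_url" when handed to the
# AudioStore, while each alternate file inside alt_files keeps a plain "url" key.
record = {
    "audio_url": "https://freesound.org/data/previews/100/100041_1578278-hq.mp3",
    "alt_files": [
        {
            "url": "https://freesound.org/apiv2/sounds/100041/download/",
            "bit_rate": "0",
            "filesize": "5056680",
            "filetype": "wav",
            "sample_rate": "44100",
        }
    ],
}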
I can see that the AudioStore expects `audio_url`, and that `alt_files` uses `url` (notably the `audio_url` also gets converted to `url` for the main file in the actual DB, which can be a little confusing; I opened #784 for it).

But it looks like `_get_preview_filedata` is only ever used once, to get the url and other data for the main file? I don't see why it couldn't just return the key as `audio_url` so it doesn't have to get renamed here. Unless I'm missing another place where it's being used.
OH, I see what you're saying now, apologies! I'm not sure it's necessary either, I'll rework this a bit so it's clearer!
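One possible shape for that rework, sketched as a hypothetical helper (the method name and the filetype default are illustrative, not from the PR):

def _get_main_filedata(self, previews):
    # Hypothetical rework: return the main file's data keyed as "audio_url"
    # directly, so the caller no longer has to pop()/rename a "url" key.
    main_url = previews.get("preview-hq-mp3")
    if not main_url:
        return None
    return {
        "audio_url": main_url,
        "filetype": "mp3",  # assumed: the hq preview is an mp3
    }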
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was updated 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s). @AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
I let the DAG run for 10+ minutes and got nothing, not sure why. I have an API key and confirmed it worked by doing a few requests with curl.
def ingest_records(self, **kwargs):
    for license_name in [
        "Attribution",
        "Attribution Noncommercial",
        "Creative Commons 0",
    ]:
        logger.info(f"Obtaining audio records under license '{license_name}'")
        super().ingest_records(license_name=license_name, **kwargs)
AFAIK Freesound provides only CC-licensed tracks, so is this custom code needed at all? They also mention they used to include the Sampling+ license (now retired); we have ingested a few of those as well, but they will now be omitted here.
What do you think?
I'm glad you pointed this out! I looked more into their API's `filter` notes, and apparently we're using this entirely wrong! The `license` param is completely ignored in the request; I tried `Attribution` locally and saw that I got back a ton of different license types within the first 25 results:
In [14]: {x["license"] for x in r[0]}
Out[14]:
{'http://creativecommons.org/licenses/by-nc/3.0/',
'http://creativecommons.org/licenses/by/3.0/',
'http://creativecommons.org/publicdomain/zero/1.0/',
'https://creativecommons.org/licenses/by-nc/4.0/',
'https://creativecommons.org/licenses/by/4.0/'}
As you mention, I don't think this logic is needed at all (and we obviously aren't even using it presently). I think it'd be best to remove it, but we could also create an issue for the stability milestone to have this filter by license type correctly. Given the nature of the script I don't think there's any advantage to doing so. What do you think @WordPress/openverse-catalog?
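For reference, if we ever do want server-side license filtering, Freesound's search API takes a separate `filter` query parameter rather than a bare `license` param; something along these lines should work (a sketch based on the public API docs -- I haven't verified this request end to end, and the token value is a placeholder):

import requests

params = {
    "query": "",
    "filter": 'license:"Attribution"',
    "token": "YOUR_API_KEY",  # placeholder
}
response = requests.get("https://freesound.org/apiv2/search/text/", params=params)
licenses = {result["license"] for result in response.json()["results"]}
print(licenses)  # should only contain CC BY URLs if the filter is honored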
It makes sense that it wasn't doing anything, especially considering we were previously using `license_name` as the param 😮 I guess the intention must have been to add it to the filter? But even so, you're right, it would be unnecessary!
I think removing this as part of this PR is completely fine, because as you point out it doesn't change the behavior of the script at all 😄 Well spotted!
Huh, I was able to run this just fine locally 😮 Can anyone else with an API key confirm that they're receiving records when they run this @WordPress/openverse-catalog?
I tried rerunning it for some 20 minutes and got the results! I was too impatient 😆 So it looks great! As I mentioned earlier, I just think we can avoid the whole license loop given all the tracks are CC licensed.
Thanks for checking @krysal! I've gone ahead and removed that logic - if folks have any concerns though, feel free to voice them 🙂
Co-authored-by: Staci Cooper <[email protected]>
Force-pushed from b094f4f to b04cbe3
Excellent!
🎉 Looks great! 🎉
Fixes
Fixes WordPress/openverse#1521 by @stacimc
Description
This PR refactors the Freesound provider script to use the `ProviderDataIngester` class. It's a pretty standard refactor, aside from some more unique functions that have additional retries built in.
One thing I noticed is that the original provider script is set up to allow processing a subset of data, but it has been run in such a way that it consumes the entire dataset. I think this is fine for now, but as soon as we have a full, successful run, this should be turned into a daily dated DAG. This will be a much more efficient way to run the DAG! Getting a successful run will require some additional error handling for failures that have occurred in the past, which I'm hoping to enumerate in another ticket.
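As a rough illustration of what a dated run could mean for Freesound specifically, the search API's `filter` syntax appears to support a `created` date range, so a per-day ingestion might constrain its query like this (hypothetical, not part of this PR; the field name and range format are assumptions from the API docs):

from datetime import datetime, timedelta

def build_dated_query_params(date_str: str) -> dict:
    # Hypothetical sketch: restrict a Freesound search to sounds created on a
    # single day using the Solr-style "filter" syntax from the API docs.
    start = datetime.strptime(date_str, "%Y-%m-%d")
    end = start + timedelta(days=1)
    return {
        "filter": f"created:[{start:%Y-%m-%dT%H:%M:%SZ} TO {end:%Y-%m-%dT%H:%M:%SZ}]",
        "page_size": 150,
    }

print(build_dated_query_params("2022-09-01"))
# {'filter': 'created:[2022-09-01T00:00:00Z TO 2022-09-02T00:00:00Z]', 'page_size': 150}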
Testing Instructions
just recreate && just test