This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add a Nappy provider DAG using ProviderDataIngester #796

Merged on Jan 16, 2023 (37 commits)

zackkrida
Member

@zackkrida zackkrida commented Oct 14, 2022

Fixes WordPress/openverse#1445 by @zackkrida

Description

Adds Nappy.co to the Catalog using the new provider DAG template in #790. This is mostly to test that PR. I've always wanted to write a provider DAG and this API was literally handwritten for us, so it was a perfect test case.

Sample results (csv)

identifier	created_on	updated_on	ingestion_type	provider	source	foreign_identifier	foreign_landing_url	url	thumbnail	width	height	filesize	license	license_version	creator	creator_url	title	meta_data	tags	watermarked	last_synced_with_source	removed_from_source	filetype	category
002e16c4-12af-4e60-ab5f-29094e1b8060	2022-10-14 00:13:07.763443+00	2022-10-14 00:13:07.763443+00	provider_api	nappy	nappy	62	https://nappy.co/photo/62/woman-standing-by-door	https://images.nappy.co/uploads/large/201592003677l8k2kzajrojyoftnmsxtcug3qntbhbbvyto9fh26i7s6ludlgm4pevyffjnuuhr1gj1pztzsjqvfbtigdn5jf0ssynibgkny8sl7.jpg					cc0	1.0	eyeforebony	https://nappy.co/eyeforebony	Woman Standing by door	"{""license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/"", ""raw_license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/""}"	"[{""name"": ""people_"", ""provider"": ""nappy""}, {""name"": ""building"", ""provider"": ""nappy""}, {""name"": ""person"", ""provider"": ""nappy""}, {""name"": ""human face"", ""provider"": ""nappy""}, {""name"": ""clothing"", ""provider"": ""nappy""}, {""name"": ""red"", ""provider"": ""nappy""}, {""name"": ""waist"", ""provider"": ""nappy""}, {""name"": ""street fashion"", ""provider"": ""nappy""}, {""name"": ""wall"", ""provider"": ""nappy""}, {""name"": ""shoulder"", ""provider"": ""nappy""}, {""name"": ""trousers"", ""provider"": ""nappy""}, {""name"": ""maroon"", ""provider"": ""nappy""}, {""name"": ""girl"", ""provider"": ""nappy""}, {""name"": ""fashion"", ""provider"": ""nappy""}, {""name"": ""casual dress"", ""provider"": ""nappy""}, {""name"": ""outdoor"", ""provider"": ""nappy""}, {""name"": ""woman"", ""provider"": ""nappy""}, {""name"": ""brick"", ""provider"": ""nappy""}, {""name"": ""standing"", ""provider"": ""nappy""}, {""name"": ""trouser"", ""provider"": ""nappy""}, {""name"": ""painted"", ""provider"": ""nappy""}, {""name"": ""serious"", ""provider"": ""nappy""}, {""name"": ""female"", ""provider"": ""nappy""}, {""name"": ""graffiti"", ""provider"": ""nappy""}, {""name"": ""window"", ""provider"": ""nappy""}, {""name"": ""white"", ""provider"": ""nappy""}, {""name"": ""sweater"", ""provider"": ""nappy""}, {""name"": ""necklace"", ""provider"": ""nappy""}, {""name"": ""door"", ""provider"": ""nappy""}, {""name"": ""afro"", ""provider"": ""nappy""}, {""name"": ""natural"", ""provider"": ""nappy""}, {""name"": ""hair"", ""provider"": ""nappy""}, {""name"": ""looking to the right"", ""provider"": ""nappy""}, {""name"": ""hand on hip"", ""provider"": ""nappy""}, {""name"": ""curls"", ""provider"": ""nappy""}]"	false	2022-10-14 00:13:07.763443+00	false	jpg	photograph
0045f635-d049-44f6-b8c1-f92c87b79275	2022-10-14 00:13:07.763443+00	2022-10-14 00:13:07.763443+00	provider_api	nappy	nappy	376	https://nappy.co/photo/376/woman-texting	https://images.nappy.co/uploads/large/20191208-7z8a1127-scaled-21595702861aw1ngtblnmpvgjtiye84bq9rb2fi42dadyubd5nds7tiboiuigvzfnmgdjbwstnwxbyhxtfulso2nxackh83cf95syga0jxivwcg.jpg					cc0	1.0	NappyStock	https://nappy.co/NappyStock	Woman texting	"{""license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/"", ""raw_license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/""}"	"[{""name"": ""person"", ""provider"": ""nappy""}, {""name"": ""mobile phone"", ""provider"": ""nappy""}, {""name"": ""indoor"", ""provider"": ""nappy""}, {""name"": ""human face"", ""provider"": ""nappy""}, {""name"": ""gadget"", ""provider"": ""nappy""}, {""name"": ""communication device"", ""provider"": ""nappy""}, {""name"": ""portable communications device"", ""provider"": ""nappy""}, {""name"": ""clothing"", ""provider"": ""nappy""}, {""name"": ""wall"", ""provider"": ""nappy""}, {""name"": ""phone"", ""provider"": ""nappy""}, {""name"": ""electronic device"", ""provider"": ""nappy""}, {""name"": ""telephone"", ""provider"": ""nappy""}, {""name"": ""mobile device"", ""provider"": ""nappy""}, {""name"": ""cellphone"", ""provider"": ""nappy""}, {""name"": ""computer"", ""provider"": ""nappy""}, {""name"": ""woman"", ""provider"": ""nappy""}, {""name"": ""text"", ""provider"": ""nappy""}, {""name"": ""texting"", ""provider"": ""nappy""}, {""name"": ""message"", ""provider"": ""nappy""}, {""name"": ""office"", ""provider"": ""nappy""}, {""name"": ""yellow"", ""provider"": ""nappy""}, {""name"": ""green"", ""provider"": ""nappy""}, {""name"": ""olive"", ""provider"": ""nappy""}, {""name"": ""smiling"", ""provider"": ""nappy""}, {""name"": ""smile"", ""provider"": ""nappy""}, {""name"": ""chain"", ""provider"": ""nappy""}, {""name"": ""hair"", ""provider"": ""nappy""}]"	false	2022-10-14 00:13:07.763443+00	false	jpg	photograph
0049dfb9-19f9-4a27-a657-6f194dcededd	2022-10-14 00:13:07.763443+00	2022-10-14 00:13:07.763443+00	provider_api	nappy	nappy	3524	https://nappy.co/photo/3524/girl-with-birthday-cake	https://images.nappy.co/uploads/large/bday-2-18441639516747pa8ckqmvanydoqzigmp0uovjstcdskegxrnxp1quek64okdg3jlc6bly7nk4w1fzgqthgxr6ccafwoqugutooxaqugjfpa99thcl.jpg					cc0	1.0	alyssasieb	https://nappy.co/alyssasieb	Girl with Birthday Cake	"{""license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/"", ""raw_license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/""}"	"[{""name"": ""people_"", ""provider"": ""nappy""}, {""name"": ""human face"", ""provider"": ""nappy""}, {""name"": ""indoor"", ""provider"": ""nappy""}, {""name"": ""smile"", ""provider"": ""nappy""}, {""name"": ""person"", ""provider"": ""nappy""}, {""name"": ""clothing"", ""provider"": ""nappy""}, {""name"": ""wall"", ""provider"": ""nappy""}, {""name"": ""floor"", ""provider"": ""nappy""}, {""name"": ""table"", ""provider"": ""nappy""}, {""name"": ""woman"", ""provider"": ""nappy""}, {""name"": ""girl"", ""provider"": ""nappy""}, {""name"": ""birthday"", ""provider"": ""nappy""}, {""name"": ""cake"", ""provider"": ""nappy""}, {""name"": ""celebrate"", ""provider"": ""nappy""}, {""name"": ""child"", ""provider"": ""nappy""}, {""name"": ""happy"", ""provider"": ""nappy""}]"	false	2022-10-14 00:13:07.763443+00	false	JPG	photograph
0090079c-5133-44c2-8e59-b31f864bdf00	2022-10-14 00:13:07.763443+00	2022-10-14 00:13:07.763443+00	provider_api	nappy	nappy	3755	https://nappy.co/photo/3755/lightbulb	https://images.nappy.co/uploads/large/stock-24-18441643628673dkiku8udymdvic76yjkeue4sknd05xqfut0auyqidws8kfwqzdezjdes3bb3zvnyseanudokyc6lgjpa436tllz8jawlirlejmr1.jpg					cc0	1.0	alyssasieb	https://nappy.co/alyssasieb	Lightbulb	"{""license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/"", ""raw_license_url"": ""https://creativecommons.org/publicdomain/zero/1.0/""}"	"[{""name"": ""person"", ""provider"": ""nappy""}, {""name"": ""wall"", ""provider"": ""nappy""}, {""name"": ""clothing"", ""provider"": ""nappy""}, {""name"": ""human face"", ""provider"": ""nappy""}, {""name"": ""girl"", ""provider"": ""nappy""}, {""name"": ""neutral"", ""provider"": ""nappy""}, {""name"": ""negative space"", ""provider"": ""nappy""}, {""name"": ""simple"", ""provider"": ""nappy""}, {""name"": ""blank"", ""provider"": ""nappy""}, {""name"": ""lightbulb"", ""provider"": ""nappy""}, {""name"": ""light"", ""provider"": ""nappy""}]"	false	2022-10-14 00:13:07.763443+00	false	JPG	photograph
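
For anyone inspecting rows like the sample above: the columns are tab-separated, and the meta_data and tags columns hold JSON wrapped in CSV-style quoting (inner double quotes are doubled). The following is only a rough sketch of reading such a row with Python's standard library. The shortened header and row are made up for illustration and are not the catalog's actual loading code.

```python
import csv
import io
import json

# Abbreviated, illustrative header and row in the same shape as the sample:
# tab-separated, with the JSON column quoted using doubled double-quotes.
SAMPLE = (
    "foreign_identifier\tlicense\ttitle\ttags\n"
    "62\tcc0\tWoman Standing by door\t"
    '"[{""name"": ""door"", ""provider"": ""nappy""}]"\n'
)


def read_rows(text):
    """Parse tab-separated text and decode the JSON in the tags column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # csv un-doubles the quotes, so the field is plain JSON by this point.
        row["tags"] = json.loads(row["tags"]) if row["tags"] else []
        rows.append(row)
    return rows


rows = read_rows(SAMPLE)
```

Here `rows[0]["tags"]` is a list of dicts with `name` and `provider` keys.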

TODO

  • Add tests (completed by @rwidom)
  • Add popularity (completed by @rwidom)

Popularity data

{
  "views": 25783,
  "downloads": 1050,
  "saves": 8
}

Testing Instructions

Run "just recreate", then run the nappy_workflow DAG through the Airflow UI. It should load 2,059 images pretty quickly.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@@ -13,26 +13,27 @@


# Default provider names
Member Author

I sorted this list alphabetically, but can revert if desired.

@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Oct 14, 2022
@zackkrida changed the title from "Add a Nappy provider DAG" to "Add a Nappy provider DAG using ProviderDataIngester" Oct 14, 2022
@zackkrida
Member Author

I'd love preliminary reviews on this, even while it's drafted, @WordPress/openverse-catalog

@zackkrida
Member Author

Ooh, and one small issue I'm observing. I'm not seeing the thumbnail_url being saved to the TSVs or database 🤔 Suggestions there would be appreciated!

@stacimc
Contributor

stacimc commented Oct 14, 2022

> I'm not seeing the thumbnail_url being saved to the TSVs or database 🤔 Suggestions there would be appreciated!

That's right -- we actually hardcode thumbnail_url to None in the ImageStore, in favor of using the thumbnail server. I think this recently caused us trouble with another provider -- maybe SMK? I'll remove thumbnail_url from the suggested fields in my documentation!

@zackkrida
Member Author

@stacimc yep, that's right regarding SMK. Maybe I'll stick thumbnail_url in meta_data in case we work out a hotfix on the API side that uses that? Otherwise I can just comment it out.

@stacimc
Contributor

stacimc commented Oct 14, 2022

It looks like putting it in meta_data is exactly what @obulat suggested for SMK here! That sounds like a good plan to me :)
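
A rough sketch of the plan discussed above: keep the provider's thumbnail URL inside meta_data instead of passing it as a top-level field, since the ImageStore discards that field. The input keys (`url`, `thumbnail_url`) and the function name are assumptions made for illustration; the real Nappy API response and ingester may differ.

```python
def build_record(data: dict) -> dict:
    """Build a record dict, keeping the thumbnail only inside meta_data.

    `data` stands in for one item of the provider response; its keys here
    are hypothetical.
    """
    meta_data = {}
    thumbnail = data.get("thumbnail_url")  # hypothetical API key
    if thumbnail:
        # Stored for possible later use; not sent as a top-level field.
        meta_data["thumbnail_url"] = thumbnail
    return {
        "foreign_landing_url": data.get("url"),  # hypothetical API key
        "meta_data": meta_data,
    }
```

Items without a thumbnail simply get an empty meta_data dict, so no null values are written.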

@AetherUnbound
Contributor

Oh sweet, there are thumbnails available for Rawpixel too but I wasn't sure where to add them either. I'll add that to #795!

@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Oct 17, 2022
@AetherUnbound
Contributor

Added! Oh, and looking over the PR description it seems like there are a few metrics we could use for popularity. Would you be willing to add those to the meta_data JSON as well? I suppose we could potentially add all 3 then decide on which metric to use after the fact (so we have all available if we wanted to switch). My instinct would be to use downloads as the primary metric if we had to pick one though.
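
To illustrate the "store all three, choose later" idea, here is a minimal sketch that reads one metric out of a meta_data dict shaped like the popularity data in the PR description. The function name and the default of `downloads` are assumptions that follow the suggestion in this comment; they are not the catalog's actual popularity code.

```python
from typing import Optional

# Shape taken from the popularity data shown in the PR description.
SAMPLE_META_DATA = {"views": 25783, "downloads": 1050, "saves": 8}


def popularity_value(meta_data: dict, metric: str = "downloads") -> Optional[int]:
    """Return the chosen metric if it is a plain integer, else None."""
    value = meta_data.get(metric)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None
```

Because every metric stays in meta_data, switching the primary metric later only means changing the `metric` argument.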

Contributor

@AetherUnbound left a comment

Thanks for starting this @zackkrida and for picking this up @rwidom! I have a number of suggested changes, mostly removing some files/comment lines. I also have a suggestion about batch size & popularity-related updates.

I ran this locally and it worked great!

Comment on lines 62 to 65
# Hardcoded to CC0, the only license Nappy.co uses
license_info = get_license_info(
    "https://creativecommons.org/publicdomain/zero/1.0/"
)
Contributor

If all results are CC0, we should set this as a value on the class or instance and use it there rather than calling this function for every record!

Collaborator

Oops! Totally, yes.
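
The suggestion in this thread could look roughly like the sketch below: resolve the CC0 license once as a class attribute and reuse it for every record. The `get_license_info` here is a stand-in stub that returns a plain dict and counts its calls; the real catalog helper has its own return type. The class name, the `get_record_data` body, and the `id` key are likewise assumptions.

```python
CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def get_license_info(url):
    """Stand-in for the catalog helper of the same name (hypothetical body)."""
    get_license_info.calls += 1
    return {"license": "cc0", "license_version": "1.0", "license_url": url}


get_license_info.calls = 0  # call counter, only to show the effect


class NappyIngesterSketch:
    # Resolved once when the class is defined and shared by all records.
    license_info = get_license_info(CC0_URL)

    def get_record_data(self, data: dict) -> dict:
        return {
            "foreign_identifier": data["id"],  # hypothetical API key
            "license_info": self.license_info,
        }


ingester = NappyIngesterSketch()
records = [ingester.get_record_data({"id": i}) for i in (62, 376, 3524)]
```

After processing three records, the stub has still been called only once.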

Comment on lines +87 to +91
meta_data = {
    "views": data.get("views"),
    "saves": data.get("saves"),
    "downloads": data.get("downloads"),
}
Contributor

Thanks for adding these! We'll also want to add downloads to the DDL and the image popularity metrics test!

Contributor

@AetherUnbound left a comment

Whew! This is almost good to go. @rwidom, would you be willing to move the _convert_filesize function to a static method and add tests for it as well? I can do that if that's easier 🙂

Contributor

@AetherUnbound left a comment

Excellent, thank you @rwidom and @zackkrida!

Comment on lines +141 to +149
pytest.param("4kB", 4_000, id="happy_kB"),
pytest.param("4MB", 4_000_000, id="happy_MB"),
pytest.param("4GB", 4_000_000_000, id="happy_GB"),
pytest.param("", None, id="empty_string"),
pytest.param([], None, id="not_a_string"),
pytest.param("gibberish", None, id="gibberish"),
pytest.param("10.3kB", 10_300, id="decimal"),
pytest.param("10.12345kB", 10_123, id="rounding"),
pytest.param(" 4 kB ", 4_000, id="extra_spaces"),
Contributor

chef's kiss
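
For readers without the diff at hand, the test cases quoted above pin down the expected behavior fairly tightly: decimal units (1 kB = 1,000 bytes), tolerance for surrounding and inner spaces, and None for anything unparseable. One possible implementation that satisfies those cases is sketched below. It is a reconstruction from the test parameters only, not the merged code, and the function and constant names are assumptions.

```python
import re
from decimal import Decimal

# Decimal (SI) multipliers, as implied by "4kB" -> 4_000 in the tests.
_UNIT_BYTES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
_FILESIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([kmg])b\s*$", re.IGNORECASE)


def convert_filesize(raw):
    """Return the size in bytes as an int, or None if `raw` cannot be parsed."""
    if not isinstance(raw, str):
        return None
    match = _FILESIZE_RE.match(raw)
    if match is None:
        return None
    number, unit = match.groups()
    # Decimal avoids binary floating-point surprises before rounding.
    return int(round(Decimal(number) * _UNIT_BYTES[unit.lower()]))
```

Using `Decimal` rather than `float` is a design choice of this sketch, made so that the rounding case is exact.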

@zackkrida zackkrida removed the request for review from stacimc January 3, 2023 23:38
Collaborator

@rwidom left a comment

I think these are ready, but do we need someone else to sign off, @zackkrida ?

@rwidom
Collaborator

rwidom commented Jan 16, 2023

> I think these are ready, but do we need someone else to sign off, @zackkrida ?

Oops, I guess not and that's why you removed the request from Staci! :)

@zackkrida
Member Author

@rwidom I removed the Staci request because I was going to review the PR. I have, and I think this is ready to merge! Thank you so much again for finishing the PR.

@zackkrida zackkrida added 💻 aspect: code Concerns the software code in the repository and removed 💻 aspect: code Concerns the software code in the repository labels Jan 16, 2023
@rwidom rwidom merged commit 709a466 into main Jan 16, 2023
@rwidom rwidom deleted the nappy-provider-dag branch January 16, 2023 12:37
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon
Development

Successfully merging this pull request may close these issues.

Nappy | Beautiful photos of Black and Brown people, for free.
6 participants