Image source: justtakeitfree.com #1264

Closed
1 task done
Aventurier opened this issue Mar 30, 2023 · 8 comments · Fixed by #2793
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed ☁️ provider: images Image provider 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@Aventurier

Source Site

https://justtakeitfree.com/

Value Provided

It's an independent project from a Ukrainian family. We host only our photos.

Licenses Provided

CC BY 4.0

Implementation

  • 🙋 I would be interested in implementing this feature.
@Aventurier Aventurier added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🧹 status: ticket work required Needs more details before it can be worked on labels Mar 30, 2023
@zackkrida zackkrida changed the title <Source name here> Image source: justtakeitfree.com Mar 30, 2023
@obulat obulat added the 🟩 priority: low Low priority and doesn't need to be rushed label Mar 31, 2023
@obulat
Contributor

obulat commented Mar 31, 2023

Thank you for the source suggestion, @Aventurier! Do you have an API that Openverse could use to get the images?

@dhruvkb
Member

dhruvkb commented Apr 6, 2023

Based on a little digging, the site does not have an API, but it does have clean markup that can be used to scrape it. The pages provide quite a bit of info (like tags), and all images credit "Justtakeitfree Free Photos" as the author. One thing that's missing is a title: none of the images are titled, they use only a numeric ID as the identifier, and in places like the HTML <title> a concatenation of all the tags is used.

I'm not aware of the catalog's scraping policy or whether a REST API is a requirement, but this site has a small collection of very high-quality images that might make a nice addition to our content.

@dhruvkb dhruvkb added 🌟 goal: addition Addition of new feature 🧱 stack: catalog Related to the catalog and Airflow DAGs and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 6, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@obulat obulat added ☁️ provider: images Image provider 💻 aspect: code Concerns the software code in the repository labels Jun 19, 2023
@sarayourfriend sarayourfriend removed the 🧹 status: ticket work required Needs more details before it can be worked on label Jun 27, 2023
@sarayourfriend
Collaborator

Without a response from @Aventurier regarding the API, I think we should plan to scrape. There are currently 6 pages of results and around 178 results (based on https://justtakeitfree.com/photo/178/ existing and anything beyond that, like https://justtakeitfree.com/photo/179/ and https://justtakeitfree.com/photo/180/, returning a 404, though https://justtakeitfree.com/photo/1/ also 404s). If the DAG requested one photo page every two seconds, the roughly 178 requests would take only around 6 minutes to ingest the entire provider. We could do that monthly to reduce the impact of the scraping.

Seems doable and I think the assumption that we can scrape is safe considering the volume and lack of ToS.

As for the title to use, the site itself appears to use the first tag in the list of tags for the filename when you click the download link. We can do the same.

Regarding the attribution, the author should be, as Dhruv mentioned, "Justtakeitfree Free Photos". Based on the text of the issue ("We host only our photos"), it sounds like that's an appropriate attribution to credit the creators. I dumped EXIF on one of the images and there is nothing to suggest otherwise.

So to clarify the DAG implementation:

  • Make iterative requests to https://justtakeitfree.com/photo/<i>/, incrementing i once every two seconds. Go until 10 404s are returned in a row.
  • DAG should run monthly
  • The license is CC-BY 4.0 for all works
  • Creator is "Justtakeitfree Free Photos" for all works.
  • Provider/source is "justtakeitfree.com"
  • Foreign landing URL is https://justtakeitfree.com/photo/<i>/ for the work
  • URL is https://justtakeitfree.com/photos/<i>_800.jpg for the work (full sized is available on the landing page)
  • No special thumbnail URL, the _800.jpg version is used as the thumbnail on the site
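
For illustration, the crawl described in the list above could be sketched roughly as follows. This is only a minimal sketch, not the catalog's actual provider script: the record field names, the `fetch_status` callable, and the function names are all hypothetical, and the HTTP call is injected so the loop can be tried without touching the site.

```python
import time
from typing import Callable, Iterator

BASE_URL = "https://justtakeitfree.com"


def build_record(photo_id: int) -> dict:
    """Build one record from the fixed values listed above.

    The dictionary keys are illustrative names, not a confirmed schema.
    """
    return {
        "foreign_identifier": str(photo_id),
        "foreign_landing_url": f"{BASE_URL}/photo/{photo_id}/",
        "url": f"{BASE_URL}/photos/{photo_id}_800.jpg",
        "license": "CC BY 4.0",
        "creator": "Justtakeitfree Free Photos",
        "provider": "justtakeitfree.com",
    }


def crawl(
    fetch_status: Callable[[int], int],
    max_misses: int = 10,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Request photo pages by incrementing ID until `max_misses` 404s in a row.

    `fetch_status` is a hypothetical callable that takes a photo ID and
    returns the HTTP status code of its landing page.
    """
    misses = 0
    photo_id = 1
    while misses < max_misses:
        status = fetch_status(photo_id)
        if status == 200:
            misses = 0  # a hit resets the run of consecutive 404s
            yield build_record(photo_id)
        elif status == 404:
            misses += 1
        else:
            raise RuntimeError(f"Unexpected status {status} for photo {photo_id}")
        photo_id += 1
        sleep(delay)  # wait between requests to limit load on the site
```

Counting consecutive 404s, rather than stopping at the first one, tolerates gaps in the ID sequence such as the missing /photo/1/ page.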

@Aventurier
Author

I'm sorry for the long pause. I've actually built a small API that can search for images by tag and retrieve information about an image.
https://justtakeitfree.com/api/api.php?key=vj45mub435v6bsdf90&query=search&tag=grass
This is an example of how to find images by tag (you can use the key in the example).
If you have any questions or want me to extend the API, please write.
Thanks.

@sarayourfriend
Collaborator

No worries, Aventurier! Thanks for letting us know. Can you share whether there is a way to paginate through the API? For Openverse's catalogue to be able to get all the images, we'd need to be able to use the API to paginate through all the images rather than for just particular tags. Something like:

https://justtakeitfree.com/api/api.php?page=1
https://justtakeitfree.com/api/api.php?page=2
https://justtakeitfree.com/api/api.php?page=3
https://justtakeitfree.com/api/api.php?page=4

etc., without any query terms.

Is the email on the privacy policy page the best location to get in touch regarding a key specifically for Openverse (to avoid the secret leaking publicly)?

@Aventurier
Author

Done
https://justtakeitfree.com/api/api.php?key=vj45mub435v6bsdf90&page=1
https://justtakeitfree.com/api/api.php?key=vj45mub435v6bsdf90&page=2

Please leave me your email address. I will send a new key and then delete this one.
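
For anyone following along, walking the paginated endpoint above could look roughly like the sketch below. The response format is not described in this thread, so the sketch assumes each page decodes to a list of items and that a page past the end comes back empty. `fetch_page`, `iter_all_images`, and the `max_pages` guard are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterator, List
from urllib.parse import urlencode

API_ENDPOINT = "https://justtakeitfree.com/api/api.php"


def page_url(page: int, key: str) -> str:
    """Build the URL for one page, using the `key` and `page` parameters shown above."""
    return f"{API_ENDPOINT}?{urlencode({'key': key, 'page': page})}"


def iter_all_images(
    fetch_page: Callable[[str], List[dict]],
    key: str,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield every item, page by page, stopping at the first empty page.

    `fetch_page` is a hypothetical callable that requests a URL and returns
    the decoded list of items. Treating an empty page as the end of the
    collection is an assumption, not documented API behaviour.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page_url(page, key))
        if not items:
            return
        yield from items
```

The key is passed in as an argument so that a private key can be supplied at run time instead of being written into the code.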

@zackkrida
Member

@Aventurier amazing! Thank you so much. You can email us at [email protected] with a new key.

@krysal
Member

krysal commented Jul 17, 2023

API key received. Thank you, @Aventurier!

@AetherUnbound AetherUnbound moved this from 📋 Backlog to 🏗 In progress in Openverse Backlog Aug 17, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Openverse Backlog Sep 12, 2023