Add (or update) robots.txt and ai.txt to block AI crawlers #4077

Merged
merged 3 commits into from
Apr 16, 2024

Conversation

Collaborator

@sarayourfriend sarayourfriend commented Apr 9, 2024

Fixes

Fixes #3900 by @zackkrida

Description

Following the advice of the blog post Zack linked (https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/), I've added or updated robots.txt and added ai.txt accordingly, to block all crawlers.

As discussed in the blog post and in #3916, this isn't a perfect solution, and there are malicious, bad-faith actors who ignore robots.txt. We'll address those using other tools at our disposal.
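For reviewers who want to sanity-check rules of this kind, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample rules are an assumption for illustration: `GPTBot` and `CCBot` are real crawler tokens, but the agents actually listed in the files merged here are not shown in this description, and `example.org` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the files added in this PR may list different agents.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A listed AI crawler is denied everywhere; an unlisted agent is unaffected.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.org/any/page")
browser_allowed = parser.can_fetch("Mozilla/5.0", "https://example.org/any/page")
```

This only models crawlers that choose to honor robots.txt; it says nothing about the bad-faith actors mentioned above.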

Docs preview pages:

Testing Instructions

Visit /robots.txt and /ai.txt on each of:

  • the API, run just api/up. Note that robots.txt is no longer incorrectly redirected to robots.txt/ with a trailing slash. Semantically, a trailing slash makes no sense for a document resource, and in the worst case it gives plausible deniability to not-entirely-bad-but-still-sketchy actors who won't follow a redirect on robots.txt.
  • the documentation site, run just documentation/live and visit each
  • the frontend, run just p frontend dev and check each
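The manual checks above could also be scripted. The following is a rough sketch, not part of this PR: the function names are hypothetical, and the base URL for each service is an assumption you would need to fill in, since the local addresses are not stated here. It uses `http.client`, which does not follow redirects, so a redirect to `robots.txt/` shows up as a failure rather than being silently followed.

```python
import http.client
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit

# The two documents this PR adds or updates on each site.
CRAWLER_POLICY_PATHS = ("/robots.txt", "/ai.txt")


def served_directly(status: int, location: Optional[str]) -> bool:
    """True when the response is a plain 200 with no redirect target."""
    return status == 200 and not location


def fetch_status(base_url: str, path: str) -> Tuple[int, Optional[str]]:
    """Request one path; http.client never follows redirects on its own."""
    parts = urlsplit(base_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def check_site(base_url: str) -> Dict[str, bool]:
    """Map each policy path to whether it was served without a redirect."""
    return {
        path: served_directly(*fetch_status(base_url, path))
        for path in CRAWLER_POLICY_PATHS
    }
```

To use it, call `check_site(...)` with whatever address each locally running service listens on; every value in the returned dictionary should be `True`.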

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • [N/A] I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested review from a team as code owners April 9, 2024 03:59
@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: documentation Related to Sphinx documentation 🧱 stack: frontend Related to the Nuxt frontend labels Apr 9, 2024
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 🗄️ aspect: data Concerns the data in our catalog and/or databases 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 9, 2024
@sarayourfriend sarayourfriend added 🕹 aspect: interface Concerns end-users' experience with the software and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Apr 9, 2024

github-actions bot commented Apr 9, 2024

Full-stack documentation: https://docs.openverse.org/_preview/4077

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@openverse-bot openverse-bot added the 🗄️ aspect: data Concerns the data in our catalog and/or databases label Apr 9, 2024
Member

@zackkrida zackkrida left a comment

Great! I really appreciate including all of our public sites, I totally overlooked that and only considered the frontend. I'd like to try setting Crawl-delay as well but we could do that in a subsequent PR as I'm not exactly sure of an ideal seconds value yet (perhaps 30).
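As a rough illustration of the `Crawl-delay` idea, the sketch below parses a sample rule with Python's standard `urllib.robotparser` and reads the delay back. The value 30 is only the tentative figure floated above, `ExampleBot` is a hypothetical agent name, and `Crawl-delay` is a non-standard directive that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set; 30 seconds is the tentative value, not a decision.
SAMPLE_WITH_DELAY = """\
User-agent: *
Crawl-delay: 30
"""

delay_parser = RobotFileParser()
delay_parser.parse(SAMPLE_WITH_DELAY.splitlines())

# Any agent without its own entry falls back to the "*" entry.
delay_seconds = delay_parser.crawl_delay("ExampleBot")
```

A well-behaved crawler that supports the directive would wait `delay_seconds` between requests; crawlers that ignore it are unaffected.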

Member

@krysal krysal left a comment

LGTM!

@sarayourfriend sarayourfriend merged commit b2f3fd6 into main Apr 16, 2024
90 checks passed
@sarayourfriend sarayourfriend deleted the add/robots.txt branch April 16, 2024 02:35
@zackkrida zackkrida mentioned this pull request Apr 23, 2024
8 tasks
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🕹 aspect: interface Concerns end-users' experience with the software ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: documentation Related to Sphinx documentation 🧱 stack: frontend Related to the Nuxt frontend
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

robots.txt for AI-related crawlers and bots
4 participants