Add (or update) robots.txt and ai.txt to block AI crawlers #4077

sarayourfriend · 2024-04-09T03:59:57Z

Fixes

Description

Following the advice of the blog post Zack linked (https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/), I've added or updated robots.txt and added ai.txt accordingly, to block all crawlers.

As discussed in the blog post and in #3916, this isn't a perfect solution, and there are malicious, bad-faith actors who ignore robots.txt. We'll address those using other tools at our disposal.
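As a rough illustration of how a well-behaved crawler interprets such a file, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser`. The file contents, the user-agent names and the example.org URL are assumptions chosen for illustration; they are not the rules added in this PR.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# The actual rules added in this PR are not reproduced here.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A crawler that honours robots.txt asks this question before each fetch.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.org/some/page")
other_allowed = parser.can_fetch("SomeOtherAgent", "https://example.org/some/page")

print(gptbot_allowed, other_allowed)
```

This only models compliant clients. A crawler that never consults robots.txt is unaffected, which is why the description treats this as a partial measure.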

Docs preview pages:

Testing Instructions

Visit /robots.txt and /ai.txt on each of:

the API: run just api/up. Note that robots.txt is no longer incorrectly redirected to robots.txt/ with a trailing slash. Semantically, a trailing slash makes no sense for a document resource, and in the worst case it gives plausible deniability to not-entirely-bad-but-still-sketchy actors who won't follow a redirect on robots.txt.
the documentation site: run just documentation/live and visit each
the frontend: run just p frontend dev and check each
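The manual checks above could also be scripted. The following is a minimal sketch, not part of this PR: it requests /robots.txt and /ai.txt from each base URL passed on the command line and reports whether the response is a direct 200 or a redirect. The helper names `classify` and `probe` are hypothetical, and the base URLs are left to the caller because the local ports depend on your setup.

```python
import http.client
import sys
from typing import Optional
from urllib.parse import urlsplit

PATHS = ("/robots.txt", "/ai.txt")


def classify(status: int, location: Optional[str]) -> str:
    """Summarise a response: direct hit, redirect, or anything else."""
    if 300 <= status < 400:
        return f"redirect -> {location}"
    if status == 200:
        return "ok"
    return f"unexpected status {status}"


def probe(base_url: str, path: str) -> str:
    """Request one path. http.client never follows redirects on its own,
    so a trailing-slash redirect shows up as a 3xx status."""
    parts = urlsplit(base_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return classify(response.status, response.getheader("Location"))
    finally:
        conn.close()


if __name__ == "__main__":
    # Base URLs come from the command line, e.g. http://localhost:<port>
    for base in sys.argv[1:]:
        for path in PATHS:
            print(base, path, probe(base, path))
```

Run it with one base URL per locally running service. Any line that reports a redirect for robots.txt would indicate the trailing-slash behaviour described above.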

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
[N/A] I added or updated tests for the changes I made (if applicable).
[N/A] I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
[N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

github-actions · 2024-04-09T04:14:22Z

Full-stack documentation: https://docs.openverse.org/_preview/4077

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

zackkrida

Great! I really appreciate that you included all of our public sites; I totally overlooked that and only considered the frontend. I'd like to try setting Crawl-delay as well, but we could do that in a subsequent PR, as I'm not yet sure of an ideal value in seconds (perhaps 30).
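For reference, here is a small sketch of how a Crawl-delay directive is read by a client that supports it, using Python's standard `urllib.robotparser`. The sample file and the `ExampleBot` name are assumptions for illustration; the value 30 is only the tentative figure mentioned above, and not every crawler honours this directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt fragment; 30 is the tentative value floated above.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Seconds a compliant client should wait between requests,
# or None if no delay applies to this user agent.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```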

krysal

LGTM!

sarayourfriend added 3 commits April 9, 2024 13:50

Add robots.txt and ai.txt to documentation site

f50dfd1

Add ai.txt and remove trailing slash from robots.txt

27e9493

Add robots.txt and ai.txt to frontend

08aa26d

sarayourfriend requested review from a team as code owners April 9, 2024 03:59

sarayourfriend requested review from krysal, dhruvkb and zackkrida April 9, 2024 03:59

github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: documentation Related to Sphinx documentation 🧱 stack: frontend Related to the Nuxt frontend labels Apr 9, 2024

sarayourfriend added 🕹 aspect: interface Concerns end-users' experience with the software and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Apr 9, 2024

openverse-bot added the 🗄️ aspect: data Concerns the data in our catalog and/or databases label Apr 9, 2024

zackkrida approved these changes Apr 9, 2024

krysal approved these changes Apr 11, 2024

sarayourfriend merged commit b2f3fd6 into main Apr 16, 2024
90 checks passed

sarayourfriend deleted the add/robots.txt branch April 16, 2024 02:35

zackkrida mentioned this pull request Apr 23, 2024

Fix frontend robots.txt #4186

Merged

8 tasks
