Add (or update) robots.txt and ai.txt to block AI crawlers #4077
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4077

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
Great! I really appreciate including all of our public sites; I totally overlooked that and only considered the frontend. I'd like to try setting `Crawl-delay` as well, but we could do that in a subsequent PR, as I'm not exactly sure of an ideal seconds value yet (perhaps 30).
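For reference, a `Crawl-delay` stanza would be a small addition to `robots.txt`. This is a sketch only: the 30-second value is just the guess floated above, and `Crawl-delay` is a non-standard directive that some major crawlers (Googlebot among them) ignore.

```
# Hypothetical follow-up; the delay value (in seconds) is still undecided.
User-agent: *
Crawl-delay: 30
```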
LGTM!
Fixes
Fixes #3900 by @zackkrida
Description
Following the advice of the blog post Zack linked (https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/), I've added or updated `robots.txt` and added `ai.txt` accordingly, to block all AI crawlers.
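The rules follow the pattern the linked post recommends: one `User-agent` group per crawler, each fully disallowed. A minimal sketch, assuming a few of the AI crawler tokens the post discusses (not necessarily the exact list in this PR):

```
# Illustrative only; the user-agent tokens below are common AI crawlers
# covered by the linked post, not necessarily this PR's exact list.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```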
As discussed in the blog post and in #3916, this isn't a perfect solution, and there are malicious, bad-faith actors who ignore robots.txt. We'll address those using other tools at our disposal.
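To make the "advisory only" point concrete: robots.txt has no enforcement mechanism; compliant crawlers simply check it before fetching, as in this sketch using Python's standard-library parser (the robots.txt content and URLs here are hypothetical, not the PR's actual files):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking one AI crawler while allowing
# everything else (illustrative; not this PR's exact file).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler identifying as GPTBot is denied...
print(parser.can_fetch("GPTBot", "https://example.org/search"))       # False
# ...while other user agents remain allowed.
print(parser.can_fetch("SomeBrowser", "https://example.org/search"))  # True
```

A bad-faith crawler just skips this check entirely, which is why rate limiting and blocking at other layers remain necessary.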
Docs preview pages:
Testing Instructions
Visit
/robots.txt
and/ai.txt
on each of:just api/up
. Note thatrobots.txt
is no longer incorrectly redirected torobots.txt/
with a trailing slash. semantically the trailing slash makes no sense for a document resource, and in the worst case scenario is plausible deniability for not-entirely-bad-but-still-sketchy actors who won't follow a redirect onrobots.txt
.just documentation/live
and visit eachjust p frontend dev
and check eachChecklist
- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the *default* branch of the repository (`main`) or a parent feature branch.

Developer Certificate of Origin