robots.txt is cute but 404 #189
Comments
Why?
It is not necessary, but I thought it would be useful. I used wget to crawl our websites, and wget by default attempts to download and respect robots.txt, which doesn't exist for most of our websites. It does exist for ubuntu.com: https://ubuntu.com/robots.txt => https://ubuntu.com/static/files/robots.txt seems to exist and serves a search exclude rule and a sitemap. https://snapcraft.io/robots.txt also exists but appears empty. For most of our other domains it does not exist. I somewhat expected at least a sitemap stanza to be available for all of our human-facing websites (e.g. maas.io). As for assets.ubuntu.com, I expected it to serve things with a 'noindex' header, or with a robots.txt denying everything, given nothing should need to crawl assets.ubuntu.com?! Or maybe there is indeed no reason to add more robots.txt files beyond the ones we have deployed on ubuntu.com & snapcraft.io.
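To make the options in the comment above concrete, here is a minimal sketch of how a crawler that respects robots.txt might interpret three cases: a deny-all file (the idea floated for assets.ubuntu.com), a file with only a sitemap stanza, and an empty file (as snapcraft.io appears to serve). It uses Python's standard `urllib.robotparser` rather than wget, so wget's exact behaviour may differ. The file contents, the sitemap URL, and the `/v1/example.png` path are all hypothetical illustrations, not what any of these sites actually serve.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical deny-all file, as suggested for assets.ubuntu.com.
DENY_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical file that allows everything and only advertises a sitemap.
# The sitemap URL is an assumption made for illustration.
SITEMAP_ONLY = """\
User-agent: *
Disallow:
Sitemap: https://maas.io/sitemap.xml
"""

# An empty file, which is how snapcraft.io's robots.txt appeared.
EMPTY = ""

if __name__ == "__main__":
    asset_url = "https://assets.ubuntu.com/v1/example.png"  # hypothetical path
    print(parse_robots(DENY_ALL).can_fetch("*", asset_url))
    print(parse_robots(SITEMAP_ONLY).can_fetch("*", "https://maas.io/docs"))
    print(parse_robots(SITEMAP_ONLY).site_maps())
    print(parse_robots(EMPTY).can_fetch("*", "https://snapcraft.io/"))
```

In this sketch an empty file restricts nothing, so for crawl permissions it behaves the same as having no file at all; the only thing a robots.txt would add on the human-facing sites is the sitemap pointer.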
It's very clear that you don't need a We could set a standard that everything should have a Unless there's a significant reason why
Actually there were a couple of other points there worth responding to:
I don't understand why you would think nothing needs to crawl assets.ubuntu.com? The web is generally open and crawlable - it is not a common practice that I'm aware of to go denying crawling of anything that we can't explicitly think of a reason to share. And this feels like an especially weird stance for an open source company to take. We have no problem with people searching Google for our images and PDFs on assets.ubuntu.com, and in some cases I'm sure it's absolutely desired.
This is a better point, although almost completely unrelated to
The 404 page is very cute.
but should it exist?
https://assets.ubuntu.com/robots.txt