robots.txt is cute but 404 #189

Closed

xnox opened this issue Feb 16, 2022 · 4 comments

@xnox commented Feb 16, 2022

The 404 page is very cute.

But should it exist?

https://assets.ubuntu.com/robots.txt

@nottrobin (Contributor)

Why?

@xnox (Author) commented Feb 21, 2022

It is not necessary, but I thought it would be useful. I used wget to crawl our websites, and wget by default attempts to download and respect robots.txt, which doesn't exist for most of our websites.

It does exist for some: https://ubuntu.com/robots.txt (=> https://ubuntu.com/static/files/robots.txt) serves a search-exclude rule and a sitemap. https://snapcraft.io/robots.txt also exists but appears empty. For most of our other domains it does not exist at all.
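(As an aside, here is a minimal Python sketch of the check a polite crawler such as wget performs before fetching a page, using the standard library's robots.txt parser. The rules below are hypothetical, loosely shaped like the exclude-rule-plus-sitemap file described above, not the actual contents of ubuntu.com's file.)

```python
# Minimal sketch: how a polite crawler consults robots.txt before
# fetching pages. The rules below are hypothetical, not the real
# contents of https://ubuntu.com/robots.txt.
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /search
Sitemap: https://ubuntu.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Each URL is checked against the rules before it is requested.
print(rp.can_fetch("*", "https://ubuntu.com/download"))  # True
print(rp.can_fetch("*", "https://ubuntu.com/search"))    # False
print(rp.site_maps())  # ['https://ubuntu.com/sitemap.xml']
```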

I somewhat expected at least a sitemap stanza to be available for all of our human-facing websites (e.g. maas.io). And I expected assets.ubuntu.com to serve things with a 'noindex' header, or a robots.txt denying everything, given that nothing should need to crawl assets.ubuntu.com.

Or maybe there is indeed no reason to add more robots.txt files beyond the ones we have already deployed on ubuntu.com and snapcraft.io.
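(For concreteness, the "deny everything" variant suggested above is only two lines; the header alternative would be serving X-Robots-Tag: noindex on each response. The sketch below is hypothetical, not anything actually deployed on assets.ubuntu.com.)

```python
# Hypothetical "deny everything" robots.txt of the kind suggested
# above for assets.ubuntu.com -- nothing like this is deployed.
from urllib import robotparser

DENY_ALL = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(DENY_ALL.splitlines())

# Every path is blocked for every crawler that honours the file.
print(rp.can_fetch("*", "https://assets.ubuntu.com/any/file.pdf"))  # False
```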

@nottrobin (Contributor) commented Feb 25, 2022

You don't need a robots.txt to allow crawling of pages; the default is that sites are crawlable. In fact, the official robots.txt specification requires every robots.txt file to contain a disallow: property (although its value is allowed to be empty), and to have an allow: only where it needs to override the disallow:.
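(A sketch of that allow-overrides-disallow behaviour, with hypothetical paths. One caveat: Python's standard-library parser applies rules in listed order, first match wins, so the more specific allow: is placed first here; crawlers following RFC 9309 instead pick the longest matching rule.)

```python
# Sketch of allow: overriding disallow:, with hypothetical paths.
# Python's robotparser uses first-match-wins, so the more specific
# Allow line is listed first; RFC 9309 crawlers use longest-match.
from urllib import robotparser

RULES = """\
User-agent: *
Allow: /private/press-kit/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "https://example.com/private/notes.txt"))        # False
print(rp.can_fetch("*", "https://example.com/private/press-kit/a.pdf"))  # True
```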

We could set a standard that everything should have a robots.txt, whether or not it needs to disallow any pages, but this feels like unnecessary work and over-engineering. The only argument for it that I can think of is to avoid sending (and presumably logging) 404s. But why? 404 is just a status, man, just like 200.

Unless there's a significant reason why robots.txt 404s are costly, I think we should not overcomplicate things and simply let sites omit robots.txt when they have nothing to disallow.

@nottrobin (Contributor) commented Feb 26, 2022

Actually, there were a couple of other points worth responding to:

> And I expected assets.ubuntu.com to serve things with a 'noindex' header, or a robots.txt denying everything, given that nothing should need to crawl assets.ubuntu.com.

I don't understand why you would think nothing needs to crawl assets.ubuntu.com. The web is generally open and crawlable; it is not common practice, as far as I'm aware, to deny crawling of anything we can't explicitly think of a reason to share. That feels like an especially odd stance for an open-source company to take. We have no problem with people finding our images and PDFs from assets.ubuntu.com via Google, and in some cases I'm sure that's actively desired.

> I somewhat expected at least a sitemap stanza to be available for all of our human-facing websites (e.g. maas.io).

This is a better point, although almost completely unrelated to robots.txt. Yes, every public site should probably have a sitemap.xml. I've filed canonical/maas.io#713. Do let me know if you come across other sites that should have one and don't. (I don't think assets.ubuntu.com needs one.)
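(For reference, a sitemap.xml can be tiny. Below is a sketch that generates a minimal one with only the standard library; the maas.io URLs are placeholders, not the site's real page list. A robots.txt can then advertise the file with a Sitemap: line, as ubuntu.com's does.)

```python
# Minimal sketch of generating a sitemap.xml with the standard
# library. The URLs are placeholders, not maas.io's real page list.
import xml.etree.ElementTree as ET

PAGES = [
    "https://maas.io/",
    "https://maas.io/docs",
    "https://maas.io/tutorials",
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in PAGES:
    ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```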
