robots.txt is cute but 404 #189

Closed

xnox opened this issue Feb 16, 2022 · 4 comments

@xnox commented Feb 16, 2022

The 404 page is very cute.

But should it exist?

https://assets.ubuntu.com/robots.txt

@nottrobin (Contributor)

Why?

@xnox (Author) commented Feb 21, 2022

It is not necessary, but I thought it would be useful. I used wget to crawl our websites, and wget by default attempts to download and respect robots.txt, which doesn't exist for most of our websites.

It does exist for some: https://ubuntu.com/robots.txt (=> https://ubuntu.com/static/files/robots.txt) serves a search-exclude rule and a sitemap. https://snapcraft.io/robots.txt also exists but appears empty. For most of our other domains it does not exist at all.
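(As an aside, here is a minimal Python sketch of the check a polite crawler such as wget performs before fetching a page, using the standard library's robots.txt parser. The rules below are hypothetical, loosely shaped like the exclude-rule-plus-sitemap file described above, not the actual contents of ubuntu.com's file.)

```python
# Minimal sketch: how a polite crawler consults robots.txt before
# fetching pages. The rules below are hypothetical, not the real
# contents of https://ubuntu.com/robots.txt.
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /search
Sitemap: https://ubuntu.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Each URL is checked against the rules before it is requested.
print(rp.can_fetch("*", "https://ubuntu.com/download"))  # True
print(rp.can_fetch("*", "https://ubuntu.com/search"))    # False
print(rp.site_maps())  # ['https://ubuntu.com/sitemap.xml']
```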

I somewhat expected at least a sitemap stanza to be available for all of our human-facing websites (e.g. maas.io). And I expected assets.ubuntu.com to serve things with a 'noindex' header, or a robots.txt denying everything, given that nothing should need to crawl assets.ubuntu.com.

Or maybe there is indeed no reason to add more robots.txt files beyond the ones we have already deployed on ubuntu.com and snapcraft.io.
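(For concreteness, the "deny everything" variant suggested above is only two lines; the header alternative would be serving X-Robots-Tag: noindex on each response. The sketch below is hypothetical, not anything actually deployed on assets.ubuntu.com.)

```python
# Hypothetical "deny everything" robots.txt of the kind suggested
# above for assets.ubuntu.com -- nothing like this is deployed.
from urllib import robotparser

DENY_ALL = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(DENY_ALL.splitlines())

# Every path is blocked for every crawler that honours the file.
print(rp.can_fetch("*", "https://assets.ubuntu.com/any/file.pdf"))  # False
```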

@nottrobin (Contributor) commented Feb 25, 2022

You don't need a robots.txt to allow crawling of pages; the default is that sites are crawlable. In fact, the official robots.txt specification requires every robots.txt file to contain a disallow: property (although its value is allowed to be empty), and to have an allow: only where it needs to override the disallow:.
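(A sketch of that allow-overrides-disallow behaviour, with hypothetical paths. One caveat: Python's standard-library parser applies rules in listed order, first match wins, so the more specific allow: is placed first here; crawlers following RFC 9309 instead pick the longest matching rule.)

```python
# Sketch of allow: overriding disallow:, with hypothetical paths.
# Python's robotparser uses first-match-wins, so the more specific
# Allow line is listed first; RFC 9309 crawlers use longest-match.
from urllib import robotparser

RULES = """\
User-agent: *
Allow: /private/press-kit/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "https://example.com/private/notes.txt"))        # False
print(rp.can_fetch("*", "https://example.com/private/press-kit/a.pdf"))  # True
```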

We could set a standard that everything should have a robots.txt, whether or not it needs to disallow any pages, but this feels like unnecessary work and over-engineering. The only argument for it that I can think of is to avoid sending (and presumably logging) 404s. But why? 404 is just a status, man, just like 200.

Unless there's a significant reason why robots.txt 404s are costly, I think we should not overcomplicate things and simply let sites omit robots.txt when they have nothing to disallow.

@nottrobin (Contributor) commented Feb 26, 2022

Actually, there were a couple of other points worth responding to:

> And I expected assets.ubuntu.com to serve things with a 'noindex' header, or a robots.txt denying everything, given that nothing should need to crawl assets.ubuntu.com.

I don't understand why you would think nothing needs to crawl assets.ubuntu.com. The web is generally open and crawlable; it is not common practice, as far as I'm aware, to deny crawling of anything we can't explicitly think of a reason to share. That feels like an especially odd stance for an open-source company to take. We have no problem with people finding our images and PDFs from assets.ubuntu.com via Google, and in some cases I'm sure that's actively desired.

> I somewhat expected at least a sitemap stanza to be available for all of our human-facing websites (e.g. maas.io).

This is a better point, although almost completely unrelated to robots.txt. Yes, every public site should probably have a sitemap.xml. I've filed canonical/maas.io#713. Do let me know if you come across other sites that should have one and don't. (I don't think assets.ubuntu.com needs one.)
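(For reference, a sitemap.xml can be tiny. Below is a sketch that generates a minimal one with only the standard library; the maas.io URLs are placeholders, not the site's real page list. A robots.txt can then advertise the file with a Sitemap: line, as ubuntu.com's does.)

```python
# Minimal sketch of generating a sitemap.xml with the standard
# library. The URLs are placeholders, not maas.io's real page list.
import xml.etree.ElementTree as ET

PAGES = [
    "https://maas.io/",
    "https://maas.io/docs",
    "https://maas.io/tutorials",
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in PAGES:
    ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```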
