
Add crawl delay #3027

Merged

pgwillia merged 7 commits from robots_site_delay into master on Mar 14, 2023
Conversation

pgwillia (Member)

Context

Once upon a time megaindex.com/crawler caused database use headaches. Let's suggest that crawlers that might not be as smart as Google chill a bit.

https://www.semrush.com/blog/beginners-guide-robots-txt/#crawl-delay-directive suggests that Google will ignore this but that Bing, Yandex and our new friend Megaindex.ru will respect this.

I used Google's robots.txt Tester to evaluate. It gives a warning about the crawl-delay directive, just letting us know it will be ignored by Googlebot. https://support.google.com/webmasters/answer/48620?hl=en describes how to limit Googlebot if necessary.

[image: Google's robots.txt Tester showing the crawl-delay warning]

What's New

Add crawl-delay to our robots.txt
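
For reference, with crawlers allowed the served robots.txt would then contain a global group along these lines (a sketch of the change as first proposed, based on the diff and the 5-second value discussed below; later commits in this PR scope the delay to specific agents):

# Once upon a time megaindex.com/crawler caused database use headaches
# Let's suggest that crawlers that might not be as smart as Google chill a bit
User-agent: *
Crawl-delay: 5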

ConnorSheremeta (Contributor) left a comment


I am not too knowledgeable on the particulars of how sites interpret robots.txt, so I may be wrong, but I think that if Rails.application.secrets.allow_crawlers is false, only one of the two rules (Crawl-delay: 5 or Disallow: /) could apply, and it's unclear which one will, potentially making a bot think it is able to crawl.

From https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt:

Only one group is valid for a particular crawler ... The order of the groups within the robots.txt file is irrelevant.

If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.

Also just a small nit regarding the changelog entry missing a link to a PR/issue.
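
To make the concern concrete: with allow_crawlers set to false, the file as first proposed would presumably contain two groups aimed at the same agents, roughly like the sketch below (a hypothetical rendering; the full template isn't shown in this PR view). Whether a given crawler merges these rules or picks one group is exactly the ambiguity being raised.

User-agent: *
Crawl-delay: 5

User-agent: *
Disallow: /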

pgwillia (Member, Author)

I am not too knowledgeable on the particulars of how sites interpret robots.txt, so I may be wrong, but I think that if Rails.application.secrets.allow_crawlers is false, only one of the two rules (Crawl-delay: 5 or Disallow: /) could apply, and it's unclear which one will, potentially making a bot think it is able to crawl.

From https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt:

Only one group is valid for a particular crawler ... The order of the groups within the robots.txt file is irrelevant.

If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.

Also just a small nit regarding the changelog entry missing a link to a PR/issue.

Thanks for your thoughtful review. I can see that I rushed this.

I added the Changelog link. I also added conditional logic so that only one Agent block is present.

This got me thinking if it would be worthwhile naming the agents we want to crawl more slowly.
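
A minimal sketch of what the conditional described above (only one Agent block rendered at a time) could look like in the robots.txt ERB template; the comments and exact wording here are assumptions, and only the if/else shape is what's being described:

<% if Rails.application.secrets.allow_crawlers %>
# Crawling allowed, but ask bots to slow down
User-agent: *
Crawl-delay: 5
<% else %>
# Crawling disabled entirely
User-agent: *
Disallow: /
<% end %>

With this shape a crawler only ever sees one User-agent: * group, so there is no question of which rule wins.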

ConnorSheremeta (Contributor) left a comment


LGTM!

ConnorSheremeta (Contributor) commented Dec 23, 2022


Might be worthwhile restricting based on the agent. Restricting it to a request every 5 seconds should help a ton, as one IP alone this morning was making about two requests per second.

Commits added to the PR, with messages including:

If enabled, let the agents know to crawl slowly. If disabled, let the agents know not to crawl at all.

MegaIndex.ru was our problem trigger. We can add more if/when we experience them.
@@ -3,7 +3,7 @@
<% if Rails.application.secrets.allow_crawlers %>
# Once upon a time megaindex.com/crawler caused database use headaches
# Let's suggest that crawlers that might not be as smart as Google chill a bit
User-agent: *
ConnorSheremeta (Contributor) commented on this diff

I feel like it would be worthwhile to still have a general crawl-delay rule for all user-agents, just in case we get bombarded with requests in the future from a bot that would've otherwise followed the rule. Maybe a delay that's just more relaxed than the one for MegaIndex.ru?
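
One possible shape for that (a sketch only: the 2-second value is illustrative, and MegaIndex.ru is assumed to be the user-agent token used in the template):

User-agent: *
Crawl-delay: 2

User-agent: MegaIndex.ru
Crawl-delay: 5

Per the Google documentation quoted earlier, agent-specific groups and the global * group are not combined, so the MegaIndex.ru crawler would follow only its own, stricter delay.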

ConnorSheremeta (Contributor) left a comment


LGTM!

pgwillia merged commit 0c0bcde into master on Mar 14, 2023
pgwillia deleted the robots_site_delay branch on March 14, 2023 at 21:12