
Add crawl delay #3027

Merged

pgwillia merged 7 commits from robots_site_delay into master on Mar 14, 2023
Conversation

pgwillia (Member)

Context

Once upon a time megaindex.com/crawler caused database use headaches. Let's suggest that crawlers that might not be as smart as Google chill a bit.

https://www.semrush.com/blog/beginners-guide-robots-txt/#crawl-delay-directive suggests that Google will ignore this but that Bing, Yandex and our new friend Megaindex.ru will respect this.

I used Google's robots.txt Tester to evaluate. It gives a warning about the crawl-delay directive, just letting us know it will be ignored by Googlebot. https://support.google.com/webmasters/answer/48620?hl=en describes how to limit Googlebot if necessary.

[image: Google's robots.txt Tester showing the crawl-delay warning]

What's New

Add crawl-delay to our robots.txt
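
For reference, with crawlers allowed the served robots.txt would then contain a global group along these lines (a sketch of the change as first proposed, based on the diff and the 5-second value discussed below; later commits in this PR scope the delay to specific agents):

# Once upon a time megaindex.com/crawler caused database use headaches
# Let's suggest that crawlers that might not be as smart as Google chill a bit
User-agent: *
Crawl-delay: 5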

ConnorSheremeta (Contributor) left a comment


I am not too knowledgeable on the particulars of how sites interpret robots.txt, so I may be wrong, but I think that if Rails.application.secrets.allow_crawlers is false, only one of the two rules (Crawl-delay: 5 or Disallow: /) could apply, and it's unclear which one will, potentially making a bot think it is able to crawl.

From https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt:

Only one group is valid for a particular crawler ... The order of the groups within the robots.txt file is irrelevant.

If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.

Also just a small nit regarding the changelog entry missing a link to a PR/issue.
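
To make the concern concrete: with allow_crawlers set to false, the file as first proposed would presumably contain two groups aimed at the same agents, roughly like the sketch below (a hypothetical rendering; the full template isn't shown in this PR view). Whether a given crawler merges these rules or picks one group is exactly the ambiguity being raised.

User-agent: *
Crawl-delay: 5

User-agent: *
Disallow: /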

pgwillia (Member, Author)

I am not too knowledgeable on the particulars of how sites interpret robots.txt, so I may be wrong, but I think that if Rails.application.secrets.allow_crawlers is false, only one of the two rules (Crawl-delay: 5 or Disallow: /) could apply, and it's unclear which one will, potentially making a bot think it is able to crawl.

From https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt:

Only one group is valid for a particular crawler ... The order of the groups within the robots.txt file is irrelevant.

If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.

Also just a small nit regarding the changelog entry missing a link to a PR/issue.

Thanks for your thoughtful review. I can see that I rushed this.

I added the Changelog link. I also added conditional logic so that only one Agent block is present.

This got me thinking if it would be worthwhile naming the agents we want to crawl more slowly.
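
A minimal sketch of what the conditional described above (only one Agent block rendered at a time) could look like in the robots.txt ERB template; the comments and exact wording here are assumptions, and only the if/else shape is what's being described:

<% if Rails.application.secrets.allow_crawlers %>
# Crawling allowed, but ask bots to slow down
User-agent: *
Crawl-delay: 5
<% else %>
# Crawling disabled entirely
User-agent: *
Disallow: /
<% end %>

With this shape a crawler only ever sees one User-agent: * group, so there is no question of which rule wins.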

ConnorSheremeta (Contributor) left a comment


LGTM!

ConnorSheremeta (Contributor) commented Dec 23, 2022


Might be worthwhile restricting based on the agent. Restricting it to a request every 5 seconds should help a ton, as one IP alone this morning was making about two requests per second.

Commits added to the PR, with messages including:

If enabled, let the agents know to crawl slowly. If disabled, let the agents know not to crawl at all.

MegaIndex.ru was our problem trigger. We can add more if/when we experience them.
@@ -3,7 +3,7 @@
<% if Rails.application.secrets.allow_crawlers %>
# Once upon a time megaindex.com/crawler caused database use headaches
# Let's suggest that crawlers that might not be as smart as Google chill a bit
User-agent: *
ConnorSheremeta (Contributor) commented on this diff

I feel like it would be worthwhile to still have a general crawl-delay rule for all user-agents, just in case we get bombarded with requests in the future from a bot that would've otherwise followed the rule. Maybe a delay that's just more relaxed than the one for MegaIndex.ru?
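
One possible shape for that (a sketch only: the 2-second value is illustrative, and MegaIndex.ru is assumed to be the user-agent token used in the template):

User-agent: *
Crawl-delay: 2

User-agent: MegaIndex.ru
Crawl-delay: 5

Per the Google documentation quoted earlier, agent-specific groups and the global * group are not combined, so the MegaIndex.ru crawler would follow only its own, stricter delay.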

ConnorSheremeta (Contributor) left a comment


LGTM!

pgwillia merged commit 0c0bcde into master on Mar 14, 2023
pgwillia deleted the robots_site_delay branch on March 14, 2023 at 21:12