Add crawl delay #3027
Conversation
I am not too knowledgeable on the particulars of how sites interpret robots.txt, so I may be wrong, but I think that if Rails.application.secrets.allow_crawlers is false, only one of the two rules, Crawl-delay: 5 or Disallow: /, could apply, and it's unclear which one will, potentially making a bot think it is able to crawl.
From https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt:
Only one group is valid for a particular crawler ... The order of the groups within the robots.txt file is irrelevant.
If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.
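For illustration, a hypothetical robots.txt (not the actual template output) where a named group and a global group both exist shows how the quoted rule can bite, since a bot that matches a specific group uses only that group:

User-agent: MegaIndex.ru
Crawl-delay: 5

User-agent: *
Disallow: /

Here MegaIndex.ru would apply only its own group, never see the Disallow: /, and conclude it may crawl, just slowly.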
Also, just a small nit: the changelog entry is missing a link to a PR/issue.
Thanks for your thoughtful review. I can see that I rushed this. I added the Changelog link. I also added conditional logic so that only one agent block is present. This got me thinking about whether it would be worthwhile to name the agents we want to crawl more slowly.
LGTM!
Might be worthwhile restricting based on the agent. Restricting it to a request every 5 seconds should help a ton, as one IP alone this morning was making about two requests per second.
Force-pushed from c212aaf to bc9ab41
Once upon a time megaindex.com/crawler caused database use headaches. Let's suggest that crawlers that might not be as smart as Google chill a bit. https://www.semrush.com/blog/beginners-guide-robots-txt/#crawl-delay-directive suggests that Google will ignore this but that Bing, Yandex and our new friend Megaindex.ru will respect this.
If enabled, let the agents know to crawl slowly. If disabled, let the agents know not to crawl at all.
MegaIndex.ru was our problem trigger. We can add more if/when we experience them.
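A rough sketch of the conditional template being described, assuming app/views/robots/robots.text.erb and a 5-second delay; the exact final contents may differ:

<% if Rails.application.secrets.allow_crawlers %>
# Once upon a time megaindex.com/crawler caused database use headaches
# Let's suggest that crawlers that might not be as smart as Google chill a bit
User-agent: MegaIndex.ru
Crawl-delay: 5
<% else %>
# Crawling is disabled entirely
User-agent: *
Disallow: /
<% end %>

Only one block is rendered either way, so a bot never sees a crawl-delay and a blanket disallow at the same time.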
Force-pushed from bc9ab41 to c246eec
app/views/robots/robots.text.erb (Outdated)
@@ -3,7 +3,7 @@
<% if Rails.application.secrets.allow_crawlers %>
# Once upon a time megaindex.com/crawler caused database use headaches
# Let's suggest that crawlers that might not be as smart as Google chill a bit
User-agent: *
I feel like it would be worthwhile to still have a general crawl-delay rule for all user-agents, just in case we get bombarded with requests in the future from a bot that would've otherwise followed the rule, maybe with a delay that's more relaxed than the MegaIndex.ru one?
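Something like this output is presumably what's being suggested (the 1-second general delay is just an illustrative value):

# Gentle default for well-behaved bots that honor crawl-delay
User-agent: *
Crawl-delay: 1

# Stricter delay for the crawler that caused trouble
User-agent: MegaIndex.ru
Crawl-delay: 5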
LGTM!
Context
Once upon a time megaindex.com/crawler caused database use headaches. Let's suggest that crawlers that might not be as smart as Google chill a bit.
https://www.semrush.com/blog/beginners-guide-robots-txt/#crawl-delay-directive suggests that Google will ignore this but that Bing, Yandex and our new friend Megaindex.ru will respect this.
I used Google's robots.txt Tester to evaluate it. It gives a warning about the crawl-delay, just letting us know it will be ignored by Googlebot. https://support.google.com/webmasters/answer/48620?hl=en describes how to limit Googlebot if necessary.
What's New
Add crawl-delay to our robots.txt