For every newly discovered host, check for a robots.txt.
Then, for every URL, check whether access is allowed according to that robots.txt.
This check can be done either at insertion time or at dispatch time. For single runs the two options are practically identical; for longer, multi-scrape runs they can differ, because the robots.txt may change between scrapes.
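A minimal sketch of the per-URL check, assuming a Python crawler and using the standard library's `urllib.robotparser`; the function name and user-agent string are hypothetical, not the project's actual API:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler"  # hypothetical user-agent string


def is_allowed(url: str) -> bool:
    """Fetch the robots.txt of the URL's host and check whether we may crawl the URL."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    parser = RobotFileParser(robots_url)
    parser.read()  # downloads and parses the host's robots.txt
    return parser.can_fetch(USER_AGENT, url)
```

Such a check could be called either when a URL is inserted into the queue or when it is dispatched; dispatch-time checking picks up robots.txt changes during long runs, at the cost of re-fetching the file more often.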
- Multiple subdomains can each have their own robots.txt. Since we store the subdomain as part of the path, finding out whether we already have a robots.txt path for a given subdomain requires a more costly lookup.
- Check whether we are allowed to access a given URL. The check needs to be done with the subdomain in mind, see the first point.
- Download the robots.txt lazily: if no path with a robots.txt exists yet for the given subdomain, add it and eagerly fetch it before continuing the download (see the sketch after this list). Check that no timeouts appear, since we now do two downloads instead of one and therefore occupy a network slot for up to twice as long.
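A sketch of such a lazy, subdomain-aware cache, again assuming Python; the class name, keying by full netloc, and the allow-on-unreachable policy are assumptions for illustration, not the project's storage layout:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Lazily fetches and caches one robots.txt parser per subdomain."""

    def __init__(self, user_agent: str = "MyCrawler"):  # hypothetical user agent
        self.user_agent = user_agent
        self._parsers: dict[str, RobotFileParser] = {}

    def _parser_for(self, scheme: str, netloc: str) -> RobotFileParser:
        # Lazy download: only fetch robots.txt the first time a subdomain is seen.
        if netloc not in self._parsers:
            parser = RobotFileParser(f"{scheme}://{netloc}/robots.txt")
            try:
                parser.read()  # second network request; keep an eye on timeouts here
            except OSError:
                parser.allow_all = True  # assumed policy: allow if robots.txt is unreachable
            self._parsers[netloc] = parser
        return self._parsers[netloc]

    def is_allowed(self, url: str) -> bool:
        parts = urlsplit(url)
        parser = self._parser_for(parts.scheme or "http", parts.netloc)
        return parser.can_fetch(self.user_agent, url)
```

Because the cache is keyed by the full netloc, every subdomain gets its own entry, and the robots.txt is fetched at most once per subdomain, on first use; that first use is where the second download (and the extra network-slot time mentioned above) comes from.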
jogli5er added the todo label ("This should be implemented, is planned and a necessity, therefore not an enhancement.") on Nov 21, 2019