Automatically add exclusion rules based on robots.txt
#631
Comments
Despite its name, I don't see value in such a feature, but I can imagine there are scenarios where it can be useful. @benoit74 do you have one to share? Without further information, I'd advise against having this (not yet existing) feature on by default, as it changes the browser's behavior, while I think this project uses explicit flags for this.
First use case is https://forums.gentoo.org/robots.txt where the
The idea behind automatically using
Currently in self-service mode, users tend to simply input the URL
After that initial run, it might prove interesting in this case to still include
If we do not automate something on this, it means the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.
This confirms that it can be useful in zimit, via an option (that you'd turn on)
We're definitely aware of
This isn't a priority for us at the moment, but would welcome a PR that does this!
Good points! This is not a high priority for us either, let's hope we find time to work on it ^^
It would be nice if the crawler could automatically fetch rules from robots.txt and add exclusion rules for every rule present in the robots.txt file.
I think this functionality should even be turned on by default to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.
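To make the request concrete, here is a rough sketch of how Disallow rules could be turned into URL regexes suitable for exclusion rules. It is only an illustration under stated assumptions, not how the crawler does or would implement it: the function names `robots_to_exclusions` and `path_to_regex` are hypothetical, Allow lines and "most specific group wins" precedence are ignored, and fetching the robots.txt file is left out.

```python
import re


def path_to_regex(path: str) -> str:
    """Turn one robots.txt path pattern into a URL regex (hypothetical helper)."""
    # In robots.txt a trailing "$" anchors the end of the URL and "*" matches anything.
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    body = ".*".join(re.escape(piece) for piece in path.split("*"))
    # Assumption: exclusions are matched against full URLs, so prefix scheme and host.
    return "^https?://[^/]+" + body + ("$" if anchored else "")


def robots_to_exclusions(robots_txt: str, user_agent: str = "*") -> list[str]:
    """Collect regexes for Disallow lines in groups that apply to user_agent.

    Simplification: takes the union of the "*" group and any exact-name group,
    and does not subtract Allow rules.
    """
    patterns = []
    group_agents = []  # user-agents named by the group currently being read
    in_rules = False   # True once the current group has had a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            applies = "*" in group_agents or user_agent.lower() in group_agents
            # An empty Disallow means "nothing is disallowed", so skip it.
            if field == "disallow" and value and applies:
                patterns.append(path_to_regex(value))
    return patterns
```

Each returned string could then be handed to the crawler as one exclusion rule; how that wiring would look (a flag, a config entry, or something internal) is an open design question.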
At Kiwix, we have lots of non-tech users configuring zimit to do a Browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^