Automatically add exclusion rules based on robots.txt #631

Open

benoit74 opened this issue Jun 27, 2024 · 5 comments
@benoit74
Contributor

It would be nice if the crawler could automatically fetch robots.txt and add an exclusion rule for every rule present in the file.

I think this functionality should even be turned on by default, to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.

At Kiwix, we have lots of non-tech users configuring zimit to do a browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^
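
For illustration, a rough sketch of the behavior I have in mind (hypothetical code, not part of the crawler; the helper name and the use of the global fetch API are my own assumptions):

```ts
// Hypothetical sketch: fetch robots.txt for a seed and collect its Disallow paths,
// which could then be turned into exclusion rules.
async function fetchDisallowPaths(seedUrl: string): Promise<string[]> {
  const robotsUrl = new URL("/robots.txt", seedUrl).href;
  const resp = await fetch(robotsUrl);
  if (!resp.ok) {
    return []; // no robots.txt, nothing to exclude
  }
  const text = await resp.text();
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((path) => path.length > 0);
}
```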

@rgaudin
Contributor

rgaudin commented Jun 27, 2024

Despite its name, robots.txt is meant to prevent (well, really just to give directions to) indexing robots exploring resources. browsertrix-crawler is a technical bot, but it acts as a user and is certainly not an indexing bot.

I don't see value in such a feature, but I can imagine there are scenarios where it can be useful. @benoit74 do you have one to share?

Without further information, I'd advise against having this (not-yet-existent) feature on by default, as it changes the crawler's behavior, while I think this project uses explicit flags for that.

@benoit74
Contributor Author

The first use case is https://forums.gentoo.org/robots.txt, where the robots.txt content indicates fairly accurately what we should exclude from a crawl of the https://forums.gentoo.org/ website:

Disallow: /cgi-bin/
Disallow: /search.php
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /statistics.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /login.php
Disallow: /posting.php

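For comparison, achieving the same result by hand today means passing each of those paths as an exclusion regex, along these lines (illustrative invocation shown without the surrounding docker run command; the exact regexes and option spelling should be double-checked against the crawler's --exclude documentation):

```sh
crawl --url https://forums.gentoo.org/ \
  --exclude "forums\.gentoo\.org/cgi-bin/" \
  --exclude "forums\.gentoo\.org/search\.php" \
  --exclude "forums\.gentoo\.org/admin/"
  # ...plus one --exclude per remaining Disallow path
```
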
The idea behind automatically using robots.txt is to help lazy or not-so-knowledgeable users get a first version of a WARC/ZIM that is likely to contain only useful content, rather than wasting time and resources (ours and the upstream server's) building a WARC/ZIM with too many unneeded pages.

Currently, in self-service mode, users tend to simply input the URL https://forums.gentoo.org/ and say "Zimit!". And this is true for "young" Kiwix editors as well.

After that initial run, it might still prove interesting in this case to include /profile.php (user profiles) in the crawl. At the very least, such a choice probably needs to be discussed by humans. But this kind of refinement can be done in a second step, once we realize we are missing it.

If we do not automate something here, the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.

@rgaudin
Contributor

rgaudin commented Jun 27, 2024

This confirms that it can be useful in zimit, via an option (that you'd turn on).

@ikreymer
Member

ikreymer commented Jul 4, 2024

We're definitely aware of robots.txt and generally haven't used it, as it may be too restrictive for browser-based archiving. However, robots.txt may provide a hint for paths to exclude, as you suggest.

The idea would be to gather all the specific Disallow rules, while ignoring something like Disallow: /. Of course, some of the robots rules are URL-specific, but they could also apply to in-page block rules.

An interesting idea: we could extend the existing sitemap handling, which already parses robots.txt:
https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/sitemapper.ts#L209
and simply parse all of the Disallow and Allow rules to create exclusions and inclusions.

Not quite sure how to handle different user agents - perhaps grabbing rules from all of them, or a specific one?

This isn't a priority for us at the moment, but we would welcome a PR that does this!
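
A very rough sketch of what that parsing could look like, as a starting point for a PR (illustrative only; it is not based on the actual sitemapper.ts code, and it simply merges rules from all User-agent sections, which is just one of the options mentioned above):

```ts
interface RobotsScopeRules {
  exclusions: RegExp[]; // built from Disallow lines
  inclusions: RegExp[]; // built from Allow lines
}

// Hypothetical helper: turn robots.txt Allow/Disallow directives into regexes
// usable as crawl exclusions/inclusions. Rules from all User-agent sections are
// merged, and a blanket "Disallow: /" is skipped so it cannot exclude the whole
// site. robots.txt "$" end-of-URL anchors are not handled in this sketch.
function robotsToScopeRules(robotsTxt: string): RobotsScopeRules {
  const exclusions: RegExp[] = [];
  const inclusions: RegExp[] = [];

  for (const rawLine of robotsTxt.split("\n")) {
    // Drop comments and surrounding whitespace.
    const line = rawLine.split("#")[0].trim();
    const match = /^(allow|disallow):\s*(\S+)/i.exec(line);
    if (!match) {
      continue;
    }
    const [, directive, path] = match;
    if (path === "/") {
      continue; // too broad to be useful as an exclusion
    }
    // Escape regex metacharacters, then translate robots.txt "*" wildcards.
    const pattern = path
      .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
      .replace(/\*/g, ".*");
    const rx = new RegExp(pattern);
    if (directive.toLowerCase() === "disallow") {
      exclusions.push(rx);
    } else {
      inclusions.push(rx);
    }
  }
  return { exclusions, inclusions };
}
```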

@benoit74
Contributor Author

benoit74 commented Jul 8, 2024

Good points!

This is not a high priority for us either; let's hope we find time to work on it ^^
