A "robots.txt" parsing and querying library for .NET
Closely following the NoRobots RFC, the Robots Exclusion Protocol RFC, and other details on robotstxt.org.
- Load Robots by string, by URI (Async) or by streams (Async)
- Supports multiple user-agents and the wildcard user-agent (`*`)
- Supports `Allow` and `Disallow`
- Supports `Crawl-delay` entries
- Supports `Sitemap` entries
- Supports wildcard paths (`*`) as well as must-end-with declarations (`$`)
- Dedicated parser for the data from the `<meta name="robots" />` tag and the `X-Robots-Tag` header
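For reference, a robots.txt file exercising these directives might look like the following (the host names and paths are purely illustrative):

```
User-agent: *
Allow: /public/
Disallow: /private/
Crawl-delay: 10

User-agent: ExampleBot
Disallow: /*/drafts/
Disallow: /*.pdf$

Sitemap: https://www.example.org/sitemap.xml
```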
Robots Exclusion Tools is licensed under the MIT license. It is free to use in personal and commercial projects.
There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to provide better software for everyone.
The NoRobots RFC was released in 1996 and describes the core syntax that makes up a typical Robots.txt file. There is a new standard being proposed called the Robots Exclusion Protocol RFC (in draft as of August 2022) which would effectively replace it.
The two RFCs have quite a lot of overlap in terms of the core rules. Generally though, the Robots Exclusion Protocol RFC is more flexible when it comes to allowed characters (full UTF-8) and spacing.
The Robots Exclusion Tools library attempts to strike a compatibility balance between the two, allowing some specific quirks of the NoRobots RFC while supporting the expanded character set from the Robots Exclusion Protocol RFC.
Similar to the rules from a "robots.txt" file, there can be in-request rules that decide whether a page allows indexing or following links. Extracting this data from a request isn't currently part of this library, avoiding a dependency on an HTML parser.

If you extract the raw rules from the metatags and the `X-Robots-Tag` header yourself, you can pass those into the parser. The parser takes an array of rules and returns a `RobotsPageDefinition` which allows querying of the rules by user agent.

There is no RFC that defines the formats of metatag or `X-Robots-Tag` data. The parser follows the base formatting rules described in the NoRobots and Robots Exclusion Protocol RFCs regarding fields, combined with rules from Google's documentation on the robots metatag. There are ambiguities in the rules described there (like whether rules are inherited from the global scope), so this library's behaviour may differ from other implementations.
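For the `X-Robots-Tag` header specifically, the raw rule strings can be gathered straight from an HTTP response without any HTML parsing. Below is a rough sketch using `HttpClient`; it is not part of this library, the URL is illustrative, and rules from the `<meta name="robots" />` tag would additionally require an HTML parser:

```csharp
using System;
using System.Linq;
using System.Net.Http;

using var httpClient = new HttpClient();
using var response = await httpClient.GetAsync(new Uri("http://www.example.org/some/page"));

// Collect any X-Robots-Tag header values so they can be passed to RobotsPageParser.FromRules
var pageRules = response.Headers.TryGetValues("X-Robots-Tag", out var headerValues)
    ? headerValues.ToArray()
    : Array.Empty<string>();
```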
Parsing a robots.txt file from a URI and checking whether a URL may be accessed:

```csharp
using TurnerSoftware.RobotsExclusionTools;

var robotsFileParser = new RobotsFileParser();
RobotsFile robotsFile = await robotsFileParser.FromUriAsync(new Uri("http://www.example.org/robots.txt"));

var allowedAccess = robotsFile.IsAllowedAccess(
    new Uri("http://www.example.org/some/url/i-want-to/check"),
    "MyUserAgent"
);
```
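The feature list above also mentions loading robots.txt data from a string or a stream. The sketch below illustrates the idea only: the `FromString` method name and its base URI parameter are assumptions for illustration, so check the library's API for the exact overloads available.

```csharp
using System;
using TurnerSoftware.RobotsExclusionTools;

var robotsText = @"User-agent: *
Disallow: /private/";

var robotsFileParser = new RobotsFileParser();
// Assumed overload: parse in-memory robots.txt text, with a base URI for the site it belongs to
var robotsFile = robotsFileParser.FromString(robotsText, new Uri("http://www.example.org/"));
```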
Parsing page-level rules from the robots metatag and `X-Robots-Tag` header, and querying them by user agent:

```csharp
using TurnerSoftware.RobotsExclusionTools;

//These rules are gathered by you from the Robots metatag and `X-Robots-Tag` header
var pageRules = new[]
{
    "noindex, notranslate",
    "googlebot: none",
    "otherbot: nofollow",
    "superbot: all"
};

var robotsPageParser = new RobotsPageParser();
RobotsPageDefinition robotsPageDefinition = robotsPageParser.FromRules(pageRules);

robotsPageDefinition.CanIndex("SomeNotListedBot/1.0"); //False
robotsPageDefinition.CanFollowLinks("SomeNotListedBot/1.0"); //True
robotsPageDefinition.HasRule("notranslate", "SomeNotListedBot/1.0"); //True

robotsPageDefinition.CanIndex("GoogleBot/1.0"); //False
robotsPageDefinition.CanFollowLinks("GoogleBot/1.0"); //False
robotsPageDefinition.HasRule("notranslate", "GoogleBot/1.0"); //True

robotsPageDefinition.CanIndex("OtherBot/1.0"); //False
robotsPageDefinition.CanFollowLinks("OtherBot/1.0"); //False
robotsPageDefinition.HasRule("notranslate", "OtherBot/1.0"); //True

robotsPageDefinition.CanIndex("superbot/1.0"); //True
robotsPageDefinition.CanFollowLinks("superbot/1.0"); //True
robotsPageDefinition.HasRule("notranslate", "superbot/1.0"); //True
```
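A few things to note about the output above: a user agent with no named entry (like `SomeNotListedBot`) falls back to the global rules, and named user agents still pick up the global `notranslate` rule alongside their own directives (the global-scope inheritance behaviour mentioned earlier). `googlebot: none` is equivalent to `noindex, nofollow`, while `superbot: all` leaves both indexing and link following allowed. User agent matching is case-insensitive, so "GoogleBot/1.0" matches the `googlebot` rule.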