PHP parser/handler for the Robots Exclusion Protocol (robots.txt and more...)
Implementation status and planned work:

- implements http://www.robotstxt.org/norobots-rfc.txt
  - [DONE] "3.2.2 The Allow and Disallow lines" - as test-case
  - [DONE] "4. Examples" - as test-case
- passing Nutch's test cases
  - [DONE] @see tests/Diggin/RobotRules/Imported/NutchTest.php
- parsing & handling html-meta
- handle Crawl-Delay
- sync with / test a few patterns against Google's robots.txt testing tool
- rewrite with PHPPEG (the current preg*-based parser is difficult to work with)
- more tests and ongoing refactoring
Usage example:

```php
<?php
use Diggin\RobotRules\Accepter\TxtAccepter;
use Diggin\RobotRules\Parser\TxtStringParser;

$robotstxt = <<<'ROBOTS'
# sample robots.txt
User-agent: YourCrawlerName
Disallow:

User-agent: *
Disallow: /aaa/ #comment
ROBOTS;

$accepter = new TxtAccepter;
$accepter->setRules(TxtStringParser::parse($robotstxt));

// "foo" matches no named record, so the wildcard (*) record applies
$accepter->setUserAgent('foo');
var_dump($accepter->isAllow('/aaa/'));   // false
var_dump($accepter->isAllow('/b.html')); // true

// the named record has an empty Disallow, so everything is allowed
$accepter->setUserAgent('YourCrawlerName');
var_dump($accepter->isAllow('/aaa/'));   // true
```
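The Allow and Disallow handling from section 3.2.2 of the RFC is covered by the library's test cases (see the status list above). As a rough sketch of how that first-match behaviour could be exercised through the same API: the record and paths below are made up for illustration, and the expected results assume the accepter evaluates rules in the order they appear in the record, as the RFC prescribes.

```php
<?php
use Diggin\RobotRules\Accepter\TxtAccepter;
use Diggin\RobotRules\Parser\TxtStringParser;

// Hypothetical record: the Allow line appears before the broader Disallow
// line, so for URLs under /private/shared/ the Allow rule should match first
// (per the RFC's in-order, first-match evaluation).
$robotstxt = <<<'ROBOTS'
User-agent: *
Allow: /private/shared/
Disallow: /private/
ROBOTS;

$accepter = new TxtAccepter;
$accepter->setRules(TxtStringParser::parse($robotstxt));
$accepter->setUserAgent('AnyBot');

var_dump($accepter->isAllow('/private/shared/page.html')); // expected: true
var_dump($accepter->isAllow('/private/secret.html'));      // expected: false
```

In practice the robots.txt content would usually come from a live fetch (for example via an HTTP client or file_get_contents()), and the response body can be passed to TxtStringParser::parse() exactly as shown.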
Diggin_RobotRules follows PSR-0; register the Diggin\RobotRules namespace with your class loader.
To install via Composer:
- $ php composer.phar require diggin/diggin-robotrules "dev-master"
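After installing via Composer, requiring vendor/autoload.php makes the classes available. If you are not using Composer, here is a minimal PSR-0 autoloader sketch (not part of the library); the src/ base directory is an assumption about where the sources are placed and should be adjusted to your checkout:

```php
<?php
// Minimal PSR-0 autoloader sketch (not part of the library itself).
// Assumption: the library sources live under ./src, so that
// Diggin\RobotRules\Accepter\TxtAccepter resolves to
// ./src/Diggin/RobotRules/Accepter/TxtAccepter.php.
spl_autoload_register(function ($class) {
    $prefix = 'Diggin\\RobotRules\\';
    if (strpos($class, $prefix) !== 0) {
        return; // not a Diggin\RobotRules class; leave it to other autoloaders
    }
    // PSR-0 maps namespace separators to directory separators
    // (these class names contain no underscores, so that part of the rule is skipped)
    $path = __DIR__ . '/src/' . str_replace('\\', '/', $class) . '.php';
    if (is_file($path)) {
        require $path;
    }
});
```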
Diggin_RobotRules is licensed under the New BSD License.
Related robots.txt libraries in other languages:
- Perl
- Python
- Ruby