meklu/rbt_prs

A robots.txt parser written in PHP, useful for checking whether or not your bot has access to a given URL. It should probably be cleaned up a little in the future™, as the code is currently on the messier side.
/// copyleft 2012 meklu (public domain)

rbt_prs is a robots.txt parser written in PHP, intended for checking whether or not your bot can access a given URL.

Currently it provides just one function:

    isUrlBotSafe($url, $your_useragent, $robots_txt, $redirects, $debug)

Setting any of the last four parameters to NULL makes them use their default values.

The first argument, $url, is the only required one: the URL you would like to check.

The second argument, $your_useragent, is the user agent string your bot will use. While it isn't mandatory to set one, it is highly advised. To mention this software in your user agent string, you could invoke the function as

    isUrlBotSafe("http://example.com/", "Fancybot/0.1 (awesome) " . RBT_PRS_UA);

Notice the space after the description; the constant RBT_PRS_UA doesn't include one.

The third argument, $robots_txt, is useful when you want to check several URLs with just a single download of robots.txt. It also lets you apply your own rules to a given URL. You could use the argument like this:

    $myrules = file_get_contents("http://example.com/robots.txt");
    isUrlBotSafe("http://example.com/", "Fancybot/0.1", $myrules);
    isUrlBotSafe("http://example.com/contact", "Fancybot/0.1", $myrules);
    isUrlBotSafe("http://example.com/blog", "Fancybot/0.1", $myrules);
    isUrlBotSafe("http://example.com/about", "Fancybot/0.1", $myrules);

The fourth argument, $redirects, only matters when you let the function download robots.txt itself. It lets you handle HTTP redirects more gracefully than failing outright when grabbing the file. Values of 1 or less disable redirects, FALSE leaves things the way they are, and TRUE sets the value to 20. Here's how to take a similar approach with $robots_txt supplied manually:

    $opts = array('http' => array('method' => 'GET', 'max_redirects' => 20));
    $context = stream_context_create($opts);
    $myrules = file_get_contents("http://example.com/robots.txt", FALSE, $context);

The fifth argument, $debug, makes the function verbose, printing all sorts of useful analysis of the rules.

This piece of software is already in real-life use at http://cheesetalks.twolofbees.com/humble/ and was originally written for the site author simply because it was the correct way of doing things.
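Putting the pieces above together, a minimal end-to-end usage sketch might look like the following. Two things here are assumptions rather than documented facts: the include filename "rbt_prs.php", and that isUrlBotSafe() returns a truthy value when the bot is allowed to fetch the URL.

    <?php
    // Minimal usage sketch. Assumptions: the parser lives in "rbt_prs.php"
    // (the actual filename isn't documented above), and isUrlBotSafe()
    // returns a truthy value when access is allowed.
    require_once "rbt_prs.php";

    // Identify the bot; RBT_PRS_UA appends this library's own UA fragment.
    $ua = "Fancybot/0.1 (awesome) " . RBT_PRS_UA;

    // Fetch robots.txt once, following up to 20 redirects, then reuse the
    // downloaded rules for every URL check instead of re-fetching.
    $opts = array('http' => array('method' => 'GET', 'max_redirects' => 20));
    $context = stream_context_create($opts);
    $myrules = file_get_contents("http://example.com/robots.txt", FALSE, $context);

    foreach (array("/", "/contact", "/blog", "/about") as $path) {
        $url = "http://example.com" . $path;
        if (isUrlBotSafe($url, $ua, $myrules)) {
            echo "allowed:    " . $url . "\n";
        } else {
            echo "disallowed: " . $url . "\n";
        }
    }

Reusing $myrules this way avoids hammering the target server with one robots.txt request per checked URL, which is exactly the use case the $robots_txt argument is described as serving.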