-
Notifications
You must be signed in to change notification settings - Fork 0
fork of scrapi 1.2.0
License
noahhaon/scrapi
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
== NOTES This is a fork of aasaf's scrapi-1.2.0 gem. The intent of the fork was to support handling of cookies. You may pass in your cookie string in the Scrapi.scrape options hash like so: scraper.scrape(url, :user_agent => 'Mosaic for Amiga/1.2', :cookies => 'chocolate_chip=1; peanut_butter=5;' ) == ScrAPI toolkit for Ruby A framework for writing scrapers using CSS selectors and simple select => extract => store processing rules. Here’s an example that scrapes auctions from eBay: ebay_auction = Scraper.define do process "h3.ens>a", :description=>:text, :url=>"@href" process "td.ebcPr>span", :price=>:text process "div.ebPicture >a>img", :image=>"@src" result :description, :url, :price, :image end ebay = Scraper.define do array :auctions process "table.ebItemlist tr.single", :auctions => ebay_auction result :auctions end And using the scraper: auctions = ebay.scrape(html) # No. of auctions found puts auctions.size # First auction: auction = auctions[0] puts auction.description puts auction.url To get the latest source code with regular updates: git clone [email protected]:noahhaon/scrapi.git == Using TIDY By default scrAPI uses Tidy to cleanup the HTML. You need to install the Tidy Gem for Ruby: gem install tidy And the Tidy binary libraries, available here: http://tidy.sourceforge.net/ By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy. That's one place to place the Tidy library. Alternatively, just point Tidy to the library with: Tidy.path = "...." On Linux this would probably be: Tidy.path = "/usr/local/lib/libtidy.so" On OS/X this would probably be: Tidy.path = “/usr/lib/libtidy.dylib” For testing purposes, you can also use the built in HTML parser. It's useful for testing and getting up to grabs with scrAPI, but it doesn't deal well with broken HTML. So for testing only: Scraper::Base.parser :html_parser == License Copyright (c) 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License Developed for http://co.mments.com Code and documention: http://labnotes.org HTML cleanup and good hygene by Tidy, Copyright (c) 1998-2003 World Wide Web Consortium. License at http://tidy.sourceforge.net/license.html HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Under MIT license. HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license. http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html
About
fork of scrapi 1.2.0
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published