rajeshkmindix/crawler-intro

 
 

Crawler Assignment

The purpose of this assignment is to get you started with Crawler4J in Scala.

Deliverables

  • A link to your fork of this repository containing your changes (yes, you need to fork this project and make the changes in your own repo; see the next section for what to change).
  • A final CSV file containing details for at least 100 products.
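
As a rough illustration of the CSV deliverable, the sketch below renders a list of products as CSV text and writes it to disk. The `Product` case class, its three fields, and the `ProductCsv` object are hypothetical names chosen for this example; the actual columns depend on the assigned site and on what the parser extracts.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical record type; the real fields depend on the assigned site.
final case class Product(name: String, price: String, url: String)

object ProductCsv {
  // Quote a field only when it contains a comma, a double quote, or a line break,
  // doubling any embedded double quotes (RFC 4180 style).
  def escape(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  // One CSV row per product, in the same order as the header.
  def toLine(p: Product): String =
    Seq(p.name, p.price, p.url).map(escape).mkString(",")

  // Header row followed by one row per product.
  def render(products: Seq[Product]): String =
    ("name,price,url" +: products.map(toLine)).mkString("\n")

  // Write the rendered CSV to the given path as UTF-8.
  def write(path: String, products: Seq[Product]): Unit =
    Files.write(Paths.get(path), render(products).getBytes(StandardCharsets.UTF_8))
}
```

Escaping matters here because product names often contain commas or quotes, and an unescaped field would shift every column after it.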

Changes to be done in the code

  • Add exclude filters for CSS, JS, image, and other junk URLs (check BaseCrawler.scala).
  • Add a crawler for the site (the site name will be sent separately by email).
  • Add proxy support (check CrawlDriver.scala). If everybody crawls from the office, all outgoing requests originate from the same IP address, so the site can detect the crawl and block that address. The proxy details will be shared separately by email.
  • Add parser unit tests for the new site(s).
  • Fix the existing unit test for Flipkart as well.
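
The exclude filter from the first item above could look something like the sketch below. It is a standalone helper that uses only the Java standard library; the `UrlFilters` object, the `isExcluded` method, and the extension list are assumptions made for this example and are not taken from BaseCrawler.scala. In Crawler4J, a check like this is typically called from the crawler's `shouldVisit` override on the URL string.

```scala
import java.util.regex.Pattern

object UrlFilters {
  // File extensions treated as non-page resources.
  // The list is an assumption for this sketch; extend it as needed.
  private val ExcludedExtensions: Pattern = Pattern.compile(
    """.*\.(css|js|json|ico|gif|jpe?g|png|bmp|svg|webp|woff2?|ttf|eot|mp3|mp4|avi|zip|gz|pdf)$""",
    Pattern.CASE_INSENSITIVE
  )

  // True when the URL path (ignoring any query string or fragment)
  // ends in one of the excluded extensions.
  def isExcluded(url: String): Boolean = {
    val path = url.takeWhile(c => c != '?' && c != '#')
    ExcludedExtensions.matcher(path).matches()
  }
}
```

The query string and fragment are dropped before matching so that a product page such as `http://example.com/page?img=a.png` is not skipped just because a parameter value looks like an image file.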

Resources

More Resources on CSS Selectors
