
Introduction to Crawler

Kummita Sriteja edited this page Apr 19, 2019 · 1 revision
  • Why crawling? To make life easier.

If you want to read today's financial news directly from your email inbox, you can simply subscribe to a provider's RSS feed, such as Google's or the BBC's. Similarly, your system or application can use a provider's API to stay up to date on stock market prices.

But what if the data is unstructured or the site does not offer an RSS feed? How will you fetch it? You could hire people to manually log in and save the information into an Excel sheet, but that process is tedious and impractical.

Let us take a simple example. You run a shopping site with 1000 products and want to make sure your prices are competitive. To do that, you need to monitor your competitors' sites and their prices for the same products. With many products and many competitors, this is very difficult without an automated process. This is where web crawling comes into the picture.

  • What is a web crawler? A web crawler is a program that collects content from the web. Web crawlers are also known as robots, spiders, worms, walkers, and wanderers, and they are almost as old as the Web itself.

  • Typical structure of a web crawler:

[Figure: typical structure of a web crawler]

A web crawler needs a starting point: a set of seed pages. The crawler fetches each seed page and extracts the links (URLs) from its HTML. It stores the page content, indexed and compressed, and keeps extracting links from the HTML pages it reaches. Finally, it resolves all the extracted URLs so that they can be fetched in turn.
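
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration under some assumptions, not the structure shown in the figure: the names `LinkExtractor`, `crawl`, `fetch`, and `PAGES` are hypothetical, "resolving" is taken to mean turning relative links into absolute URLs, and the page content is simply compressed with `zlib` and kept in a dictionary instead of a real index. The `fetch` callable is passed in so the example can run against a small in-memory set of pages without network access.

```python
import zlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each visited URL to its zlib-compressed content.
    """
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid loops
    store = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        # Store the content in compressed form.
        store[url] = zlib.compress(html.encode("utf-8"))
        # Extract links from the HTML page.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page and drop fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store


# Tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b.html": "<p>no links</p>",
}

store = crawl(["http://example.com/"], PAGES.get)
```

A real crawler would replace `PAGES.get` with a function that downloads pages over HTTP, and would usually also respect `robots.txt` and rate limits, which this sketch leaves out.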
