Amirul Menjeni edited this page Oct 8, 2017 · 14 revisions

Core Concept

The spiders employed in dmine "see" a target website as a collection of components that together make up the whole site. Each component contains one or more attributes that act as factors in determining which data to collect and which data to ignore when scraping.
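One way to picture this model is a component as a named set of attributes. This is a rough sketch, not dmine's actual internals: the `Component` class and the sample values are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Component:
    """Hypothetical stand-in for a scraped component (not dmine's real class)."""
    name: str                                   # e.g. "post"
    attributes: Dict[str, Any] = field(default_factory=dict)


# A "post" component with two attributes a filter could inspect.
post = Component("post", {"title": "A funny cat", "score": 12})
```

A spider would then decide whether to keep `post` by looking at values such as `post.attributes["score"]`.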

To make data filtering simple and robust for every spider employed under dmine, a small language called Scrape Filter Language (SFL) was introduced.

The following one-liner shows SFL in use. Suppose a website has a "thing", or component, called post: user-submitted content with a title and a numerical score that represents how many votes the post has gained so far. As our spider ventures into the target site, we can make it collect only posts with a positive score and the word 'funny' in the title, like so:

post { score > 0 and 'funny' in title }
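To make the meaning of the expression concrete, here is a plain-Python reading of the same condition. It is only a sketch of the filter's intent, not how dmine parses or evaluates SFL: the `keep_post` function and the sample posts are made up for illustration, and the case-sensitive substring match is an assumption.

```python
def keep_post(post):
    """Plain-Python reading of: post { score > 0 and 'funny' in title }"""
    # Assumption: 'in' is treated as a case-sensitive substring test.
    return post["score"] > 0 and "funny" in post["title"]


# Made-up sample data standing in for scraped post components.
posts = [
    {"title": "a funny cat video", "score": 42},
    {"title": "a funny but unpopular joke", "score": -3},
    {"title": "serious news", "score": 10},
]

# Only posts that satisfy both conditions are collected.
collected = [p for p in posts if keep_post(p)]
```

Here only the first post is collected: the second has a non-positive score, and the third lacks the word 'funny' in its title.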