
Scrape Filter Language

Amirul Menjeni edited this page Nov 21, 2017 · 12 revisions

About SFL

SFL is a scrape filtering language that determines which components are collected and which are ignored. To filter scraped data, it therefore only needs to concern itself with boolean operations, not arithmetic ones.

Syntax

Tokens

Dynamic Tokens

There are two types of tokens in SFL that a spider's developer is required to define: an identifier token and a storable token (also called an argument variable).

The identifier token is used to identify components and their attributes. For example, in

post { score > 1 and 'nice' in title},

post is a component, and score and title are its attributes.

On the other hand, the storable token is used to store a value, which is then passed to the ScrapeFilter object defined in a spider (see the wiki page on creating a spider for details). For example, in @scan_limit = 99, @scan_limit is a storable token and the value token 99 is stored in it. A storable token only accepts value tokens (i.e. boolean, string, and number).

Value Tokens

SFL has value tokens, which are compared with a component's attributes or assigned to storables. There are three types of value token:

  1. boolean, e.g. True, False.

  2. string, e.g. "Hello World", 'hi', or 'c'.

  3. number, e.g. the integer 99 or the decimal 99.9999.

These value tokens can be stored in a list, similar to Python's list: enclosed in [ and ] and delimited by commas (,). A list may also contain another list.

e.g. a list containing all three types of value token, plus a nested list: [1, 'hello world', [5, 9], false].
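
To make the value tokens concrete, here is a minimal Python sketch of how they might be read into native Python values. It is not dmine's actual parser: the names `tokenize` and `parse_value` are hypothetical, escape sequences inside strings are not handled, and booleans are accepted in either capitalisation only because both `True` and `false` appear in the examples above.

```python
import re

# Pattern for value tokens and list punctuation. This is an assumption
# derived from the examples above, not the project's real lexer.
_TOKEN = re.compile(r"""
    \s*(?:
        (?P<number>\d+(?:\.\d+)?)
      | (?P<string>'[^']*'|"[^"]*")
      | (?P<boolean>[Tt]rue|[Ff]alse)
      | (?P<punct>[\[\],])
    )
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, lexeme) pairs for a value expression."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def _parse(tokens, i):
    """Parse one value starting at index i; return (value, next_index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, lexeme = tokens[i]
    if kind == "number":
        return (float(lexeme) if "." in lexeme else int(lexeme)), i + 1
    if kind == "string":
        return lexeme[1:-1], i + 1          # strip the surrounding quotes
    if kind == "boolean":
        return lexeme.lower() == "true", i + 1
    if (kind, lexeme) == ("punct", "["):
        items, i = [], i + 1
        while i < len(tokens) and tokens[i] != ("punct", "]"):
            item, i = _parse(tokens, i)     # recursion handles nested lists
            items.append(item)
            if i < len(tokens) and tokens[i] == ("punct", ","):
                i += 1
        if i >= len(tokens):
            raise ValueError("unterminated list")
        return items, i + 1                 # skip the closing bracket
    raise ValueError(f"unexpected token {lexeme!r}")


def parse_value(text):
    """Convert the text of one SFL value into a Python value."""
    tokens = tokenize(text)
    value, end = _parse(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing tokens after value")
    return value
```

With this sketch, `parse_value("[1, 'hello world', [5, 9], false]")` yields a Python list holding an int, a str, a nested list, and a bool.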

Operator Tokens

The operator symbols in SFL closely follow those used in Python. Below is a list of the operators available in SFL and their descriptions.

| Operator | Description |
| --- | --- |
| a == b | Returns True if a and b match. |
| a != b | Returns True if a and b do not match. |
| a < b | Returns True if a is less than b. |
| a <= b | Returns True if a is less than or equal to b. |
| a > b | Returns True if a is greater than b. |
| a >= b | Returns True if a is greater than or equal to b. |
| a in b | Returns True if the string b contains the substring a. (*1) |
| (...) | Used to group a series of boolean operations. |
| a search b | Returns True if the string b matches the regular expression a. |
| not | Negates the value returned by other operators. |

*1 b must be either a string or a list of values (i.e. [ val1, val2, ... ]).
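
Since the operators mirror Python's, their semantics can be sketched with Python's own `operator` and `re` modules. This is only an illustration under assumptions: `OPERATORS` and `apply_operator` are hypothetical names, `search` is assumed to behave like `re.search` (a match anywhere in the string), and `not` is modelled as a flag rather than as a separate operator.

```python
import operator
import re


def _contains(a, b):
    """a in b: substring test when b is a string, membership when b is a list."""
    return a in b


def _search(a, b):
    """a search b: assumed to mean that regex a matches somewhere in string b."""
    return re.search(a, b) is not None


# Map each SFL operator symbol to a Python callable taking (a, b).
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": _contains,
    "search": _search,
}


def apply_operator(a, op, b, negate=False):
    """Evaluate `a op b`; negate=True models a preceding `not`."""
    result = bool(OPERATORS[op](a, b))
    return not result if negate else result
```

For example, `apply_operator('nice', 'in', 'a nice post')` evaluates the same test as `'nice' in title` would for a title of 'a nice post'.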

Basic Syntax

SFL only does two things: it evaluates whether or not a component should be scraped, and it passes a value to an argument variable of a spider.

The basic syntax for evaluating some components comp_1, comp_2, ..., comp_n looks like this:

comp_1 { *evaluate* } comp_2 { *evaluate* } ... comp_n { *evaluate* }

Each component comp_i is evaluated by the expression inside the block of curly braces ({ ... }) that follows it.

Suppose we have a spider that defines two components called post and comment, where post has the attributes score, author, and title, and comment has the attributes body and score. Then the syntax for evaluating post and comment would look something like this:

post {score > 1 and 'awesome' in title}
comment {0 < score < 100 and 'nice' in body}
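
For readers who think in Python, the two filters above read roughly like the predicates below. This is only an analogy: representing a component as a dict of attributes, and the function names `keep_post` and `keep_comment`, are assumptions made for illustration and say nothing about how dmine stores components.

```python
def keep_post(post):
    """Mirror of: post {score > 1 and 'awesome' in title}"""
    return post["score"] > 1 and "awesome" in post["title"]


def keep_comment(comment):
    """Mirror of: comment {0 < score < 100 and 'nice' in body}"""
    # Python chains comparisons the same way the SFL example does.
    return 0 < comment["score"] < 100 and "nice" in comment["body"]


# Hypothetical scraped data, used only to exercise the predicates.
posts = [
    {"score": 5, "author": "a", "title": "an awesome day"},
    {"score": 1, "author": "b", "title": "awesome but low score"},
]
kept_posts = [p for p in posts if keep_post(p)]
```

Only the first post passes, because the second one fails the `score > 1` test.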

Suppose also that the spider accepts some string value for its argument variable called sections. Then we can pass the value like so:

@sections = 'hot, trending, new'

SFL ignores whitespace and newlines. This makes it possible to lay out an SFL script freely in a file that dmine reads for data filtering.

@sections = 'hot, trending, new'

post {
   score > 1 
   and 'awesome' in title
   and not ('gore' in title)
}

comment {
   0 < score < 100 and 'nice' not in body
}
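
Because whitespace is insignificant, a script like the one above can be separated into its storable assignments and component blocks before any evaluation happens. The sketch below shows one way this might be done with regular expressions. It is not dmine's implementation: `split_script` is a hypothetical helper, it assumes that string values contain no curly braces, and it only handles scalar storable values, not lists.

```python
import re

# name { body }, where body contains no nested braces (an assumption).
BLOCK = re.compile(r"([A-Za-z_]\w*)\s*\{([^{}]*)\}")

# @name = value, where value is a quoted string or a run of non-space characters.
# The (?!=) guard keeps "@x == 1" from being read as an assignment.
STORABLE = re.compile(r"@(\w+)\s*=(?!=)\s*('[^']*'|\"[^\"]*\"|[^\s{}]+)")


def split_script(script):
    """Return (storables, blocks) as two dicts of raw, unevaluated text."""
    # Collapse each block body onto one line; whitespace carries no meaning.
    blocks = {
        name: " ".join(body.split())
        for name, body in BLOCK.findall(script)
    }
    # Look for assignments only outside the component blocks.
    remainder = BLOCK.sub("", script)
    storables = dict(STORABLE.findall(remainder))
    return storables, blocks
```

Applied to the script above, this gives one storable, `sections`, and two blocks, `post` and `comment`, each holding its filter expression as a single line of text.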

EBNF Expression

We use an EBNF (Extended Backus-Naur Form) expression to describe the grammar of SFL concisely.

The following EBNF defines SFL:

    expr ::=
    {
        { identifier "{" eval "}" }
        { storable "=" factor }
    }

    eval ::= term ( "and" | "or" ) term { ( "and" | "or" ) term }

    opt ::= ( "<" | "<=" | ">" | ">=" | "==" | "!=" | ["not"] "in" )

    term ::=  factor opt factor { opt factor }
            | [ "not" ] term

    list ::= "[" { value { "," value } } "]"

    value ::= "string" | "number" | "boolean" | list

    identifier ::= { "letter" | "_" }+, { "digit" | "letter" | "_" }

    storable ::= "@", { "letter" | "digit" | "_" }+

    factor ::=
              value
            | identifier
            | "(" eval ")"
            | storable
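
The terminal symbols of this grammar can be recognised with a small regex-based lexer. The Python sketch below is an illustration only, not dmine's lexer: the token names and the `tokenize` function are hypothetical, `search` is treated as a keyword because it appears in the operator table even though the `opt` rule does not list it, and booleans are accepted in either capitalisation as an assumption.

```python
import re

# Order matters: keywords and booleans must be tried before identifiers,
# and two-character operators before their one-character prefixes.
TOKEN_SPEC = [
    ("STRING",   r"'[^']*'|\"[^\"]*\""),
    ("NUMBER",   r"\d+(?:\.\d+)?"),
    ("STORABLE", r"@\w+"),
    ("KEYWORD",  r"\b(?:and|or|not|in|search)\b"),
    ("BOOLEAN",  r"\b(?:[Tt]rue|[Ff]alse)\b"),
    ("IDENT",    r"[A-Za-z_]\w*"),
    ("OP",       r"<=|>=|==|!=|<|>|="),
    ("PUNCT",    r"[{}()\[\],]"),
    ("SKIP",     r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(script):
    """Return a list of (kind, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(script):
        match = MASTER.match(script, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected character {script[pos]!r} at position {pos}"
            )
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

A parser for the rules above could then consume this token list, for instance by recursive descent with one function per rule.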
