Scrape Filter Language
SFL is a scrape filtering language that determines which components are to be collected and which are to be ignored. Because its only job is to filter the scraped data, it only needs to concern itself with boolean operations, not arithmetic ones.
There are two types of tokens in SFL that a spider's developer is required to define: an identifier token and a storable token (also called an argument variable).
The identifier token is used to identify components and their attributes. For example, in `post { score > 1 and 'nice' in title }`, we see that `post` is a component, and `score` and `title` are its attributes.
The storable token, on the other hand, is used to store a value, which is then passed to the `ScrapeFilter` object defined in a spider (see the creating a spider wiki page for details). For example, in `@scan_limit = 99`, we see that `@scan_limit` is a storable token, and the value token `99` is stored in it. A storable token only accepts value tokens (i.e. boolean, string, and number).
SFL has value tokens, which are compared with a component's attributes or assigned to storables. The value tokens are of three types:
- boolean, e.g. `True`, `False`.
- string, e.g. `"Hello World"`, `'hi'`, or `'c'`.
- number, e.g. the integer `99` or the decimal `99.9999`.
These value tokens can be stored in a list, similar to Python's list: enclosed in `[` and `]` and delimited by commas (`,`).
A list may also contain another list, e.g. a list containing all three types of value token: `[1, 'hello world', [5, 9], False]`.
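Since the value tokens follow Python's literal syntax, their shape can be illustrated with Python's own `ast.literal_eval`. The sketch below only shows how the three value types and nested lists map onto Python values; `parse_value` is a hypothetical helper written for this page, and nothing here describes how dmine itself parses SFL.

```python
import ast


def parse_value(text):
    """Parse an SFL-style value token (hypothetical helper, not dmine's API)."""
    value = ast.literal_eval(text)
    _check(value)
    return value


def _check(value):
    # Only booleans, strings, numbers, and (nested) lists of those are
    # treated as value tokens; anything else is rejected.
    if isinstance(value, list):
        for item in value:
            _check(item)
    elif not isinstance(value, (bool, str, int, float)):
        raise ValueError(f"not an SFL value token: {value!r}")


print(parse_value("[1, 'hello world', [5, 9], False]"))
```

Literals that Python accepts but SFL does not describe, such as dictionaries or tuples, are rejected with a `ValueError` in this sketch.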
The operator symbols in SFL closely follow those used in Python. Below is a list of the operators available in SFL and their descriptions.
Operator | Description |
---|---|
`a == b` | Returns True if `a` and `b` match. |
`a != b` | Returns True if `a` and `b` do not match. |
`a < b` | Returns True if `a` is less than `b`. |
`a <= b` | Returns True if `a` is less than or equal to `b`. |
`a > b` | Returns True if `a` is greater than `b`. |
`a >= b` | Returns True if `a` is greater than or equal to `b`. |
`a in b` | Returns True if the string `b` contains the substring `a`. *1 |
`(...)` | Used to group a series of boolean operations. |
`a search b` | Returns True if the string `b` matches the regular expression `a`. |
`not` | Negates the value returned by other operators. |

*1 `b` must be either a string or a list of values (i.e. `[val1, val2, ...]`).
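To make the operator descriptions concrete, here is a minimal Python sketch of one way they could be modelled. The `OPERATORS` table and `apply_operator` are hypothetical names, and mapping `search` onto `re.search` is an assumption based on the operator's name rather than something this page confirms.

```python
import operator
import re

# Hypothetical mapping from SFL operator symbols to Python callables.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    # `a in b`: b is a string or a list of values.
    "in": lambda a, b: a in b,
    # `a search b`: a is the regular expression, b is the string searched.
    "search": lambda a, b: re.search(a, b) is not None,
}


def apply_operator(symbol, a, b, negate=False):
    """Apply one operator; negate=True models a preceding `not`."""
    result = OPERATORS[symbol](a, b)
    return not result if negate else result


print(apply_operator("in", "nice", "a nice title"))
print(apply_operator("search", r"^\d+$", "123"))
```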
SFL only does two things: it evaluates whether or not a component should be scraped, and it passes a value to an argument variable of a spider.
The basic syntax for evaluating components comp_1, comp_2, ..., comp_n looks like this:
comp_1 { *evaluate* } comp_2 { *evaluate* } ... comp_n { *evaluate* }
The evaluation of each component comp_i is done inside the block of curly braces (`{ ... }`) that follows it.
Suppose we have a spider that defines two components called `post` and `comment`, where `post` has the attributes `score`, `author`, and `title`, while `comment` has the attributes `body` and `score`. Then the syntax for evaluating `post` and `comment` would look something like this:
post {score > 1 and 'awesome' in title} comment {0 < score < 100 and 'nice' in body}
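Read as plain Python, the two filters above amount to the predicates below. This is only an illustration of the intended meaning: representing a component as a dictionary of its attributes is an assumption, and `post_passes` and `comment_passes` are hypothetical names.

```python
def post_passes(post):
    # Mirrors: post { score > 1 and 'awesome' in title }
    return post["score"] > 1 and "awesome" in post["title"]


def comment_passes(comment):
    # Mirrors: comment { 0 < score < 100 and 'nice' in body }
    return 0 < comment["score"] < 100 and "nice" in comment["body"]


post = {"score": 12, "author": "alice", "title": "an awesome find"}
comment = {"score": 150, "body": "nice one"}
print(post_passes(post), comment_passes(comment))
```

A component is collected only when its predicate holds, so here the post would be kept while the comment, whose score falls outside the range, would be ignored.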
Suppose also that the spider accepts a string value for its argument variable called `sections`. Then we can pass the value like so:
@sections = 'hot, trending, new'
SFL ignores whitespace and newlines. This makes it possible to lay out an SFL script freely in a file to be read by dmine for data filtering.
@sections = 'hot, trending, new'
post {
score > 1
and 'awesome' in title
and not ('gore' in title)
}
comment {
0 < score < 100 and 'nice' not in body
}
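One way to see why layout does not matter is to look at the token stream: whitespace only separates tokens and never becomes one. The regex-based tokenizer below is a rough sketch under that assumption; `TOKEN_RE` and `tokenize` are hypothetical and are not taken from dmine.

```python
import re

# Each token may be preceded by any amount of whitespace, including newlines.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<string>'[^']*'|"[^"]*")
      | (?P<number>\d+(?:\.\d+)?)
      | (?P<storable>@\w+)
      | (?P<name>[A-Za-z_]\w*)
      | (?P<symbol><=|>=|==|!=|[<>=\[\](){},])
    )
""", re.VERBOSE)


def tokenize(source):
    """Split SFL source into a flat list of token strings."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(match.lastgroup))
        pos = match.end()
    return tokens


one_line = "post { score > 1 and 'awesome' in title }"
multi_line = """
post {
    score > 1
    and 'awesome' in title
}
"""
print(tokenize(one_line) == tokenize(multi_line))
```

Both layouts produce the same tokens, so a parser working on the token list cannot tell them apart.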
We use EBNF (Extended Backus-Naur Form) to describe SFL's grammar concisely.
The following EBNF defines SFL:
expr ::=
{
{ identifier "{" eval "}" }
{ storable "=" factor }
}
eval ::= term { ( "and" | "or" ) term }
opt ::= ( "<" | "<=" | ">" | ">=" | "==" | "!=" | ["not"] "in" | "search" )
term ::= factor { opt factor }
| "not" term
list ::= "[" [ value { "," value } ] "]"
value ::= "string" | "number" | "boolean" | list
identifier ::= { "letter" | "_" }+, { "digit" | "letter" | "_" }
storable ::= "@", { "letter" | "digit" | "_" }+
factor ::=
value
| identifier
| "(" eval ")"
| storable
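As a rough illustration of how the `eval`, `term`, and `factor` rules fit together, here is a small recursive-descent evaluator for the expression inside one `{ ... }` block. It is a sketch under several assumptions, not dmine's implementation: a component is a dictionary of attributes, a lone factor is accepted as a term so that `not ('gore' in title)` parses, `and` and `or` are applied left to right with no precedence between them, chained comparisons behave as in Python, and storables and `search` are left out. All names (`Evaluator`, `evaluate`, `COMPARE`) are hypothetical.

```python
import re

# Lexer for the inside of a `{ ... }` block. Note that findall() silently
# skips characters it cannot match; a real lexer should reject them.
TOKEN_RE = re.compile(
    r"""('[^']*'|"[^"]*"|\d+(?:\.\d+)?|\w+|<=|>=|==|!=|[<>\[\](),])"""
)

COMPARE = {
    "==": lambda a, b: a == b, "!=": lambda a, b: a != b,
    "<": lambda a, b: a < b, "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b, ">=": lambda a, b: a >= b,
    "in": lambda a, b: a in b, "not in": lambda a, b: a not in b,
}


class Evaluator:
    def __init__(self, source, attributes):
        self.tokens = TOKEN_RE.findall(source)
        self.pos = 0
        self.attributes = attributes

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def eval_expr(self):
        # One or more terms joined by "and" / "or", applied left to right.
        result = self.term()
        while self.peek() in ("and", "or"):
            joiner = self.take()
            rhs = self.term()  # always parse the right-hand side
            result = (result and rhs) if joiner == "and" else (result or rhs)
        return bool(result)

    def term(self):
        if self.peek() == "not":
            self.take()
            return not self.term()
        left = result = self.factor()
        chained = False
        while (op := self.operator()) is not None:
            right = self.factor()
            outcome = COMPARE[op](left, right)
            # 0 < score < 100 means (0 < score) and (score < 100).
            result = (result and outcome) if chained else outcome
            chained, left = True, right
        return result

    def operator(self):
        token = self.peek()
        following = self.tokens[self.pos + 1:self.pos + 2]
        if token == "not" and following == ["in"]:
            self.pos += 2
            return "not in"
        if token in COMPARE:
            self.pos += 1
            return token
        return None

    def factor(self):
        token = self.take()
        if token == "(":
            value = self.eval_expr()
            self.take(")")
            return value
        if token == "[":
            items = []
            while self.peek() != "]":
                items.append(self.factor())
                if self.peek() == ",":
                    self.take()
            self.take("]")
            return items
        if token[0] in "'\"":
            return token[1:-1]
        if token in ("True", "False"):
            return token == "True"
        if token[0].isdigit():
            return float(token) if "." in token else int(token)
        return self.attributes[token]  # identifier: look up the attribute


def evaluate(source, attributes):
    evaluator = Evaluator(source, attributes)
    result = evaluator.eval_expr()
    if evaluator.peek() is not None:
        raise SyntaxError(f"unexpected token {evaluator.peek()!r}")
    return result


post = {"score": 5, "title": "an awesome day"}
print(evaluate("score > 1 and 'awesome' in title and not ('gore' in title)", post))
```

Each method corresponds to one grammar rule, which is the usual shape of a recursive-descent parser; handling the outer `identifier { ... }` and `storable = factor` forms would be a thin layer on top of this.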