Durian Extractor

Web page extractor and readability tool built on Jsoup.

Prerequisites:

Install

This project has not been published to any public Maven repository, so install it into your local repository first:

    mvn clean install

Then add it as a dependency of your project:

    <dependency>
        <groupId>co.mailtarget</groupId>
        <artifactId>durian</artifactId>
        <version>0.0.10</version>
    </dependency>

Usage

###Kotlin

    val extractor = WebExtractor.Builder()
                    .strategy(Strategy.HYBRID)
                    .build()
    
    val webData = extractor.extract(url)

or

    val forceJavascript = false
    val webData = extractor.extract(url, forceJavascript)

###Java

    WebExtractor extractor = new WebExtractor.Builder()
                    .strategy(Strategy.HYBRID)
                    .build();
    WebData webData = extractor.extract(url);

or

    boolean forceJavascript = false;
    WebData webData = extractor.extract(url, forceJavascript);

Options

###Extract Strategy

  • META : fastest method; parses values from the page's meta tags only
  • CONTENT : prefers the page content as the source of extraction
  • HYBRID : reads from meta tags first and, if nothing is found, searches deeper in the content
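
The fallback order described above can be illustrated with a small, self-contained sketch. It does not use this library's internals: the `Strategy` enum below is a stand-in for the real one, and `pick`, `fromMeta` and `fromContent` are hypothetical names introduced only for illustration. Treating `CONTENT` as content-only is also an assumption.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class StrategySketch {

    // Stand-in for the three strategy names listed above (not the library's own type).
    enum Strategy { META, CONTENT, HYBRID }

    /**
     * Chooses a value according to the strategy.
     * fromMeta and fromContent are hypothetical lookups, for example a
     * meta tag reader and a page body reader.
     */
    static Optional<String> pick(Strategy strategy,
                                 Supplier<Optional<String>> fromMeta,
                                 Supplier<Optional<String>> fromContent) {
        switch (strategy) {
            case META:
                // Cheapest path: only the meta lookup runs.
                return fromMeta.get();
            case CONTENT:
                // Assumption: modeled here as content-only.
                return fromContent.get();
            case HYBRID:
            default:
                // Meta first; the content lookup runs only when meta is empty.
                Optional<String> meta = fromMeta.get();
                return meta.isPresent() ? meta : fromContent.get();
        }
    }

    public static void main(String[] args) {
        Supplier<Optional<String>> noMeta = Optional::empty;
        Supplier<Optional<String>> body = () -> Optional.of("Title from body");

        System.out.println(pick(Strategy.META, noMeta, body));   // Optional.empty
        System.out.println(pick(Strategy.HYBRID, noMeta, body)); // Optional[Title from body]
    }
}
```

Passing the lookups as suppliers keeps the content lookup lazy, so under `HYBRID` the more expensive path is only taken when the meta lookup yields nothing.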

###System Config

Tested on macOS, where it works well. On a CentOS machine, install the following first:

    yum groupinstall -y "Fonts"
    yum install gtk2 

Optional packages: gtkhtml3 libXtst libxslt alsa-lib