Home

Welcome to the InformationExtraction wiki!

Overview

The project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:

Ternary relation: T = (people, job, company)
Input: Web page saved in HDFS, e.g. “John Smith is the CEO at Inc. Corp”
Output: Structured data in a data store, accessible via UI, e.g. (John Smith, CEO, Inc. Corp)
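
To make the input/output shape concrete, here is a minimal Scala sketch of the ternary relation as a data type, with a toy pattern-based extractor for sentences shaped like the example above. The names `Relation` and `PatternExtractor` are hypothetical and do not come from the project; a single regular expression is only an illustration and is not how the project extracts relations.

```scala
// Hypothetical names throughout; a toy pattern, not the project's extractor.
case class Relation(person: String, job: String, company: String)

object PatternExtractor {
  // Matches sentences shaped like "<person> is the <job> at <company>",
  // with an optional trailing period.
  private val Pattern = """^(.+?) is the (.+?) at (.+?)\.?$""".r

  // Returns the extracted triple, or None when the sentence does not fit the pattern.
  def extract(sentence: String): Option[Relation] = sentence.trim match {
    case Pattern(person, job, company) => Some(Relation(person, job, company))
    case _                             => None
  }
}
```

For instance, `PatternExtractor.extract("John Smith is the CEO at Inc. Corp")` yields a `Relation` holding the three fields of the example triple, while a sentence that does not fit the pattern yields `None`.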

More specifically, it is a Relation Extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem in a discriminative framework.
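
The classification view can be sketched as follows: each candidate pair of mentions (a person and a company in the same sentence) becomes one instance, features are computed from its context, and a discriminative model scores whether the relation holds. The sketch below uses logistic regression only as one familiar discriminative model; the source does not say which classifier or features the project uses, and `Candidate`, `RelationClassifier`, the feature names, and the weights are all hypothetical.

```scala
// A candidate pair of entity mentions in one tokenized sentence (hypothetical type).
// personEnd is the index just past the person mention; companyStart is the index
// of the first token of the company mention.
case class Candidate(tokens: Vector[String], personEnd: Int, companyStart: Int)

object RelationClassifier {
  // Bag-of-words counts over the tokens between the two mentions.
  def features(c: Candidate): Map[String, Double] =
    c.tokens.slice(c.personEnd, c.companyStart)
      .map(t => "between=" + t.toLowerCase)
      .groupBy(identity)
      .map { case (name, occurrences) => name -> occurrences.size.toDouble }

  // Logistic model: P(relation | candidate) = sigmoid(w . x + b).
  def probability(c: Candidate, weights: Map[String, Double], bias: Double): Double = {
    val z = features(c).map { case (k, v) => weights.getOrElse(k, 0.0) * v }.sum + bias
    1.0 / (1.0 + math.exp(-z))
  }
}
```

With the example sentence tokenized as `Vector("John", "Smith", "is", "the", "CEO", "at", "Inc.", "Corp")`, `personEnd = 2`, and `companyStart = 6`, the between-tokens are "is the CEO at". Illustrative weights that reward `between=ceo` push the probability above 0.5. In practice the weights would be learned from labeled examples rather than set by hand.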

Architecture

System and Process Flow

Getting Started

Required Env:

Scala 2.10.4 + Spark 1.6, or
Scala 2.11.8 + Spark 2.0 (if using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library)

How to run:
 1) Download the latest Stanford CoreNLP model from here.
 2) Uncompress the downloaded jar file, and replace the edu folder in project ie with the extracted edu folder.
 3) Run SparkBatchTest, and input data file paths or sentences to test.

More info

Relation Extraction problems have been investigated for over two decades. Many toolsets are available:

Stanford Relation Extraction: http://nlp.stanford.edu/software/relationExtractor.html
mit-nlp/MITIE: https://github.com/mit-nlp/MITIE
AlchemyAPI: http://www.alchemyapi.com/products/alchemylanguage/relation-extraction
GATE: https://gate.ac.uk/ie/
