helenlly edited this page Aug 18, 2016 · 32 revisions

Welcome to the InformationExtraction wiki!

Table of Contents

  • Overview
  • Architecture
  • System and Process Flow
  • Getting Started
  • More info

Overview

The project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:

  • Ternary relation: T = (people, job, company)
  • Input: Web page saved in HDFS, e.g. “John Smith is the CEO at Inc. Corp”
  • Output: Structured data in a data store, accessible via UI, e.g. (John Smith, CEO, Inc. Corp)
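
As a rough illustration of the input and output above, the ternary relation can be modelled as a small case class. The names `Relation`, `TripleSketch`, and `extract` are hypothetical and not taken from the project, and the single regular expression only covers sentences shaped like the example; it stands in for the real extraction pipeline.

```scala
// Hypothetical names throughout; this is an illustration, not the project's code.
case class Relation(person: String, job: String, company: String)

object TripleSketch {
  // Toy pattern for sentences shaped like "<person> is the <job> at <company>".
  private val Pattern = """^(.+?) is the (.+?) at (.+)$""".r

  // Returns the triple if the sentence fits the toy pattern, otherwise None.
  def extract(sentence: String): Option[Relation] = sentence.trim match {
    case Pattern(person, job, company) => Some(Relation(person, job, company))
    case _                             => None
  }
}
```

Calling `TripleSketch.extract("John Smith is the CEO at Inc. Corp")` on the example sentence yields the triple from the Output bullet; sentences that do not fit the pattern yield `None`.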

More specifically, this is a Relation Extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem in a discriminative framework.
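
To make the classification framing concrete, here is a minimal sketch, assuming each candidate (person, company) pair found in a sentence is turned into a feature vector and scored by a logistic model. The feature names, cue words, weights, and bias are all invented for illustration; in practice the weights would be learned from labelled data, and the project's actual features and classifier may differ.

```scala
// Hypothetical sketch: relation extraction as binary classification over candidate pairs.
// `between` holds the tokens found between the person mention and the company mention.
case class Candidate(person: String, company: String, between: Seq[String])

object ClassifierSketch {
  // Assumed toy cue words; a real system would use richer lexical and syntactic features.
  private val cues = Set("ceo", "cto", "president", "director")

  def features(c: Candidate): Map[String, Double] = {
    val words = c.between.map(_.toLowerCase)
    Map(
      "hasTitleCue" -> (if (words.exists(cues.contains)) 1.0 else 0.0),
      "hasAt"       -> (if (words.contains("at")) 1.0 else 0.0),
      "distance"    -> words.size.toDouble
    )
  }

  // Hand-set weights standing in for a trained discriminative model.
  val weights: Map[String, Double] =
    Map("hasTitleCue" -> 2.0, "hasAt" -> 1.0, "distance" -> -0.2)
  val bias: Double = -1.5

  // Linear score: bias plus the weighted sum of the features.
  def score(c: Candidate): Double =
    bias + features(c).toSeq.map { case (k, v) => weights.getOrElse(k, 0.0) * v }.sum

  // Logistic link turns the score into P(relation holds | candidate).
  def probability(c: Candidate): Double = 1.0 / (1.0 + math.exp(-score(c)))

  def isRelation(c: Candidate): Boolean = probability(c) >= 0.5
}
```

With these made-up weights, the candidate built from the example sentence (tokens "is", "the", "CEO", "at" between the two mentions) is accepted, while a pair separated by unrelated words such as "walked past the offices of" is rejected.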


Architecture


System and Process Flow


Getting Started

Required environment (either of the following):

  • Scala 2.10.4 + Spark 1.6

  • Scala 2.11.8 + Spark 2.0 (if using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library)

How to run:

  1) Download the latest Stanford CoreNLP model from here.

  2) Uncompress the downloaded jar file, and replace the edu folder in the ie project with the extracted one.

  3) Run SparkBatchTest, and input data file paths or sentences to test.


More info

Relation Extraction has been studied for more than two decades, and many toolsets are available:
