helenlly edited this page Aug 12, 2016 · 32 revisions

Welcome to the InformationExtraction wiki!

Table of Contents

  • Overview

Overview

The project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:

Ternary relation: T = (people, job, company)
Input: Web page saved in HDFS, e.g. “John Smith is the CEO at Inc. Corp”
Output: Structured data in a data store, accessible via a UI, e.g. (John Smith, CEO, Inc. Corp)
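
To make the input/output contract concrete, here is a minimal sketch of the ternary relation as a Scala case class, with a toy extractor for the example sentence. The names `JobTriple` and `TripleSketch` and the surface pattern are assumptions made for illustration; they are not taken from the project's code, which relies on NER and classification rather than a fixed pattern.

```scala
// Hypothetical names: JobTriple and TripleSketch are illustrative only.
final case class JobTriple(person: String, job: String, company: String)

object TripleSketch {
  // Toy surface pattern "<person> is the <job> at <company>".
  // It only covers the wiki's example sentence shape.
  private val IsTheAt = """(.+?) is the (.+?) at (.+)""".r

  // Returns the triple when the whole sentence matches the pattern, else None.
  def extract(sentence: String): Option[JobTriple] = sentence.trim match {
    case IsTheAt(person, job, company) => Some(JobTriple(person, job, company))
    case _                             => None
  }
}
```

For the example input, `TripleSketch.extract("John Smith is the CEO at Inc. Corp")` yields `Some(JobTriple("John Smith", "CEO", "Inc. Corp"))`; sentences that do not fit the pattern yield `None`.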

More specifically, this is a Relation Extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem in a discriminative framework.
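
One way to picture the classification framing: every pair of entity mentions in a sentence becomes a candidate instance, described by features, and a discriminative model scores whether the relation holds. The sketch below is only an assumption about how that could look. The `Mention` and `Instance` types, the "words between" features, and the logistic scoring function are hypothetical and not the project's actual feature set or model.

```scala
object CandidateSketch {
  // An entity mention: its type and the token span [start, end) it covers.
  final case class Mention(kind: String, start: Int, end: Int)

  // One classification instance: a mention pair plus its feature map.
  final case class Instance(left: Mention, right: Mention, features: Map[String, Double])

  // Pair every mention of kind `a` with every mention of kind `b`, and describe
  // each pair by the words between the two mentions and the token distance.
  def candidates(tokens: Vector[String], mentions: Seq[Mention], a: String, b: String): Seq[Instance] =
    for {
      m1 <- mentions if m1.kind == a
      m2 <- mentions if m2.kind == b
    } yield {
      val (first, second) = if (m1.start <= m2.start) (m1, m2) else (m2, m1)
      val between = tokens.slice(first.end, second.start)
      val wordFeatures = between.map(w => s"between=${w.toLowerCase}" -> 1.0).toMap
      Instance(m1, m2, wordFeatures + ("distance" -> between.size.toDouble))
    }

  // Linear score passed through a logistic function; in practice the weights
  // and bias would come from training, here they are supplied by the caller.
  def probability(features: Map[String, Double], weights: Map[String, Double], bias: Double): Double = {
    val z = bias + features.map { case (k, v) => weights.getOrElse(k, 0.0) * v }.sum
    1.0 / (1.0 + math.exp(-z))
  }
}
```

An instance would be labeled positive when `probability` exceeds a chosen threshold; the threshold and the weights are tuning decisions that this sketch leaves open.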

  1. Architecture

  2. System and Process Flow:
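
One stage of this flow combines three pairwise relations ([name, title], [name, organization], [title, organization]) into a single ternary relation. The sketch below shows one simple graph-style reading of that merge, assuming a triple is emitted only when all three edges exist between the same name, title, and organization (a triangle). `MergeSketch` is a hypothetical name, and the project may use a different graph technique.

```scala
object MergeSketch {
  // Each input set holds the pairs accepted by one binary relation extractor.
  // A triple (name, title, org) is kept only if all three pairwise edges exist.
  def merge(
      nameTitle: Set[(String, String)],
      nameOrg: Set[(String, String)],
      titleOrg: Set[(String, String)]): Set[(String, String, String)] =
    for {
      (name, title) <- nameTitle
      (n, org) <- nameOrg
      if n == name && titleOrg.contains((title, org))
    } yield (name, title, org)
}
```

Requiring all three edges is a strict choice that favors precision; accepting triples supported by only two edges would raise recall at the cost of more false positives.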

Iterative tuning steps

  • Do initial feature engineering (tokenization, lemmatization).
  • Use Stanford NER for parsing and entity extraction.
  • Resolve co-references. Possible options are CoreNLP and Relation Extraction from Stanford.
  • Start relation extraction. We may need to perform three independent relation extractions: [name, title], [name, organization], and [title, organization].
  • Extract and transform features for relation extraction, since relation extraction can be treated as a classification problem (whether a relationship holds in the sentence).
  • Generate the training and validation datasets. (How to do this is not settled yet; it is basically manual work.)
  • Conduct training.
  • Merge the binary relations into a 3-ary relation (by graph techniques).
  • Evaluate the result against the test dataset.

  4. Getting Started

Required Env:

  • Scala 2.10.4 + Spark 1.6
  • Scala 2.11.8 + Spark 2.0

(If using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library.)

How to run:

  1) Download the latest Stanford CoreNLP model from here.
  2) Uncompress the downloaded jar file and replace the edu folder in project ie.
  3) Run SparkBatchTest and input data file paths or sentences to test.

  5. More info

Relation Extraction problems have been investigated for over two decades. Many toolsets are available:
