SeaOfOcean edited this page Aug 19, 2016 · 32 revisions

Welcome to the InformationExtraction wiki!

Table of Contents

  • Overview
  • Architecture
  • System and Process Flow
  • Dataset
  • Model Training
  • Evaluation
  • Getting Started
  • More Info

Overview

This project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:

  • Ternary relation: T = (people, job, company)
  • Input: Web page saved in HDFS, e.g. “John Smith is the CEO at Inc. Corp”
  • Output: Structured data in a data store, accessible via UI, e.g. (John Smith, CEO, Inc. Corp)

More specifically, it is a relation extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem in a discriminative framework.
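To make the classification framing concrete, the sketch below enumerates candidate (people, job, company) triples from mentions that have already been recognized and keeps only those a classifier accepts. The names Triple, CandidateClassification, and extract are hypothetical and do not come from the project. The classifier is passed in as a plain Predicate because this page does not describe the actual model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical value class for one (people, job, company) triple.
final class Triple {
    final String person;
    final String job;
    final String company;

    Triple(String person, String job, String company) {
        this.person = person;
        this.job = job;
        this.company = company;
    }

    @Override
    public String toString() {
        return "(" + person + ", " + job + ", " + company + ")";
    }
}

final class CandidateClassification {
    // Build every combination of mentions as a candidate triple and keep
    // the candidates for which the classifier returns true.
    static List<Triple> extract(List<String> persons, List<String> jobs,
                                List<String> companies, Predicate<Triple> classifier) {
        List<Triple> accepted = new ArrayList<>();
        for (String p : persons) {
            for (String j : jobs) {
                for (String c : companies) {
                    Triple candidate = new Triple(p, j, c);
                    if (classifier.test(candidate)) {
                        accepted.add(candidate);
                    }
                }
            }
        }
        return accepted;
    }
}
```

With a placeholder classifier that accepts every candidate, the mentions from the example sentence yield the single triple (John Smith, CEO, Inc. Corp). A trained model would take the place of that placeholder.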


Architecture


System and Process Flow

System Flow:

Process Flow:


Dataset

Purpose

This dataset is built for model training and evaluation.

Dataset Crawler

The leadership board pages of more than 300 companies are crawled with the help of Jsoup.

The crawler script reads a list of company URLs and returns the cleaned page content, with each person's profile on a single line. For example:

Tim Cook, CEO
Angela Ahrendts, Senior Vice President Retail and Online Stores
Eddy Cue, Senior Vice President Internet Software and Services

The crawled raw pages are stored in the folder data/evaluation/web; each subfolder contains one company.

Manual Data Labelling

To evaluate NER performance, we need a set of ground-truth labels. Since no labeled data is available, some manual work is required. To speed up the labeling process, you can use LabelHelper.scala to produce automatically labeled results and then review them manually.

The entities Person, Title, and Department are labeled separately.

The Person is labeled as \t1[PersonName]\t

The Title is labeled as \t2[TitleName]\t

The Department is labeled as \t3[DepartmentName]\t

For example, the above data is labeled as follows:

\t1Tim Cook\t, \t2CEO\t
\t1Angela Ahrendts\t, \t2Senior Vice President\t \t3Retail and Online Stores\t
\t1Eddy Cue\t, \t2Senior Vice President\t \t3Internet Software and Services\t
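The label format can be reproduced with a small helper. This is a minimal sketch that assumes the \t shown above stands for a literal tab character; the class LabelFormat and its method wrap are hypothetical names and are not part of LabelHelper.scala.

```java
// Hypothetical helper that wraps a text span in the label markers described above.
final class LabelFormat {
    static final int PERSON = 1;
    static final int TITLE = 2;
    static final int DEPARTMENT = 3;

    // Returns TAB + label code + span + TAB (assumes \t means a real tab character).
    static String wrap(int code, String span) {
        if (code < PERSON || code > DEPARTMENT) {
            throw new IllegalArgumentException("Unknown label code: " + code);
        }
        return "\t" + code + span + "\t";
    }
}
```

For instance, LabelFormat.wrap(LabelFormat.PERSON, "Tim Cook") + ", " + LabelFormat.wrap(LabelFormat.TITLE, "CEO") rebuilds the first labeled line of the example.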

Manually labeled pages are put in the folder data/evaluation/maunal; each subfolder contains one company.

Extract labeled relations

Once the data is labeled, TabConverter.scala parses the labeled files and extracts a list of relations for each page. The results are stored in data/evaluation/extraction.
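The parsing step can be sketched as follows. This is an illustration only and not the code of TabConverter.scala: it assumes the markers use literal tab characters, that each labeled line describes at most one person, and that a missing entity is represented by an empty string. LabeledRelation and LabeledLineParser are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical container for the entities found on one labeled line.
final class LabeledRelation {
    final String person;
    final String title;
    final String department; // empty when the line has no department label

    LabeledRelation(String person, String title, String department) {
        this.person = person;
        this.title = title;
        this.department = department;
    }

    @Override
    public String toString() {
        return "(" + person + ", " + title + ", " + department + ")";
    }
}

final class LabeledLineParser {
    // TAB, a label code 1-3, the labeled text, TAB.
    private static final Pattern SPAN = Pattern.compile("\t([123])([^\t]+)\t");

    static LabeledRelation parseLine(String line) {
        String[] slots = {"", "", ""}; // person, title, department
        Matcher m = SPAN.matcher(line);
        while (m.find()) {
            int index = m.group(1).charAt(0) - '1';
            slots[index] = m.group(2).trim(); // the last span of each type wins
        }
        return new LabeledRelation(slots[0], slots[1], slots[2]);
    }

    // Collects one relation per line, skipping lines without a labeled person.
    static List<LabeledRelation> parsePage(List<String> lines) {
        List<LabeledRelation> relations = new ArrayList<>();
        for (String line : lines) {
            LabeledRelation relation = parseLine(line);
            if (!relation.person.isEmpty()) {
                relations.add(relation);
            }
        }
        return relations;
    }
}
```

Skipping lines without a person is a design choice of this sketch; the real converter may treat such lines differently.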

Model Training

Train an NER classifier

The training data is put in the folder data/NERDepartment. The folder contains the data files and a properties file that specifies the meaning of the columns and which features to generate. The data files are tab-separated, with at minimum the word tokens in one column and the class labels in another. TrainModel.java (../blob/master/ie/src/main/java/com/intel/ie/TrainModel.java) parses the files and creates a new classifier. Training takes several minutes, and the new classifier is stored in model.
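As an illustration of the two inputs, the sketch below writes token/label pairs as tab-separated lines and builds a small set of properties. It assumes the properties file follows the Stanford NER conventions (keys such as trainFile, serializeTo, and map), which this page does not confirm. The file names train.tsv and model/ner-model.ser.gz, the background label O, and the class NerTrainingData are hypothetical.

```java
import java.util.List;
import java.util.Properties;

final class NerTrainingData {
    // One token per line: the word, a tab, then the class label.
    static String toTsv(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels differ in length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i)).append('\t').append(labels.get(i)).append('\n');
        }
        return sb.toString();
    }

    // Assumed Stanford NER style properties; the file names are placeholders.
    static Properties exampleProperties() {
        Properties p = new Properties();
        p.setProperty("trainFile", "train.tsv");                // hypothetical path
        p.setProperty("serializeTo", "model/ner-model.ser.gz"); // hypothetical path
        p.setProperty("map", "word=0,answer=1");                // column 0 = token, column 1 = label
        p.setProperty("useWord", "true");                       // feature: the current word
        p.setProperty("usePrev", "true");                       // feature: the previous word
        p.setProperty("useNext", "true");                       // feature: the next word
        return p;
    }
}
```

The map entry is what ties the column layout of the data files to the trainer: here the first column holds the token and the second the class label.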


Evaluation

The evaluation system consists of two parts: NER evaluation and relation extraction evaluation.

Metric

The evaluation metrics are precision and recall.
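For reference, precision is the fraction of predicted items that are correct, and recall is the fraction of ground-truth items that were predicted. The sketch below computes both over sets of strings, assuming an item counts as correct only on an exact match. The matching rule used by the project's evaluation scripts is not stated here, and PrecisionRecall is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

final class PrecisionRecall {
    final double precision;
    final double recall;

    PrecisionRecall(double precision, double recall) {
        this.precision = precision;
        this.recall = recall;
    }

    // Exact-match scoring: true positives are items present in both sets.
    static PrecisionRecall of(Set<String> predicted, Set<String> gold) {
        Set<String> truePositives = new HashSet<>(predicted);
        truePositives.retainAll(gold);
        double precision = predicted.isEmpty() ? 0.0
                : (double) truePositives.size() / predicted.size();
        double recall = gold.isEmpty() ? 0.0
                : (double) truePositives.size() / gold.size();
        return new PrecisionRecall(precision, recall);
    }
}
```

Returning 0.0 for an empty set is a convention of this sketch that avoids division by zero; other conventions are possible.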

NER evaluation

NerEvaluation.scala evaluates the named entity recognition results. Before evaluating the NER model, you need to label your test data in a format similar to this.

You can follow the prompts of Label.scala to label your test file: type 1 for PERSON, 2 for TITLE, 3 for EMPLOYEE_OF, and c to cancel the last label and re-type it.

Relation extraction evaluation

RelationEvaluation.scala evaluates the relation extraction results. The dataset is described in the Dataset section.


Getting Started

Required environment:

  • Scala 2.10.4 + Spark 1.6, or

  • Scala 2.11.8 + Spark 2.0 (if using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library)

How to run:

  1) Download the latest Stanford CoreNLP model from here.

  2) Uncompress the downloaded jar file and use its contents to replace the edu folder in the project ie.

  3) Run SparkBatchTest and input data file paths or sentences to test.


More Info

Relation extraction problems have been investigated for over two decades, and many toolsets are available.
