Home
Welcome to the InformationExtraction wiki!
- Overview
- Architecture
- Process Flow
- Getting Started
- More Info
The project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:
- Ternary relation: T = (people, job, company)
- Input: a web page saved in HDFS, e.g. "John Smith is the CEO at Inc. Corp"
- Output: structured data in a data store, accessible via the UI, e.g. (John Smith, CEO, Inc. Corp)
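The ternary relation in the example could be modelled as a small Scala case class. This is only a minimal sketch: the names `Relation`, `toTuple`, and `RelationExample` are illustrative assumptions and are not taken from the project source.

```scala
// Hypothetical representation of the ternary relation T = (people, job, company).
// All names here are illustrative; they are not identifiers from the project.
final case class Relation(person: String, job: String, company: String) {
  // Render the relation in the tuple notation used in the example.
  def toTuple: String = s"($person, $job, $company)"
}

object RelationExample {
  // The relation that the example sentence should yield.
  val example: Relation = Relation("John Smith", "CEO", "Inc. Corp")

  def main(args: Array[String]): Unit =
    println(example.toTuple) // prints "(John Smith, CEO, Inc. Corp)"
}
```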
More specifically, it is a Relation Extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem in a discriminative framework.
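To make the classification framing concrete, here is a toy sketch: each candidate (people, job, company) triple is turned into features, and a linear model decides whether the relation holds. The feature set, the hand-set weights, and every name below are assumptions made for illustration; the project's actual features and classifier are not described on this page.

```scala
// A toy sketch of relation extraction as binary classification.
// All names, features, and weights below are illustrative assumptions.
object RelationClassifierSketch {
  // A candidate triple plus the tokens found between the person and company mentions.
  final case class Candidate(person: String, job: String, company: String, between: Seq[String])

  // Feature extraction: lower-cased words between the two mentions,
  // plus an indicator for whether the job title appears between them.
  def features(c: Candidate): Map[String, Double] =
    c.between.map(w => s"between=${w.toLowerCase}" -> 1.0).toMap +
      ("jobBetween" -> (if (c.between.contains(c.job)) 1.0 else 0.0))

  // Hand-set weights standing in for a trained discriminative model
  // (for example logistic regression); a real system would learn these.
  val weights: Map[String, Double] = Map(
    "between=is" -> 0.5,
    "between=at" -> 1.0,
    "jobBetween" -> 1.5
  )
  val bias: Double = -2.0

  // Linear score: a positive value is read as "the relation holds".
  def score(c: Candidate): Double =
    bias + features(c).map { case (name, value) => weights.getOrElse(name, 0.0) * value }.sum

  def holds(c: Candidate): Boolean = score(c) > 0.0
}
```

With these made-up weights, the candidate built from "John Smith is the CEO at Inc. Corp" (tokens between the mentions: is, the, CEO, at) scores above zero, while a candidate whose mentions are separated only by an unrelated word does not.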
Required Env:
- Scala 2.10.4 + Spark 1.6
- Scala 2.11.8 + Spark 2.0 (if using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library)
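For readers who prefer a build tool over adding jars by hand, the Scala 2.11.8 + Spark 2.0 pairing might be expressed as an sbt fragment like the one below. This is an assumption-laden sketch: the page does not say the project uses sbt, and the project name and exact Spark patch version (2.0.0) are placeholders.

```scala
// build.sbt (hypothetical; the project may not use sbt at all)
name := "information-extraction-sketch" // placeholder project name

// Matches the "Scala 2.11.8 + Spark 2.0" pairing listed above.
scalaVersion := "2.11.8"

// "provided" assumes Spark is supplied by the cluster or the IDE at run time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"

// For the other pairing, use Scala 2.10.4 together with a Spark 1.6.x release.
```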
How to run:
- 1) Download the latest Stanford CoreNLP model jar.
- 2) Uncompress the downloaded jar file, and replace the edu folder in project ie with the extracted one.
- 3) Run SparkBatchTest, and input data file paths or sentences to test.
Relation Extraction has been studied for over two decades, and many toolsets are available:
- Stanford Relation Extraction: http://nlp.stanford.edu/software/relationExtractor.html
- mit-nlp/MITIE: https://github.com/mit-nlp/MITIE
- AlchemyAPI: http://www.alchemyapi.com/products/alchemylanguage/relation-extraction
- GATE: https://gate.ac.uk/ie/