Home
Welcome to the InformationExtraction wiki!
- Overview
- Architecture
- Process Flow
- Getting Started
- More Info
The project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:
- Ternary relation: T = (people, job, company)
- Input: Web page saved in HDFS, e.g. "John Smith is the CEO at Inc. Corp"
- Output: Structured data in data store and accessible via UI, e.g. (John Smith, CEO, Inc. Corp)
More specifically, it's a Relation Extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem (in a discriminative framework): given a candidate pair of entities in a sentence, decide whether the relation holds between them.
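To make the classification framing concrete, here is a minimal sketch. It is written in Python only for brevity, since the project itself targets Scala and Spark. The entity tags, the feature templates, the toy training sentences, and the choice of a perceptron as the discriminative classifier are all illustrative assumptions, not the project's actual code.

```python
from itertools import combinations

# Hypothetical pre-tagged sentence: (token, entity_type); "O" marks a
# non-entity token. In a real pipeline the tags would come from an NER step.
SENTENCE = [
    ("John_Smith", "PERSON"), ("is", "O"), ("the", "O"),
    ("CEO", "TITLE"), ("at", "O"), ("Inc._Corp", "ORGANIZATION"),
]


def candidate_pairs(tagged):
    """Return every pair of entity positions (i, j) with i < j."""
    positions = [i for i, (_, tag) in enumerate(tagged) if tag != "O"]
    return list(combinations(positions, 2))


def pair_features(tagged, i, j):
    """Sparse binary features describing one candidate entity pair."""
    features = {
        f"types={tagged[i][1]}|{tagged[j][1]}",
        f"distance={j - i}",
    }
    # Lower-cased tokens that appear between the two mentions.
    for token, _ in tagged[i + 1:j]:
        features.add("between=" + token.lower())
    return features


def train_perceptron(examples, epochs=10):
    """Train a linear classifier; each example is (features, label), label +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            score = sum(weights.get(f, 0.0) for f in features)
            if label * score <= 0:  # misclassified (or undecided): update
                for f in features:
                    weights[f] = weights.get(f, 0.0) + label
    return weights


def predict(weights, features):
    """True if the classifier believes the relation holds for this pair."""
    return sum(weights.get(f, 0.0) for f in features) > 0


# Toy, hand-labelled training data (hypothetical sentences).
POSITIVE = [("Jane_Doe", "PERSON"), ("is", "O"), ("the", "O"), ("CFO", "TITLE")]
NEGATIVE = [("Jane_Doe", "PERSON"), ("met", "O"), ("the", "O"), ("CFO", "TITLE")]
TRAINING = [
    (pair_features(POSITIVE, 0, 3), +1),
    (pair_features(NEGATIVE, 0, 3), -1),
]

weights = train_perceptron(TRAINING)
has_title_relation = predict(weights, pair_features(SENTENCE, 0, 3))
```

Each candidate pair becomes one classification instance, so the same feature function can feed any discriminative model; the perceptron here is only a stand-in that keeps the sketch dependency-free.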
Iterative tuning steps:
1) Do initial feature engineering (tokenization, lemmatization).
2) Use Stanford NER to do the parsing and entity extraction.
3) Co-reference resolution. Possible options are CoreNLP and the Stanford Relation Extractor.
4) Start relation extraction. We may need to perform three independent relation extractions: [name, title], [name, organization], [title, organization].
5) Feature extraction and transformation for relation extraction, since relation extraction can be regarded as a classification problem (whether there is a relationship in the sentence).
6) Generate the training dataset and validation dataset. (I'm not sure how yet; this is basically manual work…)
7) Conduct training.
8) Merge the binary relations into a 3-ary relation (by graph techniques).
9) Evaluate the result against the test dataset.
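The merge of binary relations into a 3-ary relation could be sketched as follows. The text above only says "graph techniques", so the specific method here is an assumption: entities are treated as graph nodes, each extracted binary relation as an undirected edge, and a triple is emitted only when all three pairwise edges exist (a triangle). The function name and the sample relations ("Jane Doe", "Other Co") are hypothetical, and Python is used only for brevity since the project itself targets Scala and Spark.

```python
from collections import defaultdict
from itertools import product


def merge_binary_relations(name_title, name_org, title_org):
    """Combine three lists of binary relations into (name, title, org) triples.

    A triple is kept only if the (name, title), (name, org), and
    (title, org) edges are all present, i.e. the three nodes form a triangle.
    """
    titles_of = defaultdict(set)
    orgs_of = defaultdict(set)
    for name, title in name_title:
        titles_of[name].add(title)
    for name, org in name_org:
        orgs_of[name].add(org)
    title_org_edges = set(title_org)

    triples = []
    for name in sorted(titles_of):
        titles = sorted(titles_of[name])
        orgs = sorted(orgs_of.get(name, ()))
        for title, org in product(titles, orgs):
            # Close the triangle: the title must also be linked to the org.
            if (title, org) in title_org_edges:
                triples.append((name, title, org))
    return triples


# Hypothetical outputs of the three binary extractors.
name_title = [("John Smith", "CEO"), ("Jane Doe", "CFO")]
name_org = [("John Smith", "Inc. Corp"), ("Jane Doe", "Other Co")]
title_org = [("CEO", "Inc. Corp")]

triples = merge_binary_relations(name_title, name_org, title_org)
```

Requiring all three edges is the strictest choice and favors precision. A looser variant could accept a triple when only two of the three edges are present, at the cost of more false positives.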
Required Env:
- Scala 2.10.4 + Spark 1.6
- Scala 2.11.8 + Spark 2.0

(If using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library.)

How to run:
1) Download the latest Stanford CoreNLP model from here.
2) Uncompress the downloaded jar file, and replace the edu folder in project ie.
3) Run SparkBatchTest, and input data file paths or sentences to test.
Relation Extraction problems have been investigated for over two decades, and many toolsets are available:
- Stanford Relation Extraction: http://nlp.stanford.edu/software/relationExtractor.html
- mit-nlp/MITIE: https://github.com/mit-nlp/MITIE
- Alchemyapi: http://www.alchemyapi.com/products/alchemylanguage/relation-extraction
- GATE: https://gate.ac.uk/ie/