InformationExtraction

Documentation for the project is available in the project wiki.

How to build:

Download the Stanford CoreNLP model from here.
Put the downloaded model file into the lib folder.
Run "mvn clean package" in the project directory.
The deployment package should then be ready in the ie-dist folder.
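
The build steps above can be sketched as a small shell helper. This is only an illustration: the function names has_model_jar and build_project are hypothetical, and it assumes the downloaded model is a .jar file, which is how CoreNLP models are usually shipped.

```shell
#!/bin/sh
# Sketch of the build steps; load it from the project directory.
# Assumption: the model file placed in lib is a .jar.

# Return 0 if the given folder contains at least one .jar file.
has_model_jar() {
    for f in "$1"/*.jar; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Run the Maven build only when a model jar is present in lib.
# On success the deployment package is expected in ie-dist.
build_project() {
    if has_model_jar lib; then
        mvn clean package
    else
        echo "No model jar found in lib; download the model first." >&2
        return 1
    fi
}
```

With these functions loaded, calling build_project from the project directory runs the build, or stops early with a message if the model has not been copied into lib yet.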

How to run on a Spark cluster:

Complete the build process described above.
Copy the ie-dist folder to your Spark cluster.
RunSparkBatchDriver.sh starts the batch processing, where you can input sentences or (HDFS) file paths.
To run Relation Evaluation, refer to RunRelationEvaluation.sh; you may need to change the file locations in it to match your cluster settings. Before running the evaluation, copy the data folder to your cluster and upload it to HDFS (this only needs to be done once).
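
Since the data upload is a one-time step, it can be guarded so that repeated runs do not upload again. The sketch below is an assumption-laden example, not part of the project: the function name upload_data_once and the target path /user/ie/data are hypothetical, and it assumes the standard hdfs command-line client is on the PATH.

```shell
#!/bin/sh
# Sketch: upload the data folder to HDFS only if it is not there yet.
# Assumption: the `hdfs` CLI is available; override HDFS to use another client.
HDFS="${HDFS:-hdfs}"

# upload_data_once LOCAL_DIR HDFS_DIR
upload_data_once() {
    local_dir="$1"
    hdfs_dir="$2"
    # `dfs -test -d` succeeds when the HDFS directory already exists.
    if $HDFS dfs -test -d "$hdfs_dir"; then
        echo "already uploaded: $hdfs_dir"
    else
        # Create the parent directory, then copy the local folder.
        $HDFS dfs -mkdir -p "$(dirname "$hdfs_dir")" &&
            $HDFS dfs -put "$local_dir" "$hdfs_dir" &&
            echo "uploaded: $hdfs_dir"
    fi
}
```

For example, running upload_data_once data /user/ie/data on the cluster uploads the folder the first time and only reports that it is already present afterwards. The paths used in RunRelationEvaluation.sh would still need to point at the location you chose.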

Set up a development environment with IntelliJ:

Download the latest Stanford CoreNLP model from here.
Put the downloaded model file into the lib folder.
Open/import the project as a Maven project.
Add the lib folder to the project libraries, and click Build.

How to customize/extend:

Refer to config.properties for configuration changes, such as pipeline components, NER models, dictionaries and regex rules.
Customized training for NER and the Relation Extractor is supported by the com.intel.ie.training package.