TrainingKEA
The HIVE automatic metadata generation system is based on KEA++, an algorithm for controlled keyphrase indexing.
KEA is built on a Naïve Bayes classifier. Training data is used to build a statistical model that distinguishes positive examples from negative ones. The model is derived from real-world examples: a corpus of documents with controlled terms assigned by hand.
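To make the Naïve Bayes idea concrete, here is a minimal toy sketch in Python. It is not KEA's or HIVE's implementation: the two discretized features ("high"/"low" term weight, "early"/"late" first occurrence) and all function names are assumptions chosen only for illustration. Hand-labeled candidate phrases train the model, which then estimates the probability that a new candidate is a keyphrase.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count class and feature-value frequencies.

    examples: list of (features, is_keyphrase), where features is a tuple
    of discrete values. The feature set is hypothetical, not KEA's own.
    """
    class_counts = Counter(label for _, label in examples)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = {label: defaultdict(Counter) for label in class_counts}
    values_seen = defaultdict(set)
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen


def keyphrase_probability(model, features):
    """Return P(is_keyphrase | features) under the independence assumption."""
    class_counts, feature_counts, values_seen = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for i, value in enumerate(features):
            # Laplace (add-one) smoothing so unseen values never give zero.
            count = feature_counts[label][i][value] + 1
            score += math.log(count / (n + len(values_seen[i])))
        log_scores[label] = score
    # Normalize the log scores into a probability for the positive class.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights.get(True, 0.0) / sum(weights.values())


# Hand-labeled toy candidates: (term weight, first occurrence) -> keyphrase?
examples = [
    (("high", "early"), True),
    (("high", "early"), True),
    (("high", "late"), True),
    (("low", "late"), False),
    (("low", "late"), False),
    (("low", "early"), False),
    (("high", "late"), False),
]
model = train_naive_bayes(examples)
p_likely = keyphrase_probability(model, ("high", "early"))
p_unlikely = keyphrase_probability(model, ("low", "late"))
```

With this toy data, a candidate that looks like the positive examples scores higher than one that looks like the negative examples, which is why the quality of the hand-assigned terms matters.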
The following instructions are based on http://www.nzdl.org/Kea/Download/Kea-5.0-Readme.txt
Before HIVE can extract controlled keyphrases for new documents, a keyphrase extraction model must be built from a set of pre-classified documents. For a given controlled vocabulary, the training documents need to be copied to a central directory. For example, HIVE currently uses the following directory structure:
hive/
    conf/
        hive.properties
        vocabulary.properties
    vocabulary/
        vocabulary.rdf
        vocabularyKEA/
            train/
                training_file1.txt
                training_file1.key
- Download or convert the desired controlled vocabulary into SKOS RDF/XML format.
- Configure the vocabulary in HIVE.
- Identify a set of documents to train the keyphrase extractor. For an example, refer to the AGROVOC sample.
- Create a directory that will contain the documents used to train the keyphrase extractor (e.g., "train").
- Documents must be in plain text format. For PDFs, see the instructions for converting PDFs to text below.
- Place author or indexer-assigned terms into a separate ".key" file. For example, if the document is called "doc1.txt", the file would be called "doc1.key". Each keyphrase must be on a separate line.
- Initialize the HIVE vocabularies with the "train" option.
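Before running the "train" step, it can help to check that every training document has a matching ".key" file. The following is a minimal sketch, not part of HIVE; the function name `check_training_dir` and the UTF-8 encoding are assumptions.

```python
from pathlib import Path


def check_training_dir(train_dir):
    """Return a list of problems found in a KEA training directory.

    Hypothetical helper: checks that each .txt document has a .key file
    with at least one keyphrase, and that no .key file is orphaned.
    """
    train_dir = Path(train_dir)
    problems = []
    for txt in sorted(train_dir.glob("*.txt")):
        key = txt.with_suffix(".key")
        if not key.exists():
            problems.append(f"{txt.name}: missing {key.name}")
            continue
        # One keyphrase per line; blank lines are ignored. UTF-8 is assumed.
        lines = key.read_text(encoding="utf-8").splitlines()
        if not any(line.strip() for line in lines):
            problems.append(f"{key.name}: no keyphrases")
    for key in sorted(train_dir.glob("*.key")):
        if not key.with_suffix(".txt").exists():
            problems.append(f"{key.name}: no matching .txt document")
    return problems
```

An empty list means every document is paired with a non-empty ".key" file; otherwise each entry names the file to fix.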
To get good results, the input text for KEA must be as clean as possible. HTML tags and similar markup need to be removed from the input documents before the model is built and before keyphrases are extracted from new documents. Also make sure that you have enough documents for both the training and extraction phases. For training, at least 20-30 manually indexed documents are required. It is important that the manually assigned keyphrases in the ".key" files correspond to entries in the controlled vocabulary that you use.
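The two checks above (clean text, and keyphrases that match the vocabulary) can be sketched with the Python standard library. This is an illustration, not part of HIVE or KEA. It assumes that keyphrases are compared case-insensitively against `skos:prefLabel` values in the SKOS RDF/XML file, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SKOS_PREF_LABEL = "{http://www.w3.org/2004/02/skos/core#}prefLabel"


class _TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(text):
    """Remove tags and collapse whitespace.

    Text chunks are joined with a space, so a tag in the middle of a
    word would split that word.
    """
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def load_pref_labels(skos_xml):
    """Return the lowercased skos:prefLabel values found in SKOS RDF/XML."""
    root = ET.fromstring(skos_xml)
    return {
        el.text.strip().lower()
        for el in root.iter(SKOS_PREF_LABEL)
        if el.text and el.text.strip()
    }


def unmatched_keyphrases(key_text, labels):
    """List keyphrases (one per line) that are not in the label set."""
    phrases = (line.strip() for line in key_text.splitlines())
    return [p for p in phrases if p and p.lower() not in labels]
```

Any phrase returned by `unmatched_keyphrases` is a candidate for correction in the ".key" file before training.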
Sample files are available in the downloads section. For a more complete example, download and extract the AGROVOC sample and review the files in the "agrovocKEA/training" directory.
HIVE uses the Apache Tika toolkit for PDF conversion. To convert PDFs to text for use with HIVE, use the following:
java edu.unc.ils.mrc.hive.util.TextManager <path>