Tutorial

Prerequisites

First install and configure Hadoop following Hadoop’s documentation.

Add the following property to hdfs-site.xml:

<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
</property>

This prevents _logs directories from being generated within the output of the Behemoth jobs.

You will also need to download and compile Behemoth following these instructions.

For Hadoop 0.20.x you may need to specify in hadoop-env.sh:

export HADOOP_OPTS="-server -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"

Note that Behemoth relies on Hadoop’s DistributedCache and won’t run in local mode.

We will need to generate job files for the Tika, GATE and UIMA modules. Behemoth is compiled with Maven, which generates a file behemoth-.job.jar in the /target directory of each module. For simplicity, make sure that the hadoop command is on your PATH and can be called from anywhere, then use the behemoth script in the root directory.
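
The PATH requirement can be checked before running any of the commands below. This is a minimal sketch; `require_cmd` is a hypothetical helper name used for illustration and is not part of Behemoth or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Behemoth or Hadoop): succeeds only if
# the given command can be resolved by the shell through PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH" >&2
    return 1
}

# Report whether the hadoop launcher is reachable from the current directory.
if require_cmd hadoop; then
    echo "hadoop found at: $(command -v hadoop)"
else
    echo "add the Hadoop bin directory to PATH before continuing"
fi
```

If the check fails, the `./bin/hadoop` form used in the commands below still works when run from the Hadoop installation directory.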

Generating the corpus

The first step is to convert a set of documents to a Behemoth corpus:

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator \
  -i "path to corpus" -o "path for output file"

Use the --recurse option if you want CorpusGenerator to process the input path recursively, e.g.

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator \
  -i "path to corpus" -o "path for output file" --recurse

Note: this can be done using any of the job files generated above, as CorpusGenerator is implemented in the core module, which the gate, tika and uima modules all depend on. This means that you could call it using behemoth-core.jar, e.g.

./bin/hadoop jar ./behemoth-core*job.jar com.digitalpebble.behemoth.util.CorpusGenerator \
  -i "path to corpus" -o "path for output file"

In the rest of the document we will use the job files instead, as they contain the dependencies that the modules require. The call above works with behemoth-core.jar only because the core module does not have any dependencies (apart from the Hadoop libs).

The CorpusGenerator class generates a SequenceFile of BehemothDocuments, which will be the input of the subsequent processes. We can then have a look at the content of the SequenceFile using:

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusReader \
  -i "path to previous output file"

This command displays all the BehemothDocuments in the sequence file, e.g.

url: file:/localPath/corpus/somedocument.rtf
contentType: 
metadata: null
Annotations:

You can also use the parameter -showBinaryContent

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusReader \
  -i "path to previous output file" -c

to display the first 200 characters of the byte content, which is not very useful for binary formats such as pdf or doc.

At this stage the text of the documents has not yet been extracted from the original format, the content type has not been identified and we don’t have any annotations for the documents.

Text Extraction and Mime-type identification with the Tika Module

The Tika module in Behemoth uses the Apache Tika library to extract the text from the documents in a Behemoth sequence file and identify their mime-type.

./bin/hadoop jar ./behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver \
  -i "path to previous output from the CorpusGenerator" -o "path to output file"
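
Each step consumes the output path of the previous one. The dry-run sketch below chains the corpus generation and Tika steps to make that explicit. The three paths and the `DRY_RUN` variable are placeholders introduced for illustration; only the class names and options come from the commands above.

```shell
#!/bin/sh
# Dry-run sketch of the first two steps; nothing is executed unless DRY_RUN=0.
# The three paths are placeholders, not values required by Behemoth.
INPUT="/local/docs"
CORPUS="/data/behemothcorpus"
TIKA_OUT="/data/behemothcorpus-tika"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch used only by this sketch

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"        # only show what would be run
    else
        "$@"               # run the command for real
    fi
}

# Step 1: wrap the raw documents into a SequenceFile of BehemothDocuments.
run hadoop jar ./behemoth-core-*-job.jar \
    com.digitalpebble.behemoth.util.CorpusGenerator \
    -i "$INPUT" -o "$CORPUS" --recurse

# Step 2: feed the output of step 1 to the Tika module.
run hadoop jar ./behemoth-tika-*-job.jar \
    com.digitalpebble.behemoth.tika.TikaDriver \
    -i "$CORPUS" -o "$TIKA_OUT"
```

Running the script with `DRY_RUN=0` executes the two jobs instead of printing them.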

Language Identification & Document Filtering on Language ID

The document language can be identified by running:

(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver \
  -i corpusTika -o corpusTika-lang

Having detected the language, one can filter on a specific language ID and discard the remainder:

(2) hadoop jar ./behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver \
  -D document.filter.md.keep.lang=en -i corpusTika-lang -o corpusTika-EN

Alternatively, skip only a specific language by running the same command with:

-D document.filter.md.skip.lang=en

Step (1) is optional, but it shows the distribution of languages in the corpus.

Document Filtering on Mime Type, URL… & intermediate document extraction

The core module allows the documents in the corpus to be filtered after the Tika step, based on regular expressions.
Documents whose mime type, URL… match the regular expression are written to a new output destination.

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter \
  -D document.filter.mimetype.keep=.+html.* -i tikaCorpus -o tikaCorpus-html

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter \
  -D document.filter.url.keep=.+333.* -i tikaCorpus -o tikaCorpus-333

./bin/hadoop jar ./behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter \
  -D document.filter.md.keep.label=contract -i textcorpusTika -o textcorpusTika-contracts

For this last filter it’s also possible to skip a label, by replacing “document.filter.md.keep.label” with “document.filter.md.skip.label”.
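
To get a feel for which values a pattern such as `.+html.*` selects, it can be tried locally before launching a job. The sketch below assumes the pattern has to match the whole value, which the leading `.+` and trailing `.*` in the examples suggest but which is not stated here. `full_match` is a hypothetical helper, and grep’s extended regular expressions stand in for Java’s, which agree for simple patterns like these.

```shell
#!/bin/sh
# Sketch only: emulates a whole-string regex match with grep -x.
# Assumption: Behemoth applies the pattern to the entire value.
full_match() {
    # $1 = value (e.g. a mime type or URL), $2 = extended regular expression
    printf '%s\n' "$1" | grep -Eqx -- "$2"
}

# Try the mime type pattern from the first CorpusFilter command.
for mime in text/html application/xhtml+xml text/plain; do
    if full_match "$mime" '.+html.*'; then
        echo "keep $mime"
    else
        echo "drop $mime"
    fi
done
```

Under this assumption, `text/html` and `application/xhtml+xml` are kept and `text/plain` is dropped.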

If you apply more than one filter you can control how they are combined with either

document.filter.md.mode=OR

or

document.filter.md.mode=AND

where ‘OR’ keeps or skips the document if any filter matches, and ‘AND’ only if all filters match.
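
The difference between the two modes can be sketched as follows. This illustrates the combination logic only and is not Behemoth code; `combine` is a hypothetical function that takes the mode followed by one result per filter (1 = matched, 0 = did not match).

```shell
#!/bin/sh
# Sketch of the combination logic only; not Behemoth code.
# $1 = mode (OR or AND); remaining args = per-filter results (1 or 0).
# Returns success when the combined filter applies to the document.
combine() {
    mode=$1
    shift
    matched=0
    total=0
    for r in "$@"; do
        total=$((total + 1))
        if [ "$r" -eq 1 ]; then
            matched=$((matched + 1))
        fi
    done
    case $mode in
        OR)  [ "$matched" -ge 1 ] ;;
        AND) [ "$total" -gt 0 ] && [ "$matched" -eq "$total" ] ;;
        *)   echo "unknown mode: $mode" >&2; return 2 ;;
    esac
}

# A document matching only the first of two filters:
combine OR 1 0 && echo "OR: filter applies"
combine AND 1 0 || echo "AND: filter does not apply"
```

Whether "applies" means the document is kept or skipped depends on whether the keep or the skip variant of the property was used.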

At any intermediate stage, the documents in a sequence file (e.g. after filtering) can be extracted for inspection:

hadoop jar ./behemoth-core*job.jar com.digitalpebble.behemoth.util.ContentExtractor -i seq-directory -o seqdirectory-output

Processing with GATE

The zipped GATE application must first be pushed onto the distributed filesystem with:

./bin/hadoop fs -copyFromLocal /mylocalpath/ANNIE.zip /apps/ANNIE.zip

If you haven’t done so already, create a file behemoth-site.xml in your Hadoop/conf directory and add the following properties:

<property>
  <name>gate.annotationset.input</name>
  <value></value>
  <description>Map the information in the Behemoth format onto the selected AnnotationSet
  </description>
</property>
<property>
  <name>gate.annotationset.output</name>
  <value></value>
  <description>AnnotationSet to consider when serializing to the behemoth format
  </description>
</property>
<property>
  <name>gate.annotations.filter</name>
  <value>Token</value>
  <description>Annotations types to consider when serializing to the behemoth format, separated by commas 
  </description>
</property>
<property>
  <name>gate.features.filter</name>
  <value>Token.string</value>
  <description>if specified, only the feature listed for a type will be kept
  </description>
</property>
<property>
  <name>gate.emptyannotationset</name>
  <value>false</value>
  <description>if specified all the annotations in the Behemoth document will be deleted before
 processing with GATE </description>
</property>
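
Hadoop site files wrap their `<property>` elements in a single `<configuration>` root element. Assuming behemoth-site.xml follows the same convention (only the properties themselves are listed above), a skeleton could be created as sketched below. The scratch location and the `conf_file` variable are placeholders; the file would still need to be reviewed and copied into the Hadoop conf directory.

```shell
#!/bin/sh
# Sketch: write a skeleton behemoth-site.xml to a scratch location.
# conf_file is a placeholder path chosen for this example.
conf_file="${TMPDIR:-/tmp}/behemoth-site.xml"

if [ -e "$conf_file" ]; then
    # Never overwrite an existing configuration.
    echo "$conf_file already exists, leaving it untouched" >&2
else
    cat > "$conf_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>gate.annotations.filter</name>
    <value>Token</value>
  </property>
  <property>
    <name>gate.features.filter</name>
    <value>Token.string</value>
  </property>
</configuration>
EOF
fi

echo "skeleton available at $conf_file"
```

The remaining properties from the listing above go inside the same `<configuration>` element.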

You can then call:

./bin/hadoop jar ./behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver \
  "input path" "target output path" /apps/ANNIE.zip

e.g.

./bin/hadoop jar ./behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver \
  /data/behemothcorpus /data/behemothcorpus-2 /apps/ANNIE.zip

Processing with UIMA

The procedure is very similar for UIMA: first generate a job file for the UIMA module, then copy the pear file to HDFS:

./bin/hadoop fs -copyFromLocal /mylocalpath/WhitespaceTokenizer.pear /apps/WhitespaceTokenizer.pear

This time the parameters to specify in behemoth-site.xml are:

<property>
  <name>uima.annotations.filter</name>
  <value>org.apache.uima.TokenAnnotation,org.apache.uima.SentenceAnnotation</value>
  <description>Annotations types to consider when serializing to the behemoth format, separated by commas 
  </description>
</property>
<property>
  <name>uima.features.filter</name>
  <value>org.apache.uima.TokenAnnotation:posTag</value>
  <description>Feature names to consider when serializing to the behemoth format, separated by commas 
  </description>
</property>

./bin/hadoop jar ./behemoth-uima*job.jar com.digitalpebble.behemoth.uima.UIMADriver \
  /data/behemothcorpus /data/behemothcorpus-2 /apps/WhitespaceTokenizer.pear

Again, the content of the corpus can be checked with, e.g.:

./bin/hadoop jar ./behemoth-gate*job.jar com.digitalpebble.behemoth.util.CorpusReader -i /data/behemothcorpus-2 -a

Generating Mahout Vectors

Assuming that you’ve generated a job file for the Mahout module, you can then call:

./bin/hadoop jar ./behemoth-mahout.job com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth \
  -i "previous output" -o "target output path" --typeToken Token --featureName string

options:
“- wf tf” for creating term-frequency based weighting

“--namedVector” to be able to extract the cluster/document mapping at a later stage

The command above generates vectors from a Behemoth corpus by using the annotations of type Token and taking the value of their feature string, instead of relying on the Lucene analysers as Mahout’s SparseVectorsFromSequenceFiles does. This makes it possible to use any feature generated by a previous module (e.g. lemmas, POS tags, semantic features, …) as feature values for clustering / classification with Mahout.

Note: the command above works with Hadoop 0.21 only.
