README

QUICK START FOR IR00:
put the following in your .profile:

export LD_LIBRARY_PATH=/aut/proj/ir/wsaskew/System/local/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=/aut/proj/ir/wsaskew/System/local/lib:$LD_RUN_PATH
export MANPATH=/aut/proj/ir/wsaskew/System/local/man:$MANPATH
export PATH=/aut/proj/ir/wsaskew/System/local/bin:$PATH
export PYTHONPATH=/aut/proj/ir/wsaskew/System/pythonpath:$PYTHONPATH

then look at the example.py files provided to get started

REQUIREMENTS:
python2.6:
http://www.python.org/download/
libxml:
http://xmlsoft.org/downloads.html
libxslt:
http://xmlsoft.org/XSLT/downloads.html
lxml:
http://pypi.python.org/pypi/lxml/
antlr:
http://www.antlr.org/download/Python/
arff package:
http://www.mit.edu/~sav/arff/dist/

INSTALLATION:
copy the code directory to a location on your PYTHONPATH
example:
cp -R cooccurrence_similarity ~/python_code/
touch ~/python_code/__init__.py
export PYTHONPATH=~/python_code:$PYTHONPATH

PACKAGE DESCRIPTION:
The package allows corpora to be mined in order to calculate
relations between a set of target words.  The package creates ARFF
files which have four features which measure the contextual
relatedness of two words.  These ARFF files may be used with the WEKA
machine learning toolkit (http://www.cs.waikato.ac.nz/ml/weka/) for a
variety of machine learning tasks.

USAGE:
The main interface is the Experimenter class.  The Experimenter class
allows a user to generate ARFF files while varying a number of
parameters. The Experimenter class maintains consistency between uses,
and uses results from previous experiments when possible to avoid
redundant calculation.

An Experimenter instance requires only one argument to be constructed,
the path to a directory which will be used to store results of
experiments.

e = Experimenter('experiment_dir')

experiment_dir should be either a path to an empty directory, a path
to a non-existent file or directory (which will then be created) or a
path to a previously created experiment directory.  If a path to a
previously created experiment directory is provided, then the
experimenter instance returned will be identical to the experimenter
instance which last performed work on the directory, thus maintaining
consistency across multiple uses.

Next, one or more corpora must be indexed.  The method add_to_index
requires two arguments, and allows for three more.

e.add_to_index('corpus_dir', 'corpus_type', stop_file='stop_file',
tag_file='tag_file', synch_freq=10000)

corpus_dir and corpus_type are required.  corpus_dir should be a
directory of either HTML or XML files to be mined.  The directory
will be read recursively, so directory structure does not matter, as
long as the files to be indexed end in '.htm', '.php', '.html', or
'.xml'.  corpus_type must be either 'phpBB' or 'xml'.  More corpus
types may be supported in the future.  If corpus_type is phpBB, then
the files are treated as files generated by the popular phpBB forum
software.  If the type is xml, then 'tag_file' must be provided.  The
tag_file argument should be a path to a file which instructs the xml
parser how to parse the xml files in the corpus_dir.  An example tag
file looks like this:

TitleTag: ArticleTitle
DelimiatorTag: MedlineCitation
HeadingTag: MeshHeading
AbstractText

TitleTag specifies the xml tag which contains an article's title.
HeadingTag specifies an xml tag which holds interesting heading or
meta-information.
DelimiatorTag specifies a tag which separates documents from each
other if a single xml file holds multiple documents.
Tags which are not preceded by a label (such as AbstractText in the
above example) specify the location of text to be parsed out.
An arbitrary number of such tags may be specified.
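
As an illustration of the format only, here is a minimal sketch of how
a tag file like the one above could be read.  parse_tag_file is a
hypothetical helper written for this README, not the parser the
package actually uses.

```python
def parse_tag_file(lines):
    """Hypothetical helper: split tag-file lines into labeled tags
    and unlabeled text tags."""
    labeled = {}      # e.g. {'TitleTag': 'ArticleTitle'}
    text_tags = []    # unlabeled tags which hold text to be parsed out
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ':' in line:
            label, tag = line.split(':', 1)
            labeled[label.strip()] = tag.strip()
        else:
            text_tags.append(line)
    return labeled, text_tags


example = """TitleTag: ArticleTitle
DelimiatorTag: MedlineCitation
HeadingTag: MeshHeading
AbstractText"""

labeled, text_tags = parse_tag_file(example.splitlines())
```

Here labeled holds the three labeled tags and text_tags holds the
single unlabeled tag, AbstractText.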

The stop_file argument is optional but recommended.  stop_file should
be a path to a file containing a sequence of stop words to be removed
from the indexed text, separated by newlines.
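
The effect of a stop file can be sketched as follows.  The function
names are hypothetical, and the case-insensitive matching is an
assumption; the package may compare words differently.

```python
def load_stop_words(lines):
    """Hypothetical helper: one stop word per line, blank lines ignored."""
    return set(line.strip().lower() for line in lines if line.strip())


def remove_stop_words(tokens, stop_words):
    # Assumption: matching ignores case.
    return [t for t in tokens if t.lower() not in stop_words]


stop_words = load_stop_words(['the\n', 'of\n', 'and\n'])
filtered = remove_stop_words(['The', 'onset', 'of', 'diabetes'], stop_words)
```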

The synch_freq argument is optional, and defaults to a reasonable
value.  The argument controls how often the data structures involved
in the indexing task are wiped from memory and synced to disk.
Synchronization will occur after every synch_freq documents are
processed.  High values yield faster indexing and higher memory usage;
low values yield slower indexing and lower memory usage.
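
The trade-off can be seen in a rough sketch of periodic
synchronization.  This is not the package's indexing code:
index_documents and flush are hypothetical names, and a plain dict
stands in for the on-disk store.

```python
from collections import defaultdict


def flush(pending, store):
    """Merge in-memory counts into the store, then wipe them from memory."""
    for token, count in pending.items():
        store[token] = store.get(token, 0) + count
    pending.clear()


def index_documents(documents, store, synch_freq=10000):
    """Count tokens, flushing after every synch_freq documents.
    Returns the number of flushes performed."""
    pending = defaultdict(int)
    flushes = 0
    for n, tokens in enumerate(documents):
        for token in tokens:
            pending[token] += 1
        if (n + 1) % synch_freq == 0:
            flush(pending, store)
            flushes += 1
    if pending:
        # Write out whatever is left after the last full batch.
        flush(pending, store)
        flushes += 1
    return flushes


docs = [['a', 'b'], ['a'], ['b', 'c'], ['a']]
store = {}
flushes = index_documents(docs, store, synch_freq=2)
```

A larger synch_freq means fewer flushes (less time spent writing) but
a larger pending table held in memory between flushes.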

Once corpora have been indexed, experiments may be performed.

e.perform_experiment(target_file, synonym_file=None,
    window=50, pmi_threshold=25,
    relation_threshold=100, truth_DB=truth_file,
    truth_function='2_way_mild')

target_file should contain a series of newline-separated words between
which relations should be calculated.
synonym_file is optional.  A synonym file should contain entries of
the format:
synonym:target
and will cause all occurrences of synonym to be counted as occurrences
of target.
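
A minimal sketch of what the synonym mapping amounts to, assuming one
synonym:target entry per line.  parse_synonyms and normalize are
hypothetical names, not part of the package.

```python
def parse_synonyms(lines):
    """Hypothetical helper: build a synonym -> target mapping."""
    synonyms = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        synonym, target = line.split(':', 1)
        synonyms[synonym.strip()] = target.strip()
    return synonyms


def normalize(tokens, synonyms):
    # Every occurrence of a synonym is counted as its target.
    return [synonyms.get(t, t) for t in tokens]


synonyms = parse_synonyms(['flu:influenza', 'grippe:influenza'])
tokens = normalize(['flu', 'and', 'influenza'], synonyms)
```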

window, pmi_threshold and relation_threshold are variables which
influence how context similarity metrics are calculated.

window is the cooccurrence window.  A word must be within window
words of a target in either direction in order to be considered a
cooccurrence. 
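
The window rule can be sketched as below.  This is an illustration
under stated assumptions, not the package's implementation:
count_cooccurrences is a hypothetical name, and a word exactly window
positions away is assumed to count.

```python
from collections import defaultdict


def count_cooccurrences(tokens, targets, window=50):
    """For each target, count the words found within `window`
    positions on either side of each of its occurrences."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        if token not in targets:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[token][tokens[j]] += 1
    return counts


tokens = ['fever', 'often', 'accompanies', 'influenza', 'in', 'winter']
counts = count_cooccurrences(tokens, set(['influenza']), window=2)
```

With window=2, 'fever' is three words away from 'influenza' and so is
not counted as a cooccurrence.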

pmi_threshold controls whether a cooccurrence will be included when
measuring the relatedness of two target words.  The cooccurrence will
only contribute to the relation metric if the cooccurrence occurs at
least pmi_threshold times.  This value restricts infrequent
cooccurrences from affecting the final similarity measure between two
targets.
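
As described above, pmi_threshold acts as a minimum count.  A minimal
sketch, with a hypothetical function name and made-up counts:

```python
def filter_by_threshold(cooccurrence_counts, pmi_threshold=25):
    """Keep only cooccurrences seen at least pmi_threshold times."""
    kept = {}
    for word, count in cooccurrence_counts.items():
        if count >= pmi_threshold:
            kept[word] = count
    return kept


# Made-up counts for a single target, for illustration only.
kept = filter_by_threshold({'fever': 40, 'cough': 25, 'winter': 3})
```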

relation_threshold restricts the number of relation values calculated.
Relation values are only calculated between targets which share at
least relation_threshold cooccurrence words.
This prevents relation values from being calculated from a small
number of shared cooccurrences.  relation_threshold is a
confidence threshold which influences the number of instances which
will appear in the generated ARFF files.  If two targets do not share
enough cooccurrence words, then their relation will not be represented
in the generated ARFF file.
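
The filtering step can be sketched as follows.  related_pairs is a
hypothetical helper and the data is made up; the package's own
bookkeeping may differ.

```python
def related_pairs(context_words, relation_threshold=100):
    """context_words maps each target to its set of cooccurrence words.
    Return the target pairs which share enough cooccurrence words."""
    targets = sorted(context_words)
    pairs = []
    for i, a in enumerate(targets):
        for b in targets[i + 1:]:
            shared = context_words[a] & context_words[b]
            if len(shared) >= relation_threshold:
                pairs.append((a, b))
    return pairs


context_words = {
    'asthma': set(['cough', 'wheeze', 'inhaler']),
    'influenza': set(['cough', 'fever', 'wheeze']),
    'gout': set(['joint', 'cough']),
}
pairs = related_pairs(context_words, relation_threshold=2)
```

Only asthma and influenza share two words here, so only that pair
would appear as an instance.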

truth_DB is an optional argument which should provide a truth
value for the relation between the specified targets.  The truth_DB
should be a Berkeley DB with keys in the format
target_1,target_2 and Pearson correlation values as values.  If
no truth_DB is provided, then feature files without truth values are
generated.

truth_function controls how the Pearson correlation values from the
truth_DB are interpreted.
'2_way_mild' causes Pearson values > .1 to indicate a positive
relation
'2_way_strong' causes values > .3 to indicate a positive
relation
'3_way_mild' causes values > .1 to indicate a positive
relation, values < -.1 to indicate a negative relation, and values
in between to indicate independence
'3_way_strong' causes values > .3 to indicate a positive
relation, values < -.3 to indicate a negative relation, and values
in between to indicate independence
'5_way' causes values > .3 to indicate strong positive correlation,
values > .1 to indicate mild positive correlation, values < -.3
to indicate strong negative correlation, values < -.1 to indicate mild
negative correlation, and values between -.1 and .1 to indicate
independence.
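
The thresholds above can be summarized in a short sketch.
interpret_truth is a hypothetical function, and the label strings
(including 'not_positive' for the 2-way cases) are assumptions; the
class names actually written to the ARFF files may differ.

```python
def interpret_truth(pearson, truth_function='2_way_mild'):
    """Map a Pearson correlation value to a class label."""
    if truth_function == '5_way':
        if pearson > 0.3:
            return 'strong_positive'
        if pearson > 0.1:
            return 'mild_positive'
        if pearson < -0.3:
            return 'strong_negative'
        if pearson < -0.1:
            return 'mild_negative'
        return 'independent'

    cutoffs = {'2_way_mild': 0.1, '2_way_strong': 0.3,
               '3_way_mild': 0.1, '3_way_strong': 0.3}
    cutoff = cutoffs[truth_function]
    if pearson > cutoff:
        return 'positive'
    if truth_function.startswith('3_way'):
        if pearson < -cutoff:
            return 'negative'
        return 'independent'
    # Assumption: the 2-way functions lump everything else together.
    return 'not_positive'
```

For example, a value of 0.2 is 'positive' under '2_way_mild' but not
under '2_way_strong'.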

The ARFF files generated will have five features.  One of these
features is the disease pair represented by the instance.  This is a
feature that is useful for humans, but should probably not be used for
classification or training.  Using WEKA, the following command will
filter out all string features for training and classification.
Because the only string feature in the generated ARFF files is the
name of the instance, this will have the desired effect of only using
the proper features for training and classification:

java weka.classifiers.meta.FilteredClassifier \
  -F weka.filters.unsupervised.attribute.RemoveType \
  -W weka.classifiers.trees.J48 \
  -t train.arff -T test.arff -p 5

The option -F weka.filters.unsupervised.attribute.RemoveType removes
all string fields, and the only string field in the ARFF file holds
the disease names.  If you decide to use other string fields in the
ARFF file, you will need a more selective filter.

BUGS AND 'FEATURES':
none (yet)

SEE ALSO:
a few example files are provided and named example(#).py

You can generate documentation for any of the python modules 
with the pydoc command.

send bugs or complaints to: waltaskew@gmail.com