This repository contains demo code for Jason Baldridge's Practical Sentiment Analysis Tutorial, given at the Sentiment Analysis Symposium on March 5, 2014.
NOTE: this is currently under construction! The notes below will be updated over the next two days.
This demo code was pulled from multiple sources, but primarily from the solutions key to a homework on building polarity classifiers for tweets. The original instructions may be found in the description of homework five for the @appliednlp course.
Do the following to build the code. Note: you'll need a Linux or Mac machine.
Clone the repository first (I'll make a direct download available soon). This will create a directory 'sastut' with all the code and data. I'll assume it is in your home directory.
$ cd ~/sastut
$ ./build compile
If this succeeds, you should be ready to go!
The data is in the sastut/data directory. Each dataset has its own subdirectory.
The IMDB data is not included, so you need to download it and unpack it in the data directory. Here's an easy way to do it if you have wget.
$ cd ~/sastut/data
$ wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
$ tar xvf aclImdb_v1.tar.gz
Here's how to run the super simple "good/bad" classifier and test it on the debate08 data:
$ bin/sastut exp --eval data/debate08/dev.xml -m simple
################################################################################
Evaluating data/debate08/dev.xml
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
       5      442        7 |  454 negative
       1      140        0 |  141 neutral
       0      182       18 |  200 positive
--------------------------
       6      764       25
negative  neutral positive
--------------------------------------------------------------------------------
20.50 Overall accuracy
--------------------------------------------------------------------------------
       P        R        F
   83.33     1.10     2.17  negative
   18.32    99.29    30.94  neutral
   72.00     9.00    16.00  positive
...................................
   57.89    36.46    16.37  Average
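The per-class scores follow directly from the confusion matrix above. As a quick sanity check (this snippet is not part of the repo), here is how precision, recall, and F1 for the negative class can be recomputed from those counts in Scala:

// Recompute the negative-class scores from the confusion matrix above.
// Rows are gold counts, columns are predicted counts.
val confusion = Array(
  Array(5, 442, 7),   // gold negative
  Array(1, 140, 0),   // gold neutral
  Array(0, 182, 18)   // gold positive
)
val i = 0                                    // index of the negative class
val predicted = confusion.map(_(i)).sum      // 6 items predicted negative
val gold      = confusion(i).sum             // 454 items actually negative
val correct   = confusion(i)(i)              // 5 correct negative predictions
val precision = 100.0 * correct / predicted  // 83.33
val recall    = 100.0 * correct / gold       // 1.10
val f1 = 2 * precision * recall / (precision + recall)  // 2.17
println(f"$precision%.2f $recall%.2f $f1%.2f negative")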
You can evaluate on multiple datasets by listing additional files after the --eval option.
$ bin/sastut exp --eval data/debate08/dev.xml data/hcr/dev.xml data/stanford/dev.xml -m simple
Here's how to train a logistic regression classifier (L2R_LR, i.e. L2-regularized logistic regression) on the Debate08 data and evaluate on the other sets:
$ bin/sastut exp --train data/debate08/train.xml --eval data/debate08/dev.xml data/hcr/dev.xml data/stanford/dev.xml -m L2R_LR
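As a generic illustration of what such a model does at prediction time (this is not the repo's code, and the weights are made up): each label gets one weight per feature, the tweet's feature values are dotted with those weights, and the scores are softmaxed into label probabilities.

// Toy logistic regression scoring with made-up weights.
val weights = Map(
  "positive" -> Map("great" -> 1.3, "sucks" -> -0.9),
  "negative" -> Map("great" -> -1.1, "sucks" -> 1.4),
  "neutral"  -> Map("great" -> -0.2, "sucks" -> -0.3)
)
val features = Map("great" -> 1.0)   // bag-of-words counts from one tweet

// Dot product of each label's weights with the features.
val scores = weights.map { case (label, w) =>
  label -> features.map { case (feat, value) => w.getOrElse(feat, 0.0) * value }.sum
}

// Softmax into probabilities and take the most probable label.
val exps  = scores.map { case (label, s) => label -> math.exp(s) }
val z     = exps.values.sum
val probs = exps.map { case (label, e) => label -> e / z }
println(probs.maxBy(_._2))   // (positive,0.76...)

Training (the --train option) is what fits those weights to the labeled tweets; evaluation only needs the scoring step above.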
Use the -x option to use the extended features.
$ bin/sastut exp --train data/debate08/train.xml --eval data/debate08/dev.xml data/hcr/dev.xml data/stanford/dev.xml -m L2R_LR -x
Tip: use the "-d" (or "--detailed") option to output a detailed breakdown of which items were correctly labeled and which were incorrectly labeled, separated out by kind of error.
Use "small_lexicon" to use the small lexicon example. To try different lexicons, you need to modify the class SmallLexiconClassifier in the file src/main/scala/sastut/Classifier.scala
and recompile. (I'll make this easier later today.)
For example, to use Bing Liu's lexicon, you'll need to comment out these lines:
val positive = Set("good","awesome","great","fantastic","wonderful")
val negative = Set("bad","terrible","worst","sucks","awful","dumb")
And uncomment the second of these two lines:
// Use Bing Liu's lexicon
//import Polarity._
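For concreteness, here is a rough sketch of the relevant part of SmallLexiconClassifier after making that change. This is a hypothetical reconstruction, not the repo's actual code (it assumes Polarity exposes positive and negative word sets built from Bing Liu's lexicon, and that classification just counts lexicon hits and takes the majority polarity):

// Hypothetical sketch only; the real class in Classifier.scala differs in detail.
class SmallLexiconClassifier {

  // The small hand-picked lexicon, now commented out:
  // val positive = Set("good","awesome","great","fantastic","wonderful")
  // val negative = Set("bad","terrible","worst","sucks","awful","dumb")

  // Use Bing Liu's lexicon instead (assumed to live in the Polarity object).
  import Polarity._

  // Count lexicon hits and predict the majority polarity.
  def apply(tokens: Seq[String]): String = {
    val posHits = tokens.count(positive)
    val negHits = tokens.count(negative)
    if (posHits > negHits) "positive"
    else if (negHits > posHits) "negative"
    else "neutral"
  }
}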
To run the small lexicon classifier on the IMDB data (this will take longer because there are many more documents):
$ bin/sastut exp --eval data/aclImdb/test/labeledBow.feat -m small_lexicon -f imdb
Train a logistic regression model using bag-of-words features:
$ bin/sastut exp --train data/aclImdb/train/labeledBow.feat --eval data/aclImdb/test/labeledBow.feat -m L2R_LR -f imdb
Note: this uses the pre-extracted features provided by Chris Potts and colleagues. We can also run on the raw text and use our own feature extraction code:
$ bin/sastut exp --train data/aclImdb/train/ --eval data/aclImdb/test/ -m L2R_LR -f imdb_raw
Use "-x" to use extended features.
Finally, you can give multiple training set files to the --train option:
$ bin/sastut exp --train data/hcr/train.xml data/debate08/train.xml --eval data/debate08/dev.xml data/hcr/dev.xml data/stanford/dev.xml -m L2R_LR
To induce a sentiment lexicon from the IMDB data using log-likelihood ratios:
$ bin/sastut run sastut.LLR data/aclImdb/
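The idea is to rank each word by how strongly its frequency skews toward positive versus negative reviews. The actual sastut.LLR code isn't reproduced here; the sketch below shows one simple variant (a smoothed log ratio of relative frequencies) just to illustrate the kind of scoring involved:

// Illustrative scoring only; a simplification of what sastut.LLR computes.
// Words with large positive scores lean positive; large negative scores lean negative.
def lexiconScores(posDocs: Seq[Seq[String]],
                  negDocs: Seq[Seq[String]]): Map[String, Double] = {
  def counts(docs: Seq[Seq[String]]) =
    docs.flatten.groupBy(identity).map { case (w, ws) => w -> ws.length.toDouble }
  val posCounts = counts(posDocs)
  val negCounts = counts(negDocs)
  val posTotal  = posCounts.values.sum
  val negTotal  = negCounts.values.sum
  val vocab     = posCounts.keySet ++ negCounts.keySet
  vocab.map { w =>
    // Add-one smoothing keeps unseen words from producing infinite ratios.
    val pPos = (posCounts.getOrElse(w, 0.0) + 1.0) / (posTotal + vocab.size)
    val pNeg = (negCounts.getOrElse(w, 0.0) + 1.0) / (negTotal + vocab.size)
    w -> math.log(pPos / pNeg)
  }.toMap
}

Sorting the vocabulary by this score from both ends gives candidate positive and negative lexicon entries.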
I'll provide more detail in this README later, but this is enough to start playing around.
There is a very simple OpenNLP pipeline that you can use to analyze raw texts. Run it as follows:
$ bin/sastut run sastut.EnglishPipeline data/vinken.txt
You can provide any text file as the argument to have it analyzed.
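For reference, a bare-bones OpenNLP pipeline in Scala looks roughly like the following. This is a generic sketch, not the repo's EnglishPipeline, and it assumes the pre-trained en-sent.bin and en-token.bin model files are available on disk:

import java.io.FileInputStream
import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

object MiniPipeline {
  def main(args: Array[String]): Unit = {
    val text = scala.io.Source.fromFile(args(0)).mkString

    // Load the pre-trained OpenNLP models (file locations are an assumption).
    val sentenceDetector =
      new SentenceDetectorME(new SentenceModel(new FileInputStream("en-sent.bin")))
    val tokenizer =
      new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))

    // Split the text into sentences, then tokenize each sentence.
    for (sentence <- sentenceDetector.sentDetect(text))
      println(tokenizer.tokenize(sentence).mkString(" "))
  }
}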