J.D. Power and Associates Sentiment Corpus Library
====================================================
code version: 1.0
code release date:
code repository:
JDPA Corpus: http://verbs.colorado.edu/jdpacorpus
contact: Gregory Ichneumon Brown (browngp (-at-) colorado edu)
===================================================
Licensing
This code is distributed under the licensing terms of the
JDPA Sentiment Corpus available at https://verbs.colorado.edu/jdpacorpus/JDPA-Sentiment-Corpus-Licence-ver-2009-12-17.pdf
To use this code be sure to sign and send in a license as described at
http://verbs.colorado.edu/jdpacorpus
The code is copyright J.D. Power and Associates and Gregory Ichneumon Brown, and
is provided as is. If you have useful changes/bugfixes then we would love contributions
to the library. Unfortunately I (Greg) am unsure whether we will ever be able to get JDPA
to change the license to a better open source license. But you're probably only using this
library if you agree to the corpus license anyways.
===================================================
I started this library while working for JDPA in 2010, and then expanded it for my
thesis. Because of that history, some pieces are not as clean as they could be
and may appear overly complex. Sorry.
Pretty much all code is written in Scala, so you should be able to use this library from any
JVM language, but I've only ever tested it by calling it from Scala.
-----------------------------------
Quick and Dirty How To
-----------------------------------
To load a file from the corpus:
XXX
To load a list of files from the corpus:
XXX
By default, the corpus documents are tokenized with the default OpenNLP tokenizer and sentence splitter,
followed by a number of regexps that I used for post-processing in my thesis. (You should seriously consider
changing this, and someday maybe the code should be modified to support an arbitrary tokenizer.)
Code for running the Stanford Parser is included (though you'll have to download the parser yourself - version XXX).
Based on my results, I would suggest using a different parser though; blog data is hard. :) But to run the Stanford Parser,
call:
XXX
The corpus documents get extracted into a number of data structures; see doc/apidoc or the code for a description. There
is also a somewhat out-of-date UML diagram at doc/datastructures.zargo
If you need code changes, better documentation, etc., feel free to contact me. I'm
currently in a "just get something online" mode, but could certainly put some more time into
cleaning up the code base if there is some demand.
-----------------------------------
Summary of Directory Structure
-----------------------------------
./
|-bin/ All scripts for running common pieces.
|---run.sh --run a class from command line with arguments (via ant)
|---run_scala.sh --run a class from command line with arguments (via scala command line)
|---classpath.sh --classpath used for run_scala.sh
|-build.xml main ant build script
|-classes/ .class files get compiled to here (DO NOT CHECK IN)
|-config/ files to control the system
|---science.properties --common control file
|---log4j.properties --logging control file
|-doc/
|---apidoc/ --scaladoc generated documentation
|jdpacorpus-lib.jar compiled library jar file for inclusion in other projects
|-log/ system log files get generated here (DO NOT CHECK IN)
|---<EXPNAME>.<process>.science.log --root level log (set <EXPNAME> name in science.properties)
|---<EXPNAME>.<process>.token.log --tokenizer specific log
| (<process> is set automatically in the top level main function
| to indicate what is being run, e.g. decode, train)
|-output/ Output files from experiment runs get output here (DO NOT CHECK IN)
|---<EXPNAME>/ --<EXPNAME> run results
|-----token/
|-src/ Main source tree
|---com/
|-----jdpa/
|-------mlg/
|---------science/
|-----------datastructures/ ----Document representation classes
|-----------readers/ ----Main Sentiment Corpus Reader, and a few misc file readers
|-----------tests/ ----Unit Test Suites (individual tests in same pkg as class being tested)
|-----------tokenize/ ----OpenNLP Tokenizer Wrapper
|-----------utils/ ----Logging, Timer, System parameters, other generic utilities
|-----------writers/ ----Writing data out to various formats
|-thirdparty/ Third Party Libraries (broken down by license type)
|---commercial/ --commercial/Apache/not restricted distribution licenses
|-----log4j/
|-----scalatest-1.0/
|-----slf4j-1.6.0/
|---scala/ --scala binaries for compiling
|-----scala-2.7.7.final/ ----scala 2.7.7 (hopefully no longer needed)
|-----scala-2.8.0.RC6/ ----scala 2.8 RC6 (should get upgraded soon to real 2.8 release)
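
As a small illustration of the log naming scheme in the tree above, the following shell sketch
builds the path of a log file from an experiment name and a process name. The helper name
log_path is hypothetical and not part of the library, and the values test-20100617 and decode
are only example inputs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the library): print the path of a log file,
# following the log/<EXPNAME>.<process>.<kind>.log pattern described above.
# <kind> is "science" for the root level log or "token" for the tokenizer log.
log_path() {
    expname=$1
    process=$2
    kind=${3:-science}   # default to the root level log
    printf 'log/%s.%s.%s.log\n' "$expname" "$process" "$kind"
}

# Example: locate the logs of a "decode" run of experiment test-20100617.
log_path test-20100617 decode          # root level log
log_path test-20100617 decode token    # tokenizer specific log
```

A typical use would be to follow a running experiment, for example:
    tail -f "$(log_path test-20100617 decode)"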
-----------------------------------
Compilation/Building/Setup
-----------------------------------
-Modify config/science.properties to give yourself an experiment name (EXPNAME=test-20100617)
-Compile with:> ant -f build/build.xml compile
(NOTE: if you run out of stack space you may need to define ANT_OPTS=-Xss2M to increase the stack size.)
-Make doc/apidoc:> ant -f build/build.xml apidoc
-Compile, generate apidoc, and create lib/science.jar:> ant -f build/build.xml
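
The setup steps above can be scripted. The sketch below is a minimal example under stated
assumptions: set_property is a hypothetical helper (not shipped with the library), and it assumes
that science.properties uses plain KEY=VALUE lines, as in the EXPNAME example. It works on a
scratch file so that it can be tried without touching config/.

```shell
#!/bin/sh
# Hypothetical helper (not part of the library): set KEY=VALUE in a properties
# file, replacing an existing "KEY=" line or appending one if KEY is absent.
# Assumes VALUE contains no '|', '&' or backslash characters, which are
# special in the sed replacement used below.
set_property() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # Rewrite through a temp file; avoids the non-portable "sed -i".
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
            mv "${file}.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstrate on a scratch file instead of the real config/science.properties.
scratch=$(mktemp)
printf 'EXPNAME=old-name\n' > "$scratch"
set_property "$scratch" EXPNAME test-20100617
cat "$scratch"
rm -f "$scratch"
```

Against the real configuration, the same call followed by a compile with a larger stack would be:
    set_property config/science.properties EXPNAME test-20100617
    ANT_OPTS=-Xss2M ant -f build/build.xml compile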
-----------------------------------
Run All Unit Tests
-----------------------------------
>ant -f build/build.xml test
------------------
Run Tokenization
------------------
Customize file locations with config/token.properties
> bin/run_scala.sh tokenize.Tokenize [args]