Please check out our new paper on accurately predicting real-valued sentence specificity with no domain restrictions, presented at AAAI 2019:
Wei-Jen Ko, Greg Durrett and Junyi Jessy Li, Domain Agnostic Real-Valued Specificity Prediction, AAAI 2019
The new system is located in this repo: https://github.com/wjko2/Domain-Agnostic-Sentence-Specificity-Prediction
This system works well on non-news domains, such as Twitter, online reviews, and conversational data.
The models in this package are obtained using co-training as described in Li and Nenkova, Fast and Accurate Prediction of Sentence Specificity, AAAI 2015.
Speciteller is implemented using Python 2.7. It depends on the following packages:
- numpy
- liblinear (in particular, liblinearutil.py; be sure you have a liblinear.so.<x> file in its python/ directory. If not, type make in python/)
Word lexicons for the models are available for download here. Please note that these resources come with license(s). Decompress the tarball under this (i.e., the speciteller) directory.
Call:
$ python speciteller.py --inputfile inputfile --outputfile predfile
<inputfile> should consist of word-tokenized sentences, one sentence per line; <predfile> is the destination file to which Speciteller writes the specificity scores, one score per line in the same order as the sentences in <inputfile>.
- An optional argument is --write_all_preds. When flagged, this will generate two additional files: <predfile>.s (prediction from the shallow model) and <predfile>.w (prediction from the word representation model).
For example:
$ python speciteller.py --inputfile sents_test --outputfile test.probs
This will give you specificity scores for the two sentences in sents_test in test.probs.
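The same call can also be driven from a script. The following is a minimal sketch, not part of Speciteller: the helper names build_command, read_scores, and score_sentences are hypothetical, and it assumes the script is run from the speciteller directory with a python executable on the PATH that meets the requirements listed above.

```python
import io
import subprocess


def build_command(inputfile, outputfile, write_all_preds=False):
    """Assemble the documented speciteller.py command line as an argument list."""
    cmd = ["python", "speciteller.py",
           "--inputfile", inputfile,
           "--outputfile", outputfile]
    if write_all_preds:
        # Optional flag described above; also produces <predfile>.s and <predfile>.w
        cmd.append("--write_all_preds")
    return cmd


def read_scores(predfile):
    """Read one float score per line, skipping blank lines."""
    with io.open(predfile, encoding="utf-8") as handle:
        return [float(line) for line in handle if line.strip()]


def score_sentences(sentences, inputfile, outputfile):
    """Write tokenized sentences, run Speciteller once, and return the scores."""
    with io.open(inputfile, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(u"%s\n" % sentence)
    # Raises CalledProcessError if speciteller.py exits with a non-zero status
    subprocess.check_call(build_command(inputfile, outputfile))
    return read_scores(outputfile)
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.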
The scores range from 0 to 1, with 0 being most general and 1 being most specific.
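If binary labels are needed instead of real-valued scores, the scores can be thresholded. The sketch below is illustrative only: the function name to_labels and the label strings are hypothetical, and treating a score exactly at the cutoff as specific is an assumption, not something Speciteller defines.

```python
def to_labels(scores, cutoff=0.5):
    """Map each score in [0, 1] to 'specific' or 'general' using a fixed cutoff.

    Assumption: a score equal to the cutoff counts as 'specific'.
    """
    return ["specific" if score >= cutoff else "general" for score in scores]


labels = to_labels([0.1, 0.5, 0.93])
# labels == ['general', 'specific', 'specific']
```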
- It is best to word-tokenize your sentences. If you don't, you will still get a score, but a less accurate one (~4% less accurate if you translate the scores into labels with a cutoff at 0.5).
- Note that the word embedding file is a compressed ~190 MB .gz file. Each run of speciteller.py loads the file to generate features, so it is best to avoid loading it multiple times, or to modify predict.py and tailor it to your data loading needs.
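One way to avoid repeated loading without touching predict.py is to merge several sets of sentences into a single input file, run speciteller.py once, and split the scores afterwards. This is a rough sketch under that assumption; merge_batches and split_scores are hypothetical helpers, and they rely only on the documented behavior that scores are written one per line in input order.

```python
def merge_batches(batches):
    """Flatten several lists of sentences and record each list's length."""
    merged = [sentence for batch in batches for sentence in batch]
    sizes = [len(batch) for batch in batches]
    return merged, sizes


def split_scores(scores, sizes):
    """Cut a flat score list back into per-batch lists, preserving order."""
    if sum(sizes) != len(scores):
        raise ValueError("expected %d scores, got %d" % (sum(sizes), len(scores)))
    result, start = [], 0
    for size in sizes:
        result.append(scores[start:start + size])
        start += size
    return result
```

Write the merged sentences to a single <inputfile>, run speciteller.py on it, then pass the scores read from <predfile> together with the recorded sizes to split_scores.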
Please cite the following paper:
Junyi Jessy Li and Ani Nenkova. 2015. Fast and Accurate Prediction of Sentence Specificity. Twenty-Ninth Conference on Artificial Intelligence (AAAI). [bibtex]
Please send comments and feedback to Jessy Li.