Data and code for the TREC News Track
All the data are located at `/infolab/node4/lukuang/trec_news/data/washington_post/`.

The following data are in the sub-directory `washington_post`:

- Original collection: `WashingtonPost`, plus the original downloaded file `wapo.tar.gz`.
Note that I use the toolkit Indri for building the index and performing retrieval. Therefore, the original data were parsed into a form Indri can use, which results in the "trec text" files. The indexes are also generated using Indri. For more information, you can check out these links:
- Trec format text, where each trec document is an article: `trec_text`. There are four data fields: publish date, document id (docno), tt, and body. The first two are from the original collection; 'tt' is the title of the document, while 'body' is generated by merging the text in the 'content' field. (A rough sketch of this layout follows this list.)
- Trec format text, where each trec document is a paragraph of an article: `trec_text_paragraph`. It contains three fields: document id (docno), published date, and text. The document id here is the combination of the document id of the article and the id of the paragraph (for example, in a document with document id A, its first paragraph would have the document id A-1). The text field is the text of the paragraph.
- Indri index for articles: `index`, and the Indri parameter file for it: `index.para`.
- Indri index for paragraphs: `paragraph_index`, and the Indri parameter file for it: `paragraph_index.para`.
- Directory for queries: `queries` (there is only a query file for background linking that I manually created according to the examples and format described in the guideline).
- Entity annotations for each paragraph: `paragraph_entities` (the generation process is ongoing). I used DBpedia Spotlight for annotation.
- Some testing of the Indri index: `date_test`. You can ignore it.
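To make the field layout above concrete, here is a minimal sketch of how an article could be rendered in the trec text style, and how the per-paragraph version builds docnos like A-1. The tag names other than DOC/DOCNO are my shorthand for the fields listed above; the files in `trec_text` and `trec_text_paragraph` are the authoritative reference.

```python
# Sketch only: renders an article in the trec text style described above.
# Tag names besides DOC/DOCNO are assumptions; check the trec_text files.

def to_trec_text(docno, publish_date, title, paragraphs):
    """One <DOC> per article; 'body' merges the paragraph text from 'content'."""
    lines = [
        "<DOC>",
        "<DOCNO>%s</DOCNO>" % docno,
        "<DATE>%s</DATE>" % publish_date,
        "<TT>%s</TT>" % title,
        "<BODY>",
        "\n".join(paragraphs),
        "</BODY>",
        "</DOC>",
    ]
    return "\n".join(lines) + "\n"


def to_trec_text_paragraph(docno, publish_date, paragraphs):
    """One <DOC> per paragraph; the first paragraph of article A gets docno A-1."""
    docs = []
    for i, text in enumerate(paragraphs, start=1):
        docs.append("\n".join([
            "<DOC>",
            "<DOCNO>%s-%d</DOCNO>" % (docno, i),
            "<DATE>%s</DATE>" % publish_date,
            "<TEXT>",
            text,
            "</TEXT>",
            "</DOC>",
        ]) + "\n")
    return "".join(docs)
```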
The following data are in the sub-directory `other`:

- Stopwords: the `stopwords.json` file contains the stopword list.
The DBpedia Spotlight source code can be found here; I used it for entity annotation. I downloaded the ".jar" file, and the code resides at `/infolab/node4/lukuang/code/dbpedia-spotlight`. It acts as a local server and receives REST API requests. To run the server:

`../../code/java8/jre1.8.0_171/bin/java -jar dbpedia-spotlight-1.0.0.jar ./en http://headnode:2223/rest`

You do not need to run it, since I already run it. You just need to call the REST API; how to do that can be found in `src/entity/dbpedia/dbpedia.py`.
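For reference, a call to the Spotlight REST API looks roughly like the sketch below. It assumes the standard `/rest/annotate` endpoint on the server address shown above; the request logic actually used in this project is in `src/entity/dbpedia/dbpedia.py`.

```python
# Minimal sketch of calling the DBpedia Spotlight REST API (not the project's
# exact code; see src/entity/dbpedia/dbpedia.py for what is actually used).
import requests

SPOTLIGHT_URL = "http://headnode:2223/rest/annotate"  # server started as shown above

def annotate(text, confidence=0.5):
    """Return the list of entity resources Spotlight finds in `text`."""
    resp = requests.post(
        SPOTLIGHT_URL,
        data={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    # "Resources" is absent when no entity is found
    return resp.json().get("Resources", [])

if __name__ == "__main__":
    for r in annotate("The Washington Post covered the election in Virginia."):
        print(r["@URI"], r["@surfaceForm"])
```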
In order to run my code, there are some other things you need to know/do:

- I used Python 2.7.
- I used a utility module that I wrote before. It can be found at https://github.com/lukuang/myUtility. Put it in your Python site-packages.
- You need to add the Indri binaries to your system path. The binaries you might need are `IndriRunQuery`, `IndriBuildIndex`, and `dumpindex`. They can be found in `/usa/lukuang/usr/bin` (a minimal usage sketch follows this list).
- If you ever need to run the DBpedia Spotlight server yourself, remember to use Java 8 instead of 9.
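In case it helps, the sketch below shows the stock way to use these binaries once they are on your path: write a standard Indri query parameter file and pass it to `IndriRunQuery`. The parameter values and the index path here are only illustrative and are not copied from my scripts; the query files my code generates (e.g. in `some_experiments`) may differ in detail.

```python
# Minimal sketch: build a stock Indri query parameter file and run it with
# IndriRunQuery (assumes the binaries above are on your PATH).
import subprocess

INDEX = "/infolab/node4/lukuang/trec_news/data/washington_post/index"  # article index path above

def write_query_file(path, queries):
    """queries: list of (query_id, query_text) pairs."""
    with open(path, "w") as f:
        f.write("<parameters>\n")
        f.write("  <index>%s</index>\n" % INDEX)
        f.write("  <count>100</count>\n")
        f.write("  <trecFormat>true</trecFormat>\n")
        for qid, text in queries:
            f.write("  <query>\n")
            f.write("    <number>%s</number>\n" % qid)
            f.write("    <text>%s</text>\n" % text)
            f.write("  </query>\n")
        f.write("</parameters>\n")

write_query_file("example.param", [("1", "washington post background linking")])
# Results are printed to stdout in TREC run format because of <trecFormat>true</trecFormat>
subprocess.call(["IndriRunQuery", "example.param"])
```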
All the code I wrote for this track is located at `/infolab/node4/lukuang/trec_news/trec_news/`.

There are two parts to my code: redis and python. The redis part serves as the backend in-memory data storage, so the code can get data from redis instead of reading it from disk, which speeds up the whole process. The other part, python, is the main part of the code: data processing, basic retrieval, keyword extraction, etc.
The code for this part is in the `redis` directory:

- `redis-config`: configuration file of redis
- `redis-server`: redis server binary
- `redis-cli`: redis client binary
- `debug.sh`: bash script for running the redis client
- `db`: contains the database dump `dump.rdb`
- `backup`: backup of the dump. Currently unused.
You only need to run the server, by entering this directory and executing `./redis-server redis.conf`. You can run the debug script to manually check the data in the database. Importing data into the database is handled by the Python code.
There are four tables in the redis database now (a small reading sketch follows this list):

- `bl_query_db`: contains the queries for background linking. The keys are the query ids (i.e. the document id of the query article). The value of each query is a json string with the same structure as in `result_doc_db`.
- `er_query_db`: contains the queries for entity ranking. The keys are the query ids (i.e. the document id of the query article). The value of each query is a json string similar to that in `result_doc_db`, plus an extra field `entities` that contains the entities in the query article.
- `collection_stats_db`: contains some collection statistics. There are two types of stats now:
  - `pwc`: word collection probabilities. Its value is a hash whose keys are stemmed words and whose values are the probabilities.
  - `stopwords`: stopwords.
- `result_doc_db`: contains the result documents of the different queries. The keys are the query ids (i.e. the document id of the query article). There are only three example queries, taken from the guideline. For each query, the value is a hash keyed by the result document ids, and each hash value is a json string. You can find out the string's structure by looking at how I generate it in `process_data/import_collection_stats.py`.
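As an illustration of how these tables can be read, here is a minimal redis-py sketch. It assumes each table above lives in its own Redis logical database; the placeholder db indices, host, and port below are not the real values, which are defined in `src/config/db.py`.

```python
# Minimal sketch (redis-py), assuming each "table" above lives in its own Redis
# logical database; the real db indices, host, and port are in src/config/db.py,
# so the numbers below are placeholders, not the project's actual mapping.
import json
import redis

BL_QUERY_DB = 0    # placeholder index for bl_query_db
RESULT_DOC_DB = 1  # placeholder index for result_doc_db

bl_queries = redis.StrictRedis(host="localhost", port=6379, db=BL_QUERY_DB)
result_docs = redis.StrictRedis(host="localhost", port=6379, db=RESULT_DOC_DB)

for qid in bl_queries.keys("*"):
    query = json.loads(bl_queries.get(qid))   # query article as a json string
    docs = result_docs.hgetall(qid)           # hash: result docno -> json string
    print("%s: %d result documents" % (qid, len(docs)))
```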
I created some modules that can be used by other parts of the code. These modules reside in `src`:

- `config`: serves as the data structure storing some configuration information. Currently, only `db.py` is used, which stores the table ids and the database information.
- `data`: contains code for processing the queries (`queries.py`, unfinished since the queries have not been released yet), processing the articles in the original collection (`articles.py`), and processing the trec format documents generated from the original articles (`trec_docs.py`).
- `entity`: contains entity annotation code. Currently there is only one implementation, which is DBpedia Spotlight (`dbpedia/dbpedia.py`).
The data processing code is in the directory `process_data`:

1. Processing the original articles into trec format: `gene_trec_text.py` is used to generate one document per article, whereas `gene_trec_text_for_paragraph` is used to generate one document per paragraph.
2. `gene_index.py` generates an index parameter file that is used to build the index (a sketch of such a parameter file follows this list).
3. `import_collection_stats.py` imports basic collection statistics from the index into the redis database.
4. `gene_entity_annotation.py` annotates entities for each trec document.
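For reference, an IndriBuildIndex parameter file of the kind step 2 generates looks roughly like the sketch below. The memory setting and stemmer choice here are generic Indri defaults used for illustration; `index.para` and `paragraph_index.para` in the data directory are the authoritative versions.

```python
# Sketch of writing an IndriBuildIndex parameter file; the actual files used
# are index.para / paragraph_index.para, which may differ in memory, stemmer,
# and field settings.
PARAM_TEMPLATE = """<parameters>
  <index>%(index_path)s</index>
  <memory>2G</memory>
  <corpus>
    <path>%(trec_text_dir)s</path>
    <class>trectext</class>
  </corpus>
  <stemmer><name>krovetz</name></stemmer>
</parameters>
"""

def write_build_param(out_path, index_path, trec_text_dir):
    with open(out_path, "w") as f:
        f.write(PARAM_TEMPLATE % {"index_path": index_path,
                                  "trec_text_dir": trec_text_dir})

write_build_param("index.para.example",
                  "/infolab/node4/lukuang/trec_news/data/washington_post/index",
                  "/infolab/node4/lukuang/trec_news/data/washington_post/trec_text")
# Then: IndriBuildIndex index.para.example
```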
In the `some_experiments` directory is the code for some simple experiments that I ran:

- `some_query_test.py`: generates an Indri query file that uses each paragraph of a given target article as a query. There are two options for the content of the queries: all the text, or just the named entities. You have to run the query file using Indri to get results.
- `find_first_two.py`: finds the first two results of each paragraph query.
- `get_keyword_from_paragraphs.py`: similar to `some_query_test.py`, but uses keywords of each paragraph as the content of the paragraph queries. Keywords are generated based on the probability P(d|w), where d is the paragraph and w is the word (see the sketch after this list).
- `import_result_doc.py`: from the result files of the queries, gets the result document ids, gets the cleaned versions of the original articles (as defined in `src/data/articles.py`), and imports them into the database.
- `para_clustering.py`: you can ignore this.
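For what it is worth, one way to read the P(d|w) score above is via Bayes' rule: P(d|w) is proportional to P(w|d)P(d), so with a uniform prior over the paragraphs of an article a word is a good keyword for a paragraph when that paragraph accounts for most of the word's occurrences. The sketch below implements that reading; the actual estimation in `get_keyword_from_paragraphs.py` may smooth or filter differently.

```python
# Sketch of scoring keywords by P(d|w) over the paragraphs of one article,
# assuming a uniform prior over paragraphs; the real estimation in
# get_keyword_from_paragraphs.py may smooth or filter differently.
from collections import Counter

def keyword_scores(paragraphs):
    """paragraphs: list of token lists. Returns {paragraph index: {word: P(d|w)}}."""
    counts = [Counter(tokens) for tokens in paragraphs]
    lengths = [float(sum(c.values())) or 1.0 for c in counts]
    # P(w|d) for every paragraph
    p_w_given_d = [{w: c[w] / lengths[i] for w in c} for i, c in enumerate(counts)]
    scores = {}
    for i, dist in enumerate(p_w_given_d):
        scores[i] = {}
        for w in dist:
            denom = sum(other.get(w, 0.0) for other in p_w_given_d)
            scores[i][w] = dist[w] / denom  # P(d|w) with a uniform P(d)
    return scores

paras = [["election", "virginia", "results"],
         ["weather", "rain", "virginia"]]
for i, sc in keyword_scores(paras).items():
    top = sorted(sc.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print("%d: %s" % (i, top))
```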