
trec_news

For the TREC News Track.

Table of Contents

Official Documents

  1. Official Website
  2. Guideline
  3. Google Group

Collection

Washington Post.

Data Locations

All the data is located at /infolab/node4/lukuang/trec_news/data/washington_post/.

Data Related to the Collection (Washington Post)

The data are in the sub-directory washington_post:

  1. original collection: WashingtonPost and the original downloaded file wapo.tar.gz

Note that I use the Indri toolkit for building the indexes and performing retrieval. Therefore, the original data were parsed into the "trec text" files so that Indri can use them, and the indexes were also generated with Indri. The resulting files and directories are:

  1. trec format text, where each trec document is an article: trec_text. There are four data fields: publish date, document id (docno), tt, and body. The first two are from the original collection; 'tt' is the title of the document, while 'body' is generated by merging the text in the 'content' field (a sketch of this conversion follows this list).
  2. trec format text, where each trec document is a paragraph of an article: trec_text_paragraph. It contains three fields: document id (docno), publish date, and text. The document id here is the combination of the document id of the article and the id of the paragraph (for example, in an article with document id A, its first paragraph would have document id A-1). The text field is the text of the paragraph.
  3. Indri index for articles: index and the Indri parameter file for it: index.para
  4. Indri index for paragraphs: paragraph_index and the Indri parameter file for it: paragraph_index.para
  5. Directory for queries: queries (it contains only a background linking query file I manually created according to the examples and format described in the guideline)
  6. Entity annotations for each paragraph: paragraph_entities (the generation process is ongoing). I used DBpedia Spotlight for the annotation.
  7. Some tests of the Indri index: date_test. You can ignore it.
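
To make the conversion in item 1 concrete, here is a minimal sketch of turning one Washington Post JSON article into a trec text document with the fields described above. The input field names (id, title, published_date, contents, content) and the exact tag layout are assumptions made for illustration; they are not copied from gene_trec_text.py.

```python
# Illustrative sketch only: the JSON field names and the tag layout below are
# assumptions, not the exact output of gene_trec_text.py.
import json


def article_to_trec_text(json_line):
    article = json.loads(json_line)
    # Merge the text of all content blocks into a single body string.
    body_parts = []
    for block in article.get("contents") or []:
        text = block.get("content") if isinstance(block, dict) else None
        if isinstance(text, basestring):  # Python 2.7, as used in this repo
            body_parts.append(text)
    return (
        "<DOC>\n"
        "<DOCNO>%s</DOCNO>\n"
        "<DATE>%s</DATE>\n"
        "<tt>%s</tt>\n"
        "<body>%s</body>\n"
        "</DOC>\n"
    ) % (article["id"], article.get("published_date", ""),
         article.get("title", ""), " ".join(body_parts))
```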

Other Data

The data are in the sub-directory other:

  1. stopwords: the file stopwords.json contains stopwords.

DBpedia Spotlight

The source code of DBpedia Spotlight can be found in its repository. I used it for entity annotation. I downloaded the ".jar" file, and the code resides at /infolab/node4/lukuang/code/dbpedia-spotlight. It acts as a local server and receives REST API requests. To run the server:

../../code/java8/jre1.8.0_171/bin/java -jar dbpedia-spotlight-1.0.0.jar ./en http://headnode:2223/rest

You do not need to run it as I did; you just need to call the REST API. An example of how to do that can be found in the code at src/entity/dbpedia/dbpedia.py.
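
If you only need to call the REST API, a request along the lines of the sketch below should be enough. This is an illustration rather than the code used in this repo (that is src/entity/dbpedia/dbpedia.py); the confidence value is an arbitrary example.

```python
# Illustrative sketch of calling the local DBpedia Spotlight server started
# above. See src/entity/dbpedia/dbpedia.py for the code actually used here.
import requests

SPOTLIGHT_ANNOTATE_URL = "http://headnode:2223/rest/annotate"


def annotate(text, confidence=0.5):
    # confidence is an arbitrary example threshold
    response = requests.post(
        SPOTLIGHT_ANNOTATE_URL,
        data={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    return response.json().get("Resources", [])
```

Each element of the returned "Resources" list describes one annotated entity, with fields such as "@URI" and "@surfaceForm".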

Code

In order to run my code, there are other things you need to know/do:

  1. I used Python 2.7.
  2. I used a utility module that I wrote before. It can be found at https://github.com/lukuang/myUtility. Put it in your Python site-packages directory.
  3. You need to add the Indri binaries to your system PATH. The binaries you might need are IndriRunQuery, IndriBuildIndex, and dumpindex. They can be found at /usa/lukuang/usr/bin.
  4. If you ever need to run the DBpedia Spotlight server yourself, remember to use Java 8 instead of Java 9.

All the code I wrote for this track is located at /infolab/node4/lukuang/trec_news/trec_news/.

There are two parts to my code: Redis and Python. The Redis part serves as an in-memory backend data store, so the code can get data from Redis instead of reading it from disk, which speeds up the whole process. The other part is Python, which is the main part of the code: it handles data processing, basic retrieval, keyword extraction, etc.

Redis

The code for this part is in the redis directory:

  1. redis-config: configuration file for Redis
  2. redis-server: Redis server binary
  3. redis-cli: Redis client binary
  4. debug.sh: bash script for running the Redis client
  5. db: contains the database dump dump.rdb
  6. backup: backup of the dump. Currently unused.

To run the server, you only need to enter this directory and execute ./redis-server redis.conf. You can run the debug script to manually check the data in the database. Importing data into the database is handled by the Python code.

There are four tables in the Redis database now (a short example of reading them from Python follows the list):

  1. bl_query_db: this table contains the background linking queries. The keys are the query ids (i.e., the document id of the query article). The value of each query is a json string with the same structure as the values in result_doc_db.

  2. er_query_db: this table contains the entity ranking queries. The keys are the query ids (i.e., the document id of the query article). The value of each query is a json string similar to that in result_doc_db, with an extra field entities that contains the entities in the query article.

  3. collection_stats_db: this table contains some collection statistics. There are two types of stats now:

  • pwc: collection word probabilities. Its value is a hash whose keys are stemmed words and whose values are the probabilities
  • stopwords: the stopwords
  4. result_doc_db: this table contains the result documents of different queries. The keys are the query ids (i.e., the document id of the query article). There are only three example queries, which I took from the guideline. For each query, the value is a hash whose keys are the result document ids and whose values are json strings. You can find out the structure of the json string by looking at how I generate it in process_data/import_collection_stats.py.
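
As mentioned above, here is a minimal sketch of reading these tables from Python with the redis-py client. The host, port, database numbers, and query id are placeholders, and the sketch assumes each table is a separate Redis database number; the real table ids and connection information live in src/config/db.py.

```python
# Minimal sketch of reading the tables described above with redis-py.
# Host, port, database numbers, and the query id are placeholders; the real
# table ids and connection settings are stored in src/config/db.py.
import json

import redis

BL_QUERY_DB = 0          # placeholder database numbers
COLLECTION_STATS_DB = 1
RESULT_DOC_DB = 2

# bl_query_db: keys are query ids (the query article's document id),
# values are json strings describing the query article.
bl_queries = redis.StrictRedis(host="localhost", port=6379, db=BL_QUERY_DB)
raw = bl_queries.get("example_query_article_id")
query = json.loads(raw) if raw is not None else None

# collection_stats_db: "pwc" is a hash mapping stemmed words to their
# collection probabilities.
stats = redis.StrictRedis(host="localhost", port=6379, db=COLLECTION_STATS_DB)
pwc = stats.hgetall("pwc")

# result_doc_db: for each query id, a hash mapping result document ids to
# json strings describing the cleaned result documents.
result_docs = redis.StrictRedis(host="localhost", port=6379, db=RESULT_DOC_DB)
docs_for_query = result_docs.hgetall("example_query_article_id")
```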

Python

Utility Modules

I created some modules that can be used by other parts of the code. These modules reside in src:

  1. config: it serves as the data structure storing some configuration information. Currently, only db.py is used, which stores the table ids and the database information.
  2. data: it contains code for processing the queries (queries.py, unfinished since the queries have not been released yet), processing the articles in the original collection (articles.py), and processing the trec format documents generated from the original articles (trec_docs.py).
  3. entity: it contains entity annotation code. Currently there is only one implementation, which is DBpedia Spotlight (dbpedia/dbpedia.py).

Process Data

It is in the directory process_data:

  1. Processing original articles into trec format: gene_trec_text.py is used to generate one document per article, whereas gene_trec_text_for_paragraph is used to generate one document per paragraph.
  2. gene_index.py generates an index parameter file used to build the index (see the sketch below).
  3. import_collection_stats.py imports basic collection statistics from the index into the Redis database.
  4. gene_entity_annotation.py annotates entities for each trec document.
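
For reference, an Indri index parameter file of the kind gene_index.py produces (index.para, paragraph_index.para) generally looks like the sketch below, written here as a small script that writes the file and builds the index. The paths, memory setting, and stemmer choice are placeholders, not the exact contents of those files.

```python
# Rough sketch of generating an Indri index parameter file and building the
# index. Paths, memory setting, and stemmer choice are placeholders, not the
# exact contents of index.para or paragraph_index.para.
import subprocess

PARAM_TEMPLATE = """<parameters>
  <index>/path/to/index</index>
  <memory>2G</memory>
  <stemmer><name>krovetz</name></stemmer>
  <corpus>
    <path>/path/to/trec_text</path>
    <class>trectext</class>
  </corpus>
</parameters>
"""

with open("index.para", "w") as f:
    f.write(PARAM_TEMPLATE)

# IndriBuildIndex must be on your PATH (see the Code section above).
subprocess.check_call(["IndriBuildIndex", "index.para"])
```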

Some Experiments

The some_experiments directory contains code for some simple experiments that I ran.

  1. some_query_test.py: generates an Indri query file that uses each paragraph of a given target article as a query. There are two options for the content of the queries: all the text or just the named entities. You have to run the query file with Indri (IndriRunQuery) to get results.
  2. find_first_two.py: finds the first two results of each paragraph query.
  3. get_keyword_from_paragraphs.py: similar to some_query_test.py, but uses keywords of each paragraph as the content of the paragraph queries. Keywords are ranked by the probability P(d|w), where d is the paragraph and w is the word (see the sketch after this list).
  4. import_result_doc.py: from the result files of the queries, get the result document ids, get the cleaned version of the original articles (as defined in src/data/articles.py), and import them into the database.
  5. para_clustering.py: you can ignore this
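
To spell out the scoring idea in item 3: with a uniform prior over paragraphs, Bayes' rule gives P(d|w) proportional to P(w|d) / P(w), where P(w|d) can be estimated from term counts in the paragraph and P(w) from the collection probabilities (the pwc hash in the Redis collection_stats_db). The sketch below only illustrates the idea; it is not the exact computation in get_keyword_from_paragraphs.py, and it assumes the input terms are already stemmed and stopword-filtered.

```python
# Sketch of ranking a paragraph's words by P(d|w). Under a uniform prior on
# paragraphs, P(d|w) is proportional to P(w|d) / P(w), with
#   P(w|d) = count(w in d) / len(d)        (maximum-likelihood estimate)
#   P(w)   = collection probability of w   (e.g. the "pwc" hash in Redis).
# This only illustrates the idea; it is not the exact computation in
# get_keyword_from_paragraphs.py.
from collections import Counter


def rank_keywords(paragraph_terms, collection_prob, top_k=10):
    counts = Counter(paragraph_terms)
    total = float(len(paragraph_terms))
    scores = {}
    for word, count in counts.items():
        p_w_given_d = count / total
        p_w = collection_prob.get(word)
        if p_w:  # skip words with no (or zero) collection probability
            scores[word] = p_w_given_d / p_w
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```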
