README.md
download_dataset.py
md5.csv
scrape_tweets.py

README.md

Dataset Information

All public datasets can be downloaded directly as zip archives.

Below we describe how to reproduce the datasets that are not publicly available:

1. TREC-NEWS

Corpus

Fill out the application to use the Washington Post (WaPo) Corpus: https://trec.nist.gov/data/wapost/
Loop through the contents of each document. For a single document, take all items with the paragraph subtype; extract the text from the HTML when the mime is text/html, or include the text directly when it is text/plain.
I used the html2text (https://pypi.org/project/html2text/) Python package to extract text from the HTML.
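
The paragraph extraction described above could look roughly like the sketch below. It assumes each document is a dict with a `contents` list whose items carry `subtype`, `mime` and `content` keys; the function names are hypothetical, and the standard-library fallback is only there so the sketch runs when html2text is not installed.

```python
from html.parser import HTMLParser

try:
    import html2text  # third-party package mentioned above
except ImportError:  # assumption: fall back to a plain tag stripper
    html2text = None


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Convert an HTML fragment to text, preferring html2text."""
    if html2text is not None:
        return html2text.html2text(html).strip()
    stripper = _TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts).strip()


def extract_paragraphs(doc):
    """Join the text of all paragraph items of one document."""
    paragraphs = []
    for item in doc.get("contents") or []:
        # Skip empty items and anything that is not a paragraph.
        if not item or item.get("subtype") != "paragraph":
            continue
        content = item.get("content") or ""
        mime = item.get("mime")
        if mime == "text/html":
            text = html_to_text(content)
        elif mime == "text/plain":
            text = content.strip()
        else:
            continue
        if text:
            paragraphs.append(text)
    return " ".join(paragraphs)
```

Note that html2text emits Markdown-style markup for links and emphasis, so the output is not purely plain text for every paragraph.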

Queries and Qrels

Download the background linking topics and qrels from the 2019 News Track: https://trec.nist.gov/data/news2019.html
We use the title of each topic document as the query for our experiments.

2. BioASQ

Corpus

Register yourself at BioASQ: http://www.bioasq.org/
Download documents from BioASQ task 9a (Training v.2020 ~ 14,913,939 docs) and extract the title and abstractText for each document.
A few documents are missing from this corpus but appear in the test qrels, so we add them manually.
Find these manual documents here: https://docs.google.com/spreadsheets/d/1GZghfN5RT8h01XzIlejuwhBIGe8f-VaGf-yGaq11U-k/edit#gid=2015463710
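
As a minimal sketch of the extraction step, the snippet below turns article records into a corpus keyed by document id. The `title` and `abstractText` fields come from the description above; the `pmid` key and the `title`/`text` output layout are assumptions, and both function names are hypothetical.

```python
def to_corpus_entry(article):
    """Map one article record to (doc_id, entry)."""
    doc_id = str(article["pmid"])  # assumption: records carry a "pmid" key
    entry = {
        "title": (article.get("title") or "").strip(),
        "text": (article.get("abstractText") or "").strip(),
    }
    return doc_id, entry


def build_corpus(articles):
    """Build {doc_id: {"title", "text"}} from an iterable of records."""
    corpus = {}
    for article in articles:
        doc_id, entry = to_corpus_entry(article)
        corpus[doc_id] = entry
    return corpus
```

The manually added documents can be passed through the same function once they are converted to records with the same keys.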

Queries and Qrels

Download the training and test datasets of BioASQ 8B, which were published in 2020.
For a given question, consider all documents with answers as relevant (binary label).
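
The binary labelling can be sketched as follows. This assumes each question record has `id`, `body` and `documents` keys, and that each entry of `documents` is a URL ending in the document id; these field names and the function name are assumptions, not confirmed by this README.

```python
def build_queries_and_qrels(questions):
    """Return (queries, qrels) with a relevance of 1 for every listed document."""
    queries, qrels = {}, {}
    for question in questions:
        qid = question["id"]
        queries[qid] = question["body"]
        # Assumption: the document id is the last path segment of the URL.
        qrels[qid] = {
            url.rstrip("/").rsplit("/", 1)[-1]: 1
            for url in question.get("documents", [])
        }
    return queries, qrels
```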

3. Robust04

Corpus

Fill out the application to use the TREC disks 4 and 5: https://trec.nist.gov/data/cd45/index.html
Download the disks, format them as ir_datasets expects, and load the preprocessed corpus: https://ir-datasets.com/trec-robust04.html#trec-robust04

Queries and Qrels

Download the queries and qrels from ir_datasets with the key trec-robust04: https://ir-datasets.com/trec-robust04.html#trec-robust04
For our experiments, we used the description field of each query for retrieval.
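
A rough sketch of picking the description field is shown below. It assumes query objects expose `query_id` and `description` attributes and qrel objects expose `query_id`, `doc_id` and `relevance`; the helper names are hypothetical, and the whitespace normalization is an addition of this sketch.

```python
def description_queries(queries):
    """Map query_id to its description, with whitespace collapsed."""
    return {q.query_id: " ".join(q.description.split()) for q in queries}


def qrels_to_dict(qrels):
    """Map query_id to {doc_id: relevance}."""
    result = {}
    for qrel in qrels:
        result.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return result
```

With ir_datasets installed and the disks in place, the inputs would come from `ir_datasets.load("trec-robust04")` through its `queries_iter()` and `qrels_iter()` methods.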

4. Signal-1M

Corpus

Scrape the tweets from Twitter yourself for the tweet IDs listed here: https://github.com/igorbrigadir/newsir16-data/tree/master/twitter/curated
I used the tweepy (https://www.tweepy.org/) Python library to scrape tweets. You can find the script here: scrape_tweets.py.
We preprocess the retrieved text by removing emojis and links from the original text. You can find the function implementations in scrape_tweets.py.
Remove tweets which are empty or do not contain any text.
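
For illustration, the cleaning and filtering steps might be sketched as below. This is not the code from scrape_tweets.py: the regular expressions, the emoji ranges and the function names are assumptions chosen for this example.

```python
import re

# Assumption: links are anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Assumption: a few common emoji and symbol ranges, not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\uFE0F]"
)


def clean_tweet(text):
    """Remove links and emojis, then collapse leftover whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub(" ", text)
    return " ".join(text.split())


def clean_and_filter(tweets):
    """Clean {tweet_id: text} and drop tweets left without any text."""
    cleaned = {}
    for tweet_id, text in tweets.items():
        text = clean_tweet(text or "")
        if text:
            cleaned[tweet_id] = text
    return cleaned
```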

Queries and Qrels

Sign up at the Signal1M website to download the qrels: https://research.signal-ai.com/datasets/signal1m-tweetir.html
Sign up at the Signal1M website to download the queries: https://research.signal-ai.com/datasets/signal1m.html
We use the title of each query for our experiments.
