Generally, all public datasets can be downloaded directly as zip folders.
Below we describe how to reproduce retrieval on the datasets that are not public:
- Fill out the application to use the Washington Post (WaPo) Corpus: https://trec.nist.gov/data/wapost/
- Loop through your contents. For a single document, get all the `paragraph` subtypes and extract the text from the HTML in case the mime is `text/html`, or directly include the text in case it is `text/plain`.
- I used the `html2text` (https://pypi.org/project/html2text/) Python package to extract text out of the HTML.
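The paragraph extraction described above can be sketched roughly as follows. This is a minimal sketch, not the original preprocessing script: the field names `contents`, `subtype`, `mime`, and `content` are assumptions about the WaPo JSON layout, the helper names are hypothetical, and a plain standard-library fallback is swapped in when `html2text` is not installed.

```python
from html.parser import HTMLParser

try:
    import html2text  # third-party package named in the steps above

    def html_to_text(html: str) -> str:
        """Convert an HTML fragment to text with html2text."""
        return html2text.html2text(html).strip()

except ImportError:

    class _TextCollector(HTMLParser):
        """Stand-in for html2text: keeps only the text nodes."""

        def __init__(self):
            super().__init__()
            self.parts = []

        def handle_data(self, data):
            self.parts.append(data)

    def html_to_text(html: str) -> str:
        """Fallback conversion that drops tags and keeps text."""
        parser = _TextCollector()
        parser.feed(html)
        parser.close()
        return "".join(parser.parts).strip()


def extract_paragraphs(doc: dict) -> str:
    """Join the text of all `paragraph` items of one document.

    Assumes each item in doc["contents"] is a dict with the keys
    "subtype", "mime" and "content" (an assumption about the corpus).
    """
    texts = []
    for item in doc.get("contents") or []:
        # Skip empty entries and anything that is not a paragraph.
        if not item or item.get("subtype") != "paragraph":
            continue
        content = item.get("content") or ""
        mime = item.get("mime")
        if mime == "text/html":
            texts.append(html_to_text(content))
        elif mime == "text/plain":
            texts.append(content.strip())
    return " ".join(t for t in texts if t)
```

Note that `html2text` emits Markdown-flavoured text (for example, links keep their URLs), so the two conversion paths can differ on richer HTML.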
- Download background linking topics and qrels from 2019 News Track: https://trec.nist.gov/data/news2019.html
- We consider the document title as the query for our experiments.
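Using the document title as the query could look like the sketch below. It assumes you have already parsed the topics into a mapping from topic id to the id of the topic's document, and that you have a mapping from document id to title; both mappings and the function name are hypothetical.

```python
def build_title_queries(topic_to_docid: dict, doc_titles: dict) -> dict:
    """Map each topic id to the title of the document it points to.

    Topics whose document has no usable title are skipped.
    """
    queries = {}
    for topic_id, doc_id in topic_to_docid.items():
        title = (doc_titles.get(doc_id) or "").strip()
        if title:
            queries[topic_id] = title
    return queries
```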
- Register yourself at BioASQ: http://www.bioasq.org/
- Download documents from BioASQ task 9a (Training v.2020 ~ 14,913,939 docs) and extract the title and abstractText for each document.
- A few documents are present in the test qrels but missing from this corpus, so we add them manually.
- Find these manual documents here: https://docs.google.com/spreadsheets/d/1GZghfN5RT8h01XzIlejuwhBIGe8f-VaGf-yGaq11U-k/edit#gid=2015463710
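Building the corpus from the downloaded articles plus the manually added ones might look like this. It is a minimal sketch: the `pmid` key and the output layout are assumptions, only `title` and `abstractText` come from the steps above, and the function name is hypothetical.

```python
from itertools import chain


def build_corpus(articles, manual_docs=()):
    """Collect title and abstract text per document id.

    `articles` and `manual_docs` are iterables of dicts. Each dict is
    assumed to carry "pmid", "title" and "abstractText" keys. Documents
    from `articles` win if an id appears in both sources.
    """
    corpus = {}
    for art in chain(articles, manual_docs):
        pmid = str(art["pmid"])
        if pmid in corpus:
            continue
        corpus[pmid] = {
            "title": (art.get("title") or "").strip(),
            "text": (art.get("abstractText") or "").strip(),
        }
    return corpus
```

Because the training file holds roughly 14.9 million documents, passing a generator that streams the articles is preferable to loading the whole file into memory first.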
- Download the training and test datasets from BioASQ 8B, which were published in 2020.
- Consider all documents with answers as relevant (binary label) for a given question.
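The binary labelling can be sketched as below. This assumes each question is a dict with an `id`, a `body` (the question text), and a `documents` list of PubMed URLs ending in the document id. That layout is an assumption about the downloaded JSON, and dropping questions without any document is a choice made for this sketch.

```python
def build_queries_and_qrels(questions):
    """Return (queries, qrels) with every listed document labelled 1.

    Assumes question["documents"] holds URLs whose last path segment
    is the document id. Questions without documents are left out.
    """
    queries, qrels = {}, {}
    for question in questions:
        doc_ids = {
            url.rstrip("/").rsplit("/", 1)[-1]
            for url in question.get("documents") or []
        }
        if not doc_ids:
            continue
        qid = question["id"]
        queries[qid] = question["body"].strip()
        qrels[qid] = {doc_id: 1 for doc_id in sorted(doc_ids)}
    return queries, qrels
```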
- Fill out the application to use the TREC disks 4 and 5: https://trec.nist.gov/data/cd45/index.html
- Download the disks, format them as `ir_datasets` expects, and get the preprocessed corpus: https://ir-datasets.com/trec-robust04.html#trec-robust04
- Download the queries and qrels from `ir_datasets` with the key `trec-robust04` here: https://ir-datasets.com/trec-robust04.html#trec-robust04
- For our experiments, we used the description of the query for retrieval.
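Loading the queries and qrels through `ir_datasets` might look like the following sketch. The helper names are hypothetical, and `load_robust04` only works once the `ir_datasets` package is installed and the TREC disks are placed where it expects them.

```python
def load_robust04():
    """Return (queries, qrels) iterators for the key `trec-robust04`."""
    import ir_datasets  # third-party; imported lazily on purpose

    dataset = ir_datasets.load("trec-robust04")
    return dataset.queries_iter(), dataset.qrels_iter()


def build_queries(queries) -> dict:
    """Use the description field of each query as the query text."""
    # Descriptions may span several lines, so collapse the whitespace.
    return {q.query_id: " ".join(q.description.split()) for q in queries}


def build_qrels(qrels) -> dict:
    """Group relevance judgements by query id."""
    grouped = {}
    for rel in qrels:
        grouped.setdefault(rel.query_id, {})[rel.doc_id] = int(rel.relevance)
    return grouped
```

A typical call would be `queries_iter, qrels_iter = load_robust04()`, followed by `build_queries(queries_iter)` and `build_qrels(qrels_iter)`.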
- Scrape tweets from Twitter manually for the ids here: https://github.com/igorbrigadir/newsir16-data/tree/master/twitter/curated
- I used `tweepy` (https://www.tweepy.org/) from Python to scrape tweets. You can find the script here: scrape_tweets.py.
- We preprocess the retrieved text by removing emojis and links from the original text. You can find the function implementations in the script above.
- Remove tweets which are empty or do not contain any text.
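The cleaning steps above can be approximated with the sketch below. It is not the code from scrape_tweets.py: the regular expressions are assumptions that cover common emoji blocks and `http(s)`/`www` links only, and the function names are hypothetical.

```python
import re

# Links starting with http://, https:// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# A few common emoji blocks; this is not an exhaustive list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # emoji variation selector
    "]+"
)


def clean_tweet(text: str) -> str:
    """Remove links and emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())


def clean_tweets(tweets: dict) -> dict:
    """Clean every tweet and drop those left without any text."""
    cleaned = {tid: clean_tweet(text or "") for tid, text in tweets.items()}
    return {tid: text for tid, text in cleaned.items() if text}
```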
- Sign up at the Signal1M website to download the qrels: https://research.signal-ai.com/datasets/signal1m-tweetir.html
- Sign up at the Signal1M website to download the queries: https://research.signal-ai.com/datasets/signal1m.html
- We consider the title of the query for our experiments.