This document describes how to use Anserini to index the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI.
Note: With the conclusion of the TREC-COVID challenge, we no longer have the resources to keep this page up to date with the latest distribution of CORD-19. The state of this page has mostly remained unchanged since mid-July 2020, when we built the indexes for Round 5 of the TREC-COVID challenge, with some updates in mid-November 2020.
For an easy way to get started, check out our (unfortunately, out of date) Colab demos, also available here:
- Colab demo using the title + abstract index
- Colab demo using the paragraph index
- Colab demo that demonstrates integration with SciBERT
We provide instructions on how to build Lucene indexes for the collection using Anserini below, but if you don't want to bother building the indexes yourself, we have pre-built indexes that you can directly download:
Version | Type | Size | Link | Checksum |
---|---|---|---|---|
2020-07-16 | Abstract | 2.1G | [Dropbox] | c883571ccc78b4c2ce05b41eb07f5405 |
2020-07-16 | Full-Text | 4.1G | [Dropbox] | 23cfad89b4c206d66125f5736f60248f |
2020-07-16 | Paragraph | 5.9G | [Dropbox] | c2c6ac832f8a1fcb767d2356d2b1e1df |
"Size" refers to the output of ls -lh
, "Version" refers to the dataset release date from AI2.
For our answer to the question, "which one should I use?" see below.
We've kept around older versions of indexes for archival purposes — scroll all the way down to the bottom of the page to see those.
Note that starting 2020/05/27, AI2 switched to daily releases of CORD-19 (from weekly), and as a result, it has become impractical to share pre-built indexes for every single update. Thus, we will only be providing pre-built indexes "occasionally".
However, we have written a simple script that will largely automate all the instructions on this page:
$ python src/main/python/trec-covid/index_cord19.py --date 2020-07-16 --all
This script was updated in mid-November 2020 (on the July distribution above and the latest version available) and has been verified to work at the time.
The script will:
- Download a specific release of CORD-19 (
--download
). - Build abstract, full-text, and paragraph indexes (
--index
). - Verify the indexes with topics and qrels from TREC-COVID (
--verify
).
The --date
argument takes the format of YYYY-MM-DD
, corresponding to a release here.
The --all
argument runs all three steps above (or use each argument individually to run just a specific step).
By default, the script does not overwrite existing data, unless --force
is specified.
After the above script completes successfully, the output will be something like:
## Effectiveness Summary
CORD-19 release: 2020-07-16
Topics/Qrels: TREC-COVID Round 3
Whitelist: TREC-COVID Round 3 valid docids
NDCG@10 Judged@10
Abstract index 0.5431 0.8675
Full-text index 0.3379 0.6625
Paragraph index 0.4923 0.8575
The instructions below walk through, essentially, what the script does, step by step.
These instructions work with the dataset release from 2020/07/16. First, download the data:
DATE=2020-07-16
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_"${DATE}".tar.gz
tar xvfz cord-19_"${DATE}".tar.gz -C collections
tar xvfz collections/"${DATE}"/document_parses.tar.gz -C collections/"${DATE}"
mv collections/"${DATE}" collections/cord19-"${DATE}"
We can now index this corpus using Anserini. Currently, we have implemented three different variants, described below. For a sense of how these different methods stack up, refer to the following paper:
- Jimmy Lin. Is Searching Full Text More Effective Than Searching Abstracts? BMC Bioinformatics, 10:46 (3 February 2009).
The tl;dr — we'd recommend getting started with abstract index since it's the smallest in size and easiest to manipulate. Paragraph indexing is likely to be more effective (i.e., better search results), but a bit more difficult to manipulate since some deduping is required to post-process the raw hits (since multiple paragraphs from the same article might be retrieved). The full-text index overly biases long documents and isn't really effective; this condition is included here only for completeness.
Note that as of TREC-COVID Round 1, there is some evidence that the abstract index is more effective for search, see results of experiments here. (Update, mid-November 2020: this statement was made back after Round 1; there is now considerably more evidence regarding the effectiveness of each approach; see link for details.)
We can index abstracts (and titles, of course) with Cord19AbstractCollection
, as follows:
sh target/appassembler/bin/IndexCollection \
-collection Cord19AbstractCollection -generator Cord19Generator \
-threads 8 -input collections/cord19-"${DATE}" \
-index indexes/lucene-index-cord19-abstract-"${DATE}" \
-storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-abstract.${DATE}.txt
The log should end with something like this:
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:874) - Indexing Complete! 192,459 documents indexed
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:875) - ============ Final Counter Values ============
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:876) - indexed: 192,459
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:877) - unindexable: 0
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:878) - empty: 44
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:879) - skipped: 6
2020-07-22 08:15:07,372 INFO [main] index.IndexCollection (IndexCollection.java:880) - errors: 0
2020-07-22 08:15:07,378 INFO [main] index.IndexCollection (IndexCollection.java:883) - Total 192,459 documents indexed in 00:02:46
The contents
field of each Lucene document is a concatenation of the article's title and abstract.
We can index the full text, with Cord19FullTextCollection
, as follows:
sh target/appassembler/bin/IndexCollection \
-collection Cord19FullTextCollection -generator Cord19Generator \
-threads 8 -input collections/cord19-"${DATE}" \
-index indexes/lucene-index-cord19-full-text-"${DATE}" \
-storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-full-text.${DATE}.txt
The log should end with something like this:
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:874) - Indexing Complete! 192,460 documents indexed
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:875) - ============ Final Counter Values ============
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:876) - indexed: 192,460
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:877) - unindexable: 0
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:878) - empty: 43
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:879) - skipped: 6
2020-07-22 08:23:04,801 INFO [main] index.IndexCollection (IndexCollection.java:880) - errors: 0
2020-07-22 08:23:04,806 INFO [main] index.IndexCollection (IndexCollection.java:883) - Total 192,460 documents indexed in 00:07:56
The contents
field of each Lucene document is a concatenation of the article's title and abstract, and the full text JSON (if available).
We can build a paragraph index with Cord19ParagraphCollection
, as follows:
sh target/appassembler/bin/IndexCollection \
-collection Cord19ParagraphCollection -generator Cord19Generator \
-threads 8 -input collections/cord19-"${DATE}" \
-index indexes/lucene-index-cord19-paragraph-"${DATE}" \
-storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-paragraph.${DATE}.txt
The log should end with something like this:
2020-07-22 08:46:57,535 INFO [main] index.IndexCollection (IndexCollection.java:874) - Indexing Complete! 3,010,497 documents indexed
2020-07-22 08:46:57,535 INFO [main] index.IndexCollection (IndexCollection.java:875) - ============ Final Counter Values ============
2020-07-22 08:46:57,536 INFO [main] index.IndexCollection (IndexCollection.java:876) - indexed: 3,010,497
2020-07-22 08:46:57,536 INFO [main] index.IndexCollection (IndexCollection.java:877) - unindexable: 0
2020-07-22 08:46:57,536 INFO [main] index.IndexCollection (IndexCollection.java:878) - empty: 44
2020-07-22 08:46:57,536 INFO [main] index.IndexCollection (IndexCollection.java:879) - skipped: 2,660
2020-07-22 08:46:57,536 INFO [main] index.IndexCollection (IndexCollection.java:880) - errors: 0
2020-07-22 08:46:57,543 INFO [main] index.IndexCollection (IndexCollection.java:883) - Total 3,010,497 documents indexed in 00:23:51
In this configuration, the indexer creates multiple Lucene Documents for each source article:
docid
: title + abstractdocid.00001
: title + abstract + 1st paragraphdocid.00002
: title + abstract + 2nd paragraphdocid.00003
: title + abstract + 3rd paragraph- ...
The suffix of the docid
, .XXXXX
identifies which paragraph is being indexed.
The original raw JSON full text is stored in the raw
field of docid
(without the suffix).
All versions of pre-built indexes:
Version | Type | Size | Link | Checksum |
---|---|---|---|---|
2020-07-16 | Abstract | 2.1G | [Dropbox] | c883571ccc78b4c2ce05b41eb07f5405 |
2020-07-16 | Full-Text | 4.1G | [Dropbox] | 23cfad89b4c206d66125f5736f60248f |
2020-07-16 | Paragraph | 5.9G | [Dropbox] | c2c6ac832f8a1fcb767d2356d2b1e1df |
2020-06-19 | Abstract | 2.0G | [Dropbox] | 029bd55daba8800fbae2be9e5fcd7b33 |
2020-06-19 | Full-Text | 3.8G | [Dropbox] | 3d0eb12094a24cff9bcacd1f17c3ea1c |
2020-06-19 | Paragraph | 5.5G | [Dropbox] | 5cd8cd6998177bed7a3e0057ef8b3595 |
2020-06-12 | Abstract | 1.9G | [Dropbox] | e0d9d312a83d67c21069717957a56f47 |
2020-06-12 | Full-Text | 3.7G | [Dropbox] | 72018ee46556cc72d01885203ea386dc |
2020-06-12 | Paragraph | 5.3G | [Dropbox] | 72732d298885c2c317236af33b08197c |
2020-05-26 | Abstract | 1.7G | [Dropbox] | 2dc054f4ca7db281e9f5e0d4836df14c |
2020-05-26 | Full-Text | 3.3G | [Dropbox] | 9b9fd4b97f75fa295e3345d0cf7914e3 |
2020-05-26 | Paragraph | 4.7G | [Dropbox] | 72eb265c1c9983f02f1e79a2ba19befb |
2020-05-19 | Abstract | 1.7G | [Dropbox] | 37bb97d0c41d650ba8e135fd75ae8fd8 |
2020-05-19 | Full-Text | 3.3G | [Dropbox] | f5711915a66cd2b511e0fb8d03e4c325 |
2020-05-19 | Paragraph | 4.9G | [Dropbox] | 012ab1f804382b2275c433a74d7d31f2 |
2020-05-12 | Abstract | 1.3G | [Dropbox] | dfd09e70cd672bbe15a63437351e1f74 |
2020-05-12 | Full-Text | 2.5G | [Dropbox] | 5b914e8ae579195185cf28a60051236d |
2020-05-12 | Paragraph | 3.6G | [Dropbox] | a2cb36762078ef9373f0ddaf52618e7f |
2020-05-01 | Abstract | 1.2G | [Dropbox] | a06e71a98a68d31148cb0e97e70a2ee1 |
2020-05-01 | Full-Text | 2.4G | [Dropbox] | e7eca1b976cdf2cd80e908c9ac2263cb |
2020-05-01 | Paragraph | 3.6G | [Dropbox] | 8f9321757a03985ac1c1952b2fff2c7d |
2020-04-24 | Abstract | 1.3G | [Dropbox] | 93540ae00e166ee433db7531e1bb51c8 |
2020-04-24 | Full-Text | 2.4G | [Dropbox] | fa927b0fc9cf1cd382413039cdc7b736 |
2020-04-24 | Paragraph | 5.0G | [Dropbox] | 7c6de6298e0430b8adb3e03310db32d8 |
2020-04-17 | Abstract | 1.2G | [Dropbox] | d57b17eadb1b44fc336b4121c139a598 |
2020-04-17 | Full-Text | 2.2G | [Dropbox] | 677546e0a1b7855a48eee8b6fbd7d7af |
2020-04-17 | Paragraph | 4.7G | [Dropbox] | c11e46230b744a46747f84e49acc9c2b |
2020-04-10 | Abstract | 1.2G | [Dropbox] | ec239d56498c0e7b74e3b41e1ce5d42a |
2020-04-10 | Full-Text | 3.3G | [Dropbox] | 401a6f5583b0f05340c73fbbeb3279c8 |
2020-04-10 | Paragraph | 3.4G | [Dropbox] | 8b87a2c55bc0a15b87f11e796860216a |
2020-04-03 | Abstract | 1.1G | [Dropbox] | 5d0d222e746d522a75f94240f5ab9f23 |
2020-04-03 | Full-Text | 3.0G | [Dropbox] | 9aafb86fec39e0882bd9ef0688d7a9cc |
2020-04-03 | Paragraph | 3.1G | [Dropbox] | 523894cfb52fc51c4202e76af79e1b10 |
2020-03-27 | Abstract | 1.1G | [Dropbox] | c5f7247e921c80f41ac6b54ff38eb229 |
2020-03-27 | Full-Text | 2.9G | [Dropbox] | 3c126344f9711720e6cf627c9bc415eb |
2020-03-27 | Paragraph | 3.1G | [Dropbox] | 8e02de859317918af4829c6188a89086 |
2020-03-20 | Abstract | 1.0G | [Dropbox] | 281c632034643665d52a544fed23807a |
2020-03-20 | Full-Text | 2.6G | [Dropbox] | 30cae90b85fa8f1b53acaa62413756e3 |
2020-03-20 | Paragraph | 2.9G | [Dropbox] | 4c78e9ede690dbfac13e25e634c70ae4 |
These indexes are also mirrored here.
- Release of 2020/05/19: Missing URLs for several articles due to a known issue with the CORD-19 dataset release.