
medrxiv biorxiv download

petermr edited this page May 5, 2020 · 8 revisions

medrxiv and biorxiv

These *rxivs have no API but can be accessed through a RESTful query. The download process is shown in AMIDownloadTool and AMIDownloadTest. These examples can be seen as a proof of concept (PoC); our current strategy is to use Ferret if possible.


These *rxivs work in 3 or 4 steps when run by a human:

  1. a search/query generates a paged hitlist (e.g. 25 hits per page).
  2. for each hitlist link, create a landingpage.
  3. for each landingpage, retrieve (a) fulltext.html (b) fulltext.pdf
  4. (optional) for each fulltext.html, retrieve supplemental files
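
Step 2 above depends on pulling the landing-page links out of each hitlist page. The following is a minimal sketch of that step, not the AMIDownloadTool implementation. It assumes the links are ordinary `<a href>` attributes beginning with `/content/10.1101/` (the form that appears in the ami download log further down); the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HitListLinkParser(HTMLParser):
    """Collects hrefs that look like preprint landing pages."""

    def __init__(self, prefix="/content/10.1101/"):
        super().__init__()
        self.prefix = prefix  # assumed link prefix, taken from the log output
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # keep each landing-page link once, in document order
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


def landing_page_links(hitlist_html):
    """Return the landing-page paths found in one hitlist page."""
    parser = HitListLinkParser()
    parser.feed(hitlist_html)
    return parser.links
```

If the hitlist is lazy-loaded by JavaScript, the links will not be present in the raw HTML and a sketch like this will return an empty list.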

desired operation

This should work in a similar way to getpapers:

download -q "my query" -o myproject --site medrxiv --limit 100

should generate a directory myproject containing

  • metadata.json and a logfile
  • and 100 subdirectories (named from URLs) each containing
    1. fulltext.pdf if it exists,
    2. metadata.json if it exists
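
As an illustration of the desired command line, here is a minimal sketch of how its options could be parsed. It is not the existing ami or getpapers option handling. The long names `--query` and `--output`, the default values, and the restriction of `--site` to medrxiv and biorxiv are assumptions made for the example.

```python
import argparse


def build_parser():
    """Parser for the hypothetical `download` command described above."""
    parser = argparse.ArgumentParser(prog="download")
    parser.add_argument("-q", "--query", required=True,
                        help="search query sent to the site")
    parser.add_argument("-o", "--output", required=True,
                        help="project directory to create")
    parser.add_argument("--site", default="medrxiv",
                        choices=["medrxiv", "biorxiv"],  # assumed set of sites
                        help="repository to search")
    parser.add_argument("--limit", type=int, default=100,
                        help="maximum number of hits to download")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.query, args.output, args.site, args.limit)
```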


  • HTML is created in all stages. This may sometimes be lazy-loaded, which requires a headless browser. getpapers uses a headless browser (phantom.js), but neither is now supported.
  • a true API would support most of steps 1-3. Many sites do not provide one.
  • there are problems of command options, directory creation and others that are somewhat independent of downloading
  • downloading can be slow, especially if each resource is loaded separately, but there may be no alternative.

ami download

Running ami download has now successfully carried out tasks 1-3. It's possible that the earlier "hang" was due to overload on medrxiv. Here's the output for the complete process:

Specific values (AMICleanTool)
fileGlobs     [medrxiv/]
0    [main] DEBUG  - GLOB: medrxiv/(0) ==> []
0 [main] DEBUG  - GLOB: medrxiv/(0) ==> []

Generic values (AMIDownloadTool)
-v to see generic values
40   [main] INFO  - set output to: scraped/
40 [main] INFO  - set output to: scraped/
40 [main] INFO  - set output to: scraped/
project         target/medrxiv/ebola

Specific values (AMIDownloadTool)
fulltext           [pdf]
limit              40
metadata           metadata
pages              [1, 4]
pagesize           20
query              ["ebola, AND, n95"]
hitListList      []
site               medrxiv
file types          []

Query: "ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
running curl :"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20?page=0 to target/medrxiv/ebola/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
page hits (11) less than page size (20) ; assumed termination
Results 11
HitList: 1
 creates hitList[1..1][.clean].html
 and <per-ctree>/scrapedMetadata.html
download files in hitList target/medrxiv/ebola/__metadata/hitList1.clean.html
result set: target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.04.24.20073973v1, /content/10.1101/2020.04.22.20076117v1, /content/10.1101/2020.04.23.20077230v1, /content/10.1101/2020.04.24.20078907v1, /content/10.1101/2020.03.05.20032003v1, /content/10.1101/2020.03.31.20047126v1, /content/10.1101/2020.04.06.20054197v1, /content/10.1101/2020.03.20.20039644v2, /content/10.1101/2020.04.11.20062356v1, /content/10.1101/2020.03.23.20039446v2, /content/10.1101/2020.04.15.20066480v2]
2744 [main] DEBUG  - target/medrxiv/ebola
2744 [main] DEBUG  - target/medrxiv/ebola
2744 [main] DEBUG  - target/medrxiv/ebola
running batched up curlDownloader for 11 landingPages, takes ca 1-5 sec/page 
ran curlDownloader for 11 landingPages 
+downloaded 11 files for target/medrxiv/ebola/__metadata/hitList1.clean.html
adds LandingPages: 11
 CTrees 11
LP [10_1101_2020_04_24_20073973v1, 10_1101_2020_04_22_20076117v1, 10_1101_2020_04_23_20077230v1, 10_1101_2020_04_24_20078907v1, 10_1101_2020_03_05_20032003v1, 10_1101_2020_03_31_20047126v1, 10_1101_2020_04_06_20054197v1, 10_1101_2020_03_20_20039644v2, 10_1101_2020_04_11_20062356v1, 10_1101_2020_03_23_20039446v2, 10_1101_2020_04_15_20066480v2]
content 127132
content 112497
content 119018
content 176902
content 113807
content 119105
content 109789
content 125442
content 111965
content 119340
content 118888
Fulltext: finished
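
The log line "page hits (11) less than page size (20) ; assumed termination" suggests a simple paging rule: keep requesting pages until one comes back short or the limit is reached. The following is a minimal sketch of that rule under those assumptions, not the AMIDownloadTool code. `fetch_page` is a hypothetical callable that returns the list of hits for a zero-based page number (the log shows `?page=0` for the first page).

```python
def collect_hits(fetch_page, page_size=20, limit=40):
    """Gather hits page by page until a short page or the limit is reached."""
    hits = []
    page = 0
    while len(hits) < limit:
        page_hits = fetch_page(page)
        hits.extend(page_hits)
        # a page with fewer hits than the page size is assumed to be the last
        if len(page_hits) < page_size:
            break
        page += 1
    return hits[:limit]
```

With the run above (11 hits, page size 20, limit 40) this rule stops after the first page. A site that returns a full final page would cost one extra request, which then comes back empty.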

ami download output

$ tree ebola/
├── 10_1101_2020_03_05_20032003v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_03_20_20039644v2
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_03_23_20039446v2
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_03_31_20047126v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_06_20054197v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_11_20062356v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_15_20066480v2
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_22_20076117v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_23_20077230v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_24_20073973v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_24_20078907v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
└── __metadata
    ├── hitList1.clean.html
    └── hitList1.html

12 directories, 35 files
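
The directory names in this tree correspond to the `/content/...` paths in the log with `/` and `.` replaced by `_`, and each of the 11 directories holds the same three files (11 × 3 plus the 2 hitlist files gives the 35 files reported). The following is a minimal sketch for checking a project for missing files. It assumes that naming rule and that directories starting with `__` hold project-level metadata; the function names are hypothetical.

```python
from pathlib import Path

# files each per-paper directory contained in the run shown above
EXPECTED = ("fulltext.pdf", "landingPage.html", "scrapedMetadata.html")


def ctree_name(content_path):
    """Map "/content/10.1101/2020.04.24.20073973v1" to a directory name."""
    doi_part = content_path.split("/content/", 1)[1]
    return doi_part.replace("/", "_").replace(".", "_")


def missing_files(project_dir):
    """Return {directory name: [missing file names]} for incomplete entries."""
    missing = {}
    for ctree in sorted(Path(project_dir).iterdir()):
        # skip plain files and project-level directories such as __metadata
        if not ctree.is_dir() or ctree.name.startswith("__"):
            continue
        absent = [name for name in EXPECTED if not (ctree / name).exists()]
        if absent:
            missing[ctree.name] = absent
    return missing
```

An empty result from `missing_files("target/medrxiv/ebola")` would indicate that every landing page yielded a PDF and scraped metadata.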
