
medrxiv biorxiv download

petermr edited this page May 5, 2020 · 8 revisions

medrxiv and biorxiv

These *rxivs have no API but can be searched through RESTful query URLs. The download process is shown in AMIDownloadTool and AMIDownloadTest. These examples should be treated as a proof of concept (PoC); our current strategy is to use Ferret if possible.

overview

These rxivs work in 3 or 4 steps when run by a human:

  1. search/query generates a paged hitlist (e.g. 25 hits per page).
  2. foreach hitlist link create a landingpage.
  3. foreach landingpage retrieve (a) fulltext.html (b) fulltext.pdf
  4. (optional) foreach fulltext.html retrieve supplemental files
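The paging in step 1 can be sketched as follows. This is a minimal sketch and not the AMIDownloadTool code: `collect_hits` and the `fetch_page` callback are hypothetical names, and the stopping rule (a page with fewer hits than the page size is assumed to be the last one) is the rule reported in the ami download output on this page.

```python
from typing import Callable, List


def collect_hits(fetch_page: Callable[[int], List[str]],
                 page_size: int,
                 limit: int) -> List[str]:
    """Walk a paged hitlist and return at most `limit` landing-page links.

    `fetch_page(n)` is a caller-supplied (hypothetical) function that returns
    the links found on hitlist page n, e.g. by fetching and parsing the HTML.
    """
    links: List[str] = []
    page = 0
    while len(links) < limit:
        hits = fetch_page(page)
        links.extend(hits)
        # A short page is taken to mean there are no further pages.
        if len(hits) < page_size:
            break
        page += 1
    return links[:limit]
```

Passing the fetcher in as a function keeps the paging logic independent of how the HTML is obtained (curl, a headless browser, or a canned test page).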

desired operation

This should work in a similar way to getpapers:

download -q "my query" -o myproject --site medrxiv --limit 100

should generate a directory myproject containing

  • metadata.json and a logfile
  • and 100 subdirectories (named from URLs) each containing
    1. fulltext.pdf if it exists,
    2. metadata.json if it exists
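The subdirectory names can be derived from the hit URLs. The sketch below assumes the convention visible in the ami download output on this page, where /content/10.1101/2020.04.24.20073973v1 becomes 10_1101_2020_04_24_20073973v1. `ctree_name` and `planned_layout` are hypothetical helper names, and the result is only a plan of the paths listed above, since fulltext.pdf and metadata.json may not exist for every hit.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, List


def ctree_name(content_path: str) -> str:
    """Map a hit path such as '/content/10.1101/2020.04.24.20073973v1'
    to a directory name such as '10_1101_2020_04_24_20073973v1'."""
    doi_part = content_path.split("/content/", 1)[-1].strip("/")
    # Replace every character that is not a letter or digit with '_'.
    return re.sub(r"[^A-Za-z0-9]", "_", doi_part)


def planned_layout(project: str, content_paths: Iterable[str]) -> List[str]:
    """List the files the desired command would try to create."""
    root = PurePosixPath(project)
    files = [str(root / "metadata.json")]
    for path in content_paths:
        tree = root / ctree_name(path)
        files.append(str(tree / "fulltext.pdf"))
        files.append(str(tree / "metadata.json"))
    return files
```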

problems

  • HTML is generated at every stage. Some of it may be lazy-loaded, which requires a headless browser. getpapers uses a headless browser (phantom.js), but neither is now supported.
  • a true API would support most of steps 1-3, but many sites do not provide one.
  • there are problems of command options, directory creation and others that are somewhat independent of downloading
  • downloading can be slow, especially if each resource is loaded separately, but there may be no alternative.

ami download

Running ami download has successfully carried out steps 1-3. It's possible that the earlier "hang" was due to overload on medrxiv. Here's the output for the complete process:

Specific values (AMICleanTool)
================================
fileGlobs     [medrxiv/]
0    [main] DEBUG org.contentmine.ami.tools.AMICleanTool  - GLOB: medrxiv/(0) ==> []

Generic values (AMIDownloadTool)
================================
-v to see generic values
40   [main] INFO  org.contentmine.ami.tools.AMIDownloadTool  - set output to: scraped/
project         target/medrxiv/ebola


Specific values (AMIDownloadTool)
================================
fulltext           [pdf]
limit              40
metadata           metadata
pages              [1, 4]
pagesize           20
query              ["ebola, AND, n95"]
hitListList      []
site               medrxiv
file types          []

Query: "ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
URL https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
running curl :https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20?page=0 to target/medrxiv/ebola/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
page hits (11) less than page size (20) ; assumed termination
Results 11
[target/medrxiv/ebola/__metadata/hitList1.clean.html]
  ========
HitList: 1
 creates hitList[1..1][.clean].html
 and <per-ctree>/scrapedMetadata.html
========
download files in hitList target/medrxiv/ebola/__metadata/hitList1.clean.html
result set: target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.04.24.20073973v1, /content/10.1101/2020.04.22.20076117v1, /content/10.1101/2020.04.23.20077230v1, /content/10.1101/2020.04.24.20078907v1, /content/10.1101/2020.03.05.20032003v1, /content/10.1101/2020.03.31.20047126v1, /content/10.1101/2020.04.06.20054197v1, /content/10.1101/2020.03.20.20039644v2, /content/10.1101/2020.04.11.20062356v1, /content/10.1101/2020.03.23.20039446v2, /content/10.1101/2020.04.15.20066480v2]
2744 [main] DEBUG org.contentmine.ami.tools.download.LandingPageManager  - target/medrxiv/ebola
running batched up curlDownloader for 11 landingPages, takes ca 1-5 sec/page 
ran curlDownloader for 11 landingPages 
--------
+downloaded 11 files for target/medrxiv/ebola/__metadata/hitList1.clean.html
--------
========
adds LandingPages: 11
========
========
 CTrees 11
========
LP [10_1101_2020_04_24_20073973v1, 10_1101_2020_04_22_20076117v1, 10_1101_2020_04_23_20077230v1, 10_1101_2020_04_24_20078907v1, 10_1101_2020_03_05_20032003v1, 10_1101_2020_03_31_20047126v1, 10_1101_2020_04_06_20054197v1, 10_1101_2020_03_20_20039644v2, 10_1101_2020_04_11_20062356v1, 10_1101_2020_03_23_20039446v2, 10_1101_2020_04_15_20066480v2]
content 127132
 2020.04.24.20073973.full.pdf
content 112497
 2020.04.22.20076117.full.pdf
content 119018
 2020.04.23.20077230.full.pdf
content 176902
 2020.04.24.20078907.full.pdf
content 113807
 2020.03.05.20032003.full.pdf
content 119105
 2020.03.31.20047126.full.pdf
content 109789
 2020.04.06.20054197.full.pdf
content 125442
 2020.03.20.20039644.full.pdf
content 111965
 2020.04.11.20062356.full.pdf
content 119340
 2020.03.23.20039446.1.full.pdf
content 118888
 2020.04.15.20066480.full.pdf
========
Fulltext: finished
========
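The search URL in the output above can be rebuilt from the query terms. The sketch below only reproduces the logged string: the terms appear to be joined with "+" and then percent-encoded twice (so "+" becomes %252B), with the sort and numresults options appended. That reading is inferred from the log, `search_url` is a hypothetical helper and not part of ami, and it has not been checked whether medrxiv still accepts this URL form.

```python
from typing import Sequence
from urllib.parse import quote

BASE = "https://www.medrxiv.org/search/"


def search_url(terms: Sequence[str], page_size: int = 20, page: int = 0) -> str:
    """Rebuild the search URL seen in the ami download log (assumed scheme)."""
    joined = "+".join(terms)
    # Encoding twice turns '+' into '%2B' and then into '%252B', as in the log.
    double_encoded = quote(quote(joined, safe=""), safe="")
    # The options are encoded once: ' ' becomes %20 and ':' becomes %3A.
    options = quote(f" sort:relevance-rank numresults:{page_size}", safe="")
    return f'{BASE}"{double_encoded}"{options}?page={page}'


print(search_url(["ebola", "AND", "n95"]))
```

For the terms ebola, AND, n95 with a page size of 20 this yields the same URL, including ?page=0, that the log reports passing to curl.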