Skip to content

medrxiv biorxiv download

petermr edited this page May 5, 2020 · 8 revisions

medrxiv and biroxiv

These *rxivs have no API but can be accessed by a restful query. The process of download is shown in AMIDownloadTool and AMIDownloadTest. These examples can be seen as PoC; our current strategy is to use Ferret if possible.

overview

These rxiv s work in 3 or 4 steps when run by a human:

  1. search/query generates a paged hitlist (e.g. 25 hits per page).
  2. foreach hitlist link create a landingpage.
  3. foreach landingpage retrieve (a) fulltext.html (b) fulltext.pdf
  4. (optional) foreach fulltext.html retrieve supplemental files

desired operation

This should work in a similar way to getpapers:

download -q "my query" -o myproject --site medrxiv --limit 100

should generate a directory of myproject containing

  • metadata.jsonand a logfile
  • and 100 subdirectories (named from URLs) each containing
    1. fulltext.pdf if it exists,
    2. metadata.json if it exists
Clone this wiki locally