medrxiv biorxiv download

medrxiv and biorxiv

These *rxivs have no API, but they can be accessed through a RESTful query. The download process is shown in AMIDownloadTool and AMIDownloadTest. These examples should be seen as a proof of concept (PoC); our current strategy is to use Ferret if possible.

overview

These *rxivs work in 3 or 4 steps when run by a human:

1. A search/query generates a paged hitlist (e.g. 25 hits per page).
2. For each hitlist link, retrieve the landing page.
3. For each landing page, retrieve (a) fulltext.html and (b) fulltext.pdf.
4. (Optional) For each fulltext.html, retrieve the supplemental files.
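
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the code in AMIDownloadTool: the sample HTML, the base URL `https://example.org`, the `/content/` link prefix, the `.full` / `.full.pdf` suffixes, and all function names are assumptions made for the example. A real hitlist page would need its own selectors, and no network access is performed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose path starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep each matching link once, in document order.
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


def landing_page_urls(hitlist_html, base_url, prefix="/content/"):
    """Step 2: turn one hitlist page into absolute landing-page URLs."""
    collector = LinkCollector(prefix)
    collector.feed(hitlist_html)
    return [urljoin(base_url, href) for href in collector.links]


def fulltext_urls(landing_url):
    """Step 3: derive fulltext URLs from a landing-page URL.

    The ".full" / ".full.pdf" suffixes are an assumed naming scheme.
    """
    return {"html": landing_url + ".full", "pdf": landing_url + ".full.pdf"}


# Hypothetical hitlist fragment standing in for a downloaded page.
SAMPLE_HITLIST = """
<ul>
  <li><a href="/content/10.1101/0001v1">First hit</a></li>
  <li><a href="/content/10.1101/0002v1">Second hit</a></li>
  <li><a href="/about">Not a hit</a></li>
</ul>
"""

if __name__ == "__main__":
    for url in landing_page_urls(SAMPLE_HITLIST, "https://example.org"):
        print(url, fulltext_urls(url))
```

Parsing is kept separate from fetching so that the link extraction can be tested against saved pages, which matters when the live HTML is lazy-loaded or changes.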

desired operation

This should work in a similar way to getpapers:

download -q "my query" -o myproject --site medrxiv --limit 100

should generate a directory myproject containing

metadata.json and a logfile
and 100 subdirectories (named from URLs) each containing
1. fulltext.pdf if it exists,
2. metadata.json if it exists
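
The option parsing and directory layout described above could look roughly like this. It is a sketch under stated assumptions: the options mirror the example command line, but the `dirname_from_url` naming rule, the `make_project` helper, the logfile name `download.log`, and the contents of the top-level metadata.json are all hypothetical. No downloading is performed, so the per-hit files are not created.

```python
import argparse
import json
import re
from pathlib import Path


def parse_args(argv):
    """Parse the options used in the example command line."""
    parser = argparse.ArgumentParser(prog="download")
    parser.add_argument("-q", dest="query", required=True)
    parser.add_argument("-o", dest="output", required=True)
    parser.add_argument("--site", default="medrxiv")
    parser.add_argument("--limit", type=int, default=100)
    return parser.parse_args(argv)


def dirname_from_url(url):
    """Make a filesystem-safe directory name from a URL (assumed rule)."""
    without_scheme = re.sub(r"^[a-z]+://", "", url)
    return re.sub(r"[^A-Za-z0-9.]+", "_", without_scheme).strip("_")


def make_project(args, landing_urls):
    """Create the project directory, top-level files, and one subdirectory per hit."""
    root = Path(args.output)
    root.mkdir(parents=True, exist_ok=True)
    hits = landing_urls[: args.limit]  # honour --limit
    metadata = {"query": args.query, "site": args.site, "hits": hits}
    (root / "metadata.json").write_text(json.dumps(metadata, indent=2))
    (root / "download.log").touch()  # hypothetical logfile name
    for url in hits:
        (root / dirname_from_url(url)).mkdir(exist_ok=True)
    return root
```

For example, `parse_args(["-q", "my query", "-o", "myproject", "--site", "medrxiv", "--limit", "100"])` yields the settings from the command above, and `make_project` can then be called with whatever landing-page URLs the search produced.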

problems

HTML is created in all stages. It may sometimes be lazy-loaded, which requires a headless browser. getpapers uses a headless browser (phantom.js), but neither is now supported.
A true API would support most of steps 1-3, but many sites do not provide one.
There are problems of command options, directory creation and others that are somewhat independent of downloading.
Downloading can be slow, especially if each resource is loaded separately, but there may be no alternative.
