medrxiv biorxiv download
petermr edited this page May 5, 2020 · 8 revisions
These *rxivs have no API but can be accessed by a RESTful query. The download process is shown in AMIDownloadTool and AMIDownloadTest. These examples can be seen as a PoC; our current strategy is to use Ferret if possible.
These rxivs work in 3 or 4 steps when run by a human:
- a search/query generates a paged hitlist (e.g. 25 hits per page).
- for each hitlist link, retrieve the landing page.
- for each landing page, retrieve (a) fulltext.html and (b) fulltext.pdf.
- (optional) for each fulltext.html, retrieve the supplemental files.
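The steps above can be sketched roughly as follows. This is only an illustration, not the code in AMIDownloadTool: the CSS class names (`hit`, `fulltext-html`, `fulltext-pdf`) and the `fetch` and `hitlist_url` callables are hypothetical placeholders, since the real markup and query URLs of the sites are not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.css_class in classes and attr.get("href"):
            self.links.append(attr["href"])


def extract_links(html, base_url, css_class):
    """Return absolute URLs of all matching links in one HTML page."""
    collector = LinkCollector(css_class)
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def crawl(fetch, hitlist_url, limit):
    """Walk hitlist pages, then landing pages, for at most `limit` hits.

    `fetch(url)` returns the HTML text of a page and `hitlist_url(n)` builds
    the URL of the n-th hitlist page; both are supplied by the caller.
    The class names below are assumed, not taken from any real site.
    """
    results = []
    page = 0
    while len(results) < limit:
        url = hitlist_url(page)
        landing_urls = extract_links(fetch(url), url, "hit")  # step 1
        if not landing_urls:
            break  # no more hitlist pages
        for landing in landing_urls[: limit - len(results)]:
            landing_html = fetch(landing)  # step 2
            results.append({
                "landing": landing,
                # step 3: locate (a) the HTML and (b) the PDF fulltext
                "html": extract_links(landing_html, landing, "fulltext-html"),
                "pdf": extract_links(landing_html, landing, "fulltext-pdf"),
            })
        page += 1
    return results
```

Passing `fetch` in as a parameter keeps the paging logic separate from the transport, so a plain HTTP client can be swapped for a headless browser where pages are lazy-loaded.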
This should work in a similar way to getpapers:
download -q "my query" -o myproject --site medrxiv --limit 100
should generate a directory myproject containing
- metadata.json and a logfile
- 100 subdirectories (named from URLs), each containing
  - fulltext.pdf if it exists
  - metadata.json if it exists

Notes:
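A minimal sketch of how such a project directory might be laid out is shown below. The scheme for turning a URL into a directory name, the logfile name `download.log`, and the shape of the `hits` records are all assumptions made for illustration; the page only says that subdirectories are "named from URLs".

```python
import json
import re
from pathlib import Path


def dirname_from_url(url):
    """Turn a URL into a filesystem-safe name (hypothetical naming scheme)."""
    stripped = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9._-]+", "_", stripped).strip("_")


def write_project(project_dir, query, hits):
    """Create the project tree for a finished download.

    `hits` is a list of dicts with a "url" key plus optional "metadata"
    (a dict) and "pdf" (bytes); files are written only when present.
    """
    root = Path(project_dir)
    root.mkdir(parents=True, exist_ok=True)
    summary = {"query": query, "hits": len(hits)}
    (root / "metadata.json").write_text(json.dumps(summary, indent=2))
    with open(root / "download.log", "w") as log:  # assumed logfile name
        for hit in hits:
            sub = root / dirname_from_url(hit["url"])
            sub.mkdir(exist_ok=True)
            if hit.get("metadata") is not None:
                (sub / "metadata.json").write_text(
                    json.dumps(hit["metadata"], indent=2))
            if hit.get("pdf") is not None:
                (sub / "fulltext.pdf").write_bytes(hit["pdf"])
            log.write(f"{hit['url']} -> {sub.name}\n")
    return root
```

Keeping directory creation in its own function means it can be tested without any network access.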
- HTML is created in all stages. It may sometimes be lazy-loaded, which requires a headless browser. getpapers uses a headless browser (phantom.js), but neither is now supported.
- a true API would support most of steps 1-3. Many sites do not provide one.
- there are problems of command options, directory creation and others that are somewhat independent of downloading.
- downloading can be slow, especially if each resource is loaded separately, but there may be no alternative.
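Because command-option handling is largely independent of downloading, it can be prototyped on its own. The sketch below parses the example command shown earlier using Python's `argparse`. The short options and `--site`/`--limit` come from that example; the long names `--query` and `--output`, the default values, and the list of allowed sites are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the example `download` command line."""
    parser = argparse.ArgumentParser(prog="download")
    parser.add_argument("-q", "--query", required=True,
                        help="search query sent to the site")
    parser.add_argument("-o", "--output", required=True,
                        help="project directory to create")
    # Allowed sites and defaults are assumed for illustration.
    parser.add_argument("--site", choices=["medrxiv", "biorxiv"],
                        default="medrxiv", help="repository to search")
    parser.add_argument("--limit", type=int, default=100,
                        help="maximum number of hits to download")
    return parser


args = build_parser().parse_args(
    ["-q", "my query", "-o", "myproject", "--site", "medrxiv", "--limit", "100"])
```

After parsing, `args.query`, `args.output`, `args.site` and `args.limit` hold the values, with `args.limit` already converted to an integer.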