medrxiv biorxiv download
petermr edited this page May 5, 2020
These *rxivs have no API but can be accessed by a RESTful query. The download process is shown in `AMIDownloadTool` and `AMIDownloadTest`. These examples can be seen as a proof of concept (PoC); our current strategy is to use Ferret if possible.
These *rxivs work in 3 or 4 steps when run by a human:
- A search/query generates a paged hitlist (e.g. 25 hits per page).
- For each hitlist link, download the landing page.
- For each landing page, retrieve (a) `fulltext.html` (b) `fulltext.pdf`.
- (Optional) For each `fulltext.html`, retrieve supplemental files.
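The steps above can be sketched as a small driver loop. This is a minimal Python sketch, not the `AMIDownloadTool` implementation: `iter_hits`, `run_pipeline` and the three fetcher callbacks are hypothetical names, and the stop-on-short-page rule is an assumption based on the "assumed termination" message in the run log on this page.

```python
from typing import Callable, Dict, Iterator, List


def iter_hits(fetch_page: Callable[[int], List[str]],
              page_size: int, limit: int) -> Iterator[str]:
    """Step 1: walk the paged hitlist, yielding landing-page links.

    Stops when `limit` links have been yielded, or when a page returns
    fewer hits than `page_size` (assumed to be the last page).
    """
    page = 0
    yielded = 0
    while yielded < limit:
        hits = fetch_page(page)
        for link in hits:
            if yielded >= limit:
                return
            yield link
            yielded += 1
        if len(hits) < page_size:
            return  # short page: assume there are no more results
        page += 1


def run_pipeline(fetch_page: Callable[[int], List[str]],
                 fetch_landing: Callable[[str], str],
                 fetch_fulltext: Callable[[str], bytes],
                 page_size: int = 20, limit: int = 40) -> Dict[str, bytes]:
    """Steps 1-3: hitlist -> landing page -> fulltext, keyed by link."""
    results = {}
    for link in iter_hits(fetch_page, page_size, limit):
        landing_html = fetch_landing(link)              # step 2
        results[link] = fetch_fulltext(landing_html)    # step 3
    return results
```

Passing the fetchers in as callbacks keeps the paging logic independent of how pages are actually retrieved (curl, a headless browser, or a real API), so the loop can be exercised without network access.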
This should work in a similar way to `getpapers`:

    download -q "my query" -o myproject --site medrxiv --limit 100

should generate a directory `myproject` containing
- `metadata.json` and a logfile
- 100 subdirectories (named from URLs), each containing
  - `fulltext.pdf` if it exists
  - `metadata.json` if it exists
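The intended layout can be sketched as follows. This is an illustrative Python sketch under stated assumptions: `ctree_name` and `planned_layout` are hypothetical helpers, and the naming rule (drop `/content/`, replace `.` and `/` with `_`) is inferred from the link and directory names in the run log on this page, not taken from the AMI source.

```python
import re
from pathlib import PurePosixPath
from typing import List


def ctree_name(link: str) -> str:
    """Derive a subdirectory name from a landing-page link.

    Assumed rule: strip the '/content/' prefix, then map '.' and '/' to '_'.
    """
    path = link.split("/content/", 1)[-1].strip("/")
    return re.sub(r"[./]", "_", path)


def planned_layout(project: str, links: List[str]) -> List[str]:
    """List the files the project directory is expected to contain."""
    root = PurePosixPath(project)
    paths = [str(root / "metadata.json")]
    for link in links:
        sub = root / ctree_name(link)
        paths.append(str(sub / "fulltext.pdf"))    # only if the site has one
        paths.append(str(sub / "metadata.json"))   # only if the site has one
    return paths
```

For example, `ctree_name("/content/10.1101/2020.04.24.20073973v1")` gives `10_1101_2020_04_24_20073973v1`.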
- HTML is created in all stages. This may sometimes be lazy-loaded, which requires a headless browser. `getpapers` uses a headless browser (`phantom.js`) but neither is now supported.
- A true API will support most of steps 1-3. Many sites do not offer one.
- There are problems of command options, directory creation and others that are somewhat independent of downloading.
- Downloading can be slow, especially if each resource is loaded separately, but there may be no alternative.
On running `ami download`, it successfully carried out tasks 1-3. It's possible that the earlier "hang" was due to overload on `medrxiv`. Here's the output for the complete process:
Specific values (AMICleanTool)
================================
fileGlobs [medrxiv/]
0 [main] DEBUG org.contentmine.ami.tools.AMICleanTool - GLOB: medrxiv/(0) ==> []
0 [main] DEBUG org.contentmine.ami.tools.AMICleanTool - GLOB: medrxiv/(0) ==> []
Generic values (AMIDownloadTool)
================================
-v to see generic values
40 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
40 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
40 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
project target/medrxiv/ebola
Specific values (AMIDownloadTool)
================================
fulltext [pdf]
limit 40
metadata metadata
pages [1, 4]
pagesize 20
query ["ebola, AND, n95"]
hitListList []
site medrxiv
file types []
Query: "ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
URL https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
running curl :https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20?page=0 to target/medrxiv/ebola/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
page hits (11) less than page size (20) ; assumed termination
Results 11
[target/medrxiv/ebola/__metadata/hitList1.clean.html]
========
HitList: 1
creates hitList[1..1][.clean].html
and <per-ctree>/scrapedMetadata.html
========
download files in hitList target/medrxiv/ebola/__metadata/hitList1.clean.html
result set: target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.04.24.20073973v1, /content/10.1101/2020.04.22.20076117v1, /content/10.1101/2020.04.23.20077230v1, /content/10.1101/2020.04.24.20078907v1, /content/10.1101/2020.03.05.20032003v1, /content/10.1101/2020.03.31.20047126v1, /content/10.1101/2020.04.06.20054197v1, /content/10.1101/2020.03.20.20039644v2, /content/10.1101/2020.04.11.20062356v1, /content/10.1101/2020.03.23.20039446v2, /content/10.1101/2020.04.15.20066480v2]
2744 [main] DEBUG org.contentmine.ami.tools.download.LandingPageManager - target/medrxiv/ebola
2744 [main] DEBUG org.contentmine.ami.tools.download.LandingPageManager - target/medrxiv/ebola
2744 [main] DEBUG org.contentmine.ami.tools.download.LandingPageManager - target/medrxiv/ebola
running batched up curlDownloader for 11 landingPages, takes ca 1-5 sec/page
ran curlDownloader for 11 landingPages
--------
+downloaded 11 files for target/medrxiv/ebola/__metadata/hitList1.clean.html
--------
========
adds LandingPages: 11
========
========
CTrees 11
========
LP [10_1101_2020_04_24_20073973v1, 10_1101_2020_04_22_20076117v1, 10_1101_2020_04_23_20077230v1, 10_1101_2020_04_24_20078907v1, 10_1101_2020_03_05_20032003v1, 10_1101_2020_03_31_20047126v1, 10_1101_2020_04_06_20054197v1, 10_1101_2020_03_20_20039644v2, 10_1101_2020_04_11_20062356v1, 10_1101_2020_03_23_20039446v2, 10_1101_2020_04_15_20066480v2]
content 127132
2020.04.24.20073973.full.pdf
content 112497
2020.04.22.20076117.full.pdf
content 119018
2020.04.23.20077230.full.pdf
content 176902
2020.04.24.20078907.full.pdf
content 113807
2020.03.05.20032003.full.pdf
content 119105
2020.03.31.20047126.full.pdf
content 109789
2020.04.06.20054197.full.pdf
content 125442
2020.03.20.20039644.full.pdf
content 111965
2020.04.11.20062356.full.pdf
content 119340
2020.03.23.20039446.1.full.pdf
content 118888
2020.04.15.20066480.full.pdf
========
Fulltext: finished
========
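The search URL in the log above can be reproduced with a few lines of Python. This is a sketch that only rebuilds the string shown in the log: `search_url` is a hypothetical helper, the double percent-encoding of `+` (to `%252B`) is copied from the observed output rather than from any medrxiv documentation, and whether the site still accepts this form is not verified here.

```python
from typing import Sequence
from urllib.parse import quote


def search_url(terms: Sequence[str], page_size: int = 20, page: int = 0,
               base: str = "https://www.medrxiv.org/search/") -> str:
    """Rebuild the search URL seen in the log (assumed format)."""
    joined = "+".join(terms)
    # Encode twice so that '+' becomes '%2B' and then '%252B', as in the log.
    phrase = quote(quote(joined, safe=""), safe="")
    # Space becomes '%20' and ':' becomes '%3A'; '-' is left as-is.
    options = quote(" sort:relevance-rank numresults:%d" % page_size, safe="")
    return '%s"%s"%s?page=%d' % (base, phrase, options, page)


print(search_url(["ebola", "AND", "n95"]))
```

The literal double quotes around the query are kept unencoded because that is how they appear in the logged `curl` target.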
$ tree ebola/
ebola/
├── 10_1101_2020_03_05_20032003v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_03_20_20039644v2
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_03_23_20039446v2
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_03_31_20047126v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_06_20054197v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_11_20062356v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_15_20066480v2
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_22_20076117v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_23_20077230v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_24_20073973v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_24_20078907v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
└── __metadata
  ├── hitList1.clean.html
  └── hitList1.html
12 directories, 35 files
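A downloaded project like the one above can be checked for completeness with a short script. This is a minimal sketch: `missing_files` is a hypothetical helper, the three expected file names are taken from the tree listing, and skipping directories whose names start with `__` is an assumption based on the `__metadata` directory shown there.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# File names observed in each per-article directory of the listing above.
EXPECTED = ("fulltext.pdf", "landingPage.html", "scrapedMetadata.html")


def missing_files(project: Path,
                  expected: Sequence[str] = EXPECTED) -> Dict[str, List[str]]:
    """Map each article directory to the expected files it lacks.

    Directories that are complete are omitted from the result.
    """
    report = {}
    for sub in sorted(p for p in project.iterdir() if p.is_dir()):
        if sub.name.startswith("__"):
            continue  # project-level metadata, not an article directory
        missing = [name for name in expected if not (sub / name).is_file()]
        if missing:
            report[sub.name] = missing
    return report
```

Calling `missing_files(Path("target/medrxiv/ebola"))` on a complete run would return an empty dictionary.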