medrxiv biorxiv download
These *rxivs have no API but can be accessed by a RESTful query. The download process is shown in AMIDownloadTool and AMIDownloadTest. These examples can be seen as a proof of concept (PoC); our current strategy is to use Ferret if possible.
These rxivs work in 3 or 4 steps when run by a human:
- a search/query generates a paged hitlist (e.g. 25 hits per page).
- for each hitlist link, retrieve a landing page.
- for each landing page, retrieve (a) fulltext.html and (b) fulltext.pdf.
- (optional) for each fulltext.html, retrieve supplemental files.
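The first step (paging through the hitlist) can be sketched as below. This is a minimal illustration, not the AMI implementation: `HitListPager`, `collectLinks` and the `fetchPage` function are hypothetical names, and the stopping rule (a page shorter than the page size is taken to be the last) is the one reported in the sample log further down ("page hits (11) less than page size (20) ; assumed termination").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class HitListPager {

    /**
     * Collects landing-page links page by page.
     * fetchPage is a hypothetical stand-in for "download and parse one hitlist page".
     */
    public static List<String> collectLinks(IntFunction<List<String>> fetchPage,
                                            int pageSize, int firstPage, int lastPage, int limit) {
        List<String> links = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            List<String> hits = fetchPage.apply(page);
            for (String hit : hits) {
                if (links.size() >= limit) {
                    return links; // safety cap reached
                }
                links.add(hit);
            }
            if (hits.size() < pageSize) {
                break; // short page: assume this was the last page of results
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Fake source: 11 hits on page 1 and nothing afterwards.
        IntFunction<List<String>> fake = page -> {
            List<String> hits = new ArrayList<>();
            if (page == 1) {
                for (int i = 1; i <= 11; i++) {
                    hits.add("/content/hit" + i);
                }
            }
            return hits;
        };
        List<String> links = collectLinks(fake, 20, 1, 4, 2000);
        System.out.println(links.size() + " landing-page links"); // prints "11 landing-page links"
    }
}
```

Passing the page fetcher in as a function keeps the paging logic testable without touching the network.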
This should work in a similar way to getpapers:
download -q "my query" -o myproject --site medrxiv --limit 100
should generate a directory myproject containing
- metadata.json and a logfile
- 100 subdirectories (named from URLs), each containing
  - fulltext.pdf if it exists
  - metadata.json if it exists
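As an illustration of "named from URLs", the sketch below reproduces the naming seen in the sample run further down, where the link /content/10.1101/2020.04.24.20073973v1 ends up in the directory 10_1101_2020_04_24_20073973v1. The rule used here (drop the /content/ prefix, replace every non-alphanumeric character with an underscore) is inferred from that output and is an assumption, not taken from the AMI source; `CTreeNames` and `directoryName` are hypothetical names.

```java
public class CTreeNames {

    /** Turns a landing-page path into a directory name (rule inferred from sample output). */
    public static String directoryName(String contentPath) {
        // Drop the leading "/content/" if present.
        String id = contentPath.replaceFirst("^/content/", "");
        // Replace dots, slashes and any other non-alphanumeric character with "_".
        return id.replaceAll("[^A-Za-z0-9]", "_");
    }

    public static void main(String[] args) {
        // prints "10_1101_2020_04_24_20073973v1"
        System.out.println(directoryName("/content/10.1101/2020.04.24.20073973v1"));
    }
}
```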
- HTML is created in all stages. It may sometimes be lazy-loaded, which requires a headless browser. getpapers uses a headless browser (phantom.js), but neither is now supported.
- a true API will support most of steps 1-3. Many sites do not offer one.
- there are problems of command options, directory creation and others that are somewhat independent of downloading.
- downloading can be slow, especially when each resource is loaded separately, but there may be no alternative.
ami download
run in org.contentmine.ami.tools.AMIDownloadTest.testMedrxivDownload()
The code is:
@Test
public void testMedrxivDownload() {
    String args;
    String biorxiv = "target/medrxiv/ebola";
    // remove any previous output under target/medrxiv/
    args = "-p " + "target"
        + " clean"
        + " medrxiv/";
    AMI.execute(args);
    // run the query and download PDFs into the project directory
    args =
        "-p " + biorxiv
        + " download"
        + " --site medrxiv"
        + " --query \"ebola AND n95\""
        + " --pagesize 20"
        + " --pages 1 4"
        + " --fulltext pdf"
        + " --limit 2000"
        ;
    AMIDownloadTool amiDownload = AMI.execute(AMIDownloadTool.class, args);
}
on the commandline this would be:
cd ami3
ami -p target clean medrxiv/
followed by
ami -p target/medrxiv/ebola download --site medrxiv --query "ebola AND n95" \
--pagesize 20 --pages 1 4 --fulltext pdf --limit 2000
The --pagesize and --pages flags restrict the run to 4 pages of 20 hits each (at most 80 fulltexts); --limit is a safety cap in case of mistakes.
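The arithmetic behind that bound can be written out as a small sketch. It assumes that `--pages 1 4` means an inclusive page range and that `--limit` is applied as a simple cap on the total; `DownloadBounds` and `maxDownloads` are hypothetical names, not part of AMI.

```java
public class DownloadBounds {

    /** Upper bound on the number of fulltexts, assuming an inclusive page range. */
    public static int maxDownloads(int pageSize, int firstPage, int lastPage, int limit) {
        int pages = lastPage - firstPage + 1;
        return Math.min(pages * pageSize, limit);
    }

    public static void main(String[] args) {
        // --pagesize 20 --pages 1 4 --limit 2000
        System.out.println(maxDownloads(20, 1, 4, 2000)); // prints 80
    }
}
```

The actual number of downloads can be lower, since a query may return fewer hits than the bound allows.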
On running ami download, tasks 1-3 were carried out successfully. It's possible that the "hang" seen earlier was due to overload on medrxiv. Here's the output (with debug lines clipped) for the complete process:
summary of input
Specific values (AMICleanTool)
================================
fileGlobs [medrxiv/]
Generic values (AMIDownloadTool)
================================
-v to see generic values
project target/medrxiv/ebola
Specific values (AMIDownloadTool)
================================
fulltext [pdf]
limit 40
metadata metadata
pages [1, 4]
pagesize 20
query ["ebola, AND, n95"]
hitListList []
site medrxiv
file types []
main output:
- running the query
Query: "ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
URL https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
running curl :https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20?page=0 to target/medrxiv/ebola/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
page hits (11) less than page size (20) ; assumed termination
Results 11
[target/medrxiv/ebola/__metadata/hitList1.clean.html]
gets a hitlist with 11 links to landing pages
========
HitList: 1
creates hitList[1..1][.clean].html
and <per-ctree>/scrapedMetadata.html
========
downloads landing pages
download files in hitList target/medrxiv/ebola/__metadata/hitList1.clean.html
result set: target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.04.24.20073973v1, /content/10.1101/2020.04.22.20076117v1, /content/10.1101/2020.04.23.20077230v1, /content/10.1101/2020.04.24.20078907v1, /content/10.1101/2020.03.05.20032003v1, /content/10.1101/2020.03.31.20047126v1, /content/10.1101/2020.04.06.20054197v1, /content/10.1101/2020.03.20.20039644v2, /content/10.1101/2020.04.11.20062356v1, /content/10.1101/2020.03.23.20039446v2, /content/10.1101/2020.04.15.20066480v2]
running batched up curlDownloader for 11 landingPages, takes ca 1-5 sec/page
ran curlDownloader for 11 landingPages
--------
+downloaded 11 files for target/medrxiv/ebola/__metadata/hitList1.clean.html
--------
========
adds LandingPages: 11
========
downloads PDFs from links in landing pages
========
CTrees 11
========
LP [10_1101_2020_04_24_20073973v1, 10_1101_2020_04_22_20076117v1, 10_1101_2020_04_23_20077230v1, 10_1101_2020_04_24_20078907v1, 10_1101_2020_03_05_20032003v1, 10_1101_2020_03_31_20047126v1, 10_1101_2020_04_06_20054197v1, 10_1101_2020_03_20_20039644v2, 10_1101_2020_04_11_20062356v1, 10_1101_2020_03_23_20039446v2, 10_1101_2020_04_15_20066480v2]
content 127132
2020.04.24.20073973.full.pdf
content 112497
2020.04.22.20076117.full.pdf
content 119018
2020.04.23.20077230.full.pdf
content 176902
2020.04.24.20078907.full.pdf
content 113807
2020.03.05.20032003.full.pdf
content 119105
2020.03.31.20047126.full.pdf
content 109789
2020.04.06.20054197.full.pdf
content 125442
2020.03.20.20039644.full.pdf
content 111965
2020.04.11.20062356.full.pdf
content 119340
2020.03.23.20039446.1.full.pdf
content 118888
2020.04.15.20066480.full.pdf
========
Fulltext: finished
========
$ tree ebola/
ebola/
├── 10_1101_2020_03_05_20032003v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_03_20_20039644v2
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_03_23_20039446v2
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_03_31_20047126v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_06_20054197v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_11_20062356v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_15_20066480v2
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_22_20076117v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_23_20077230v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_24_20073973v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
├── 10_1101_2020_04_24_20078907v1
│ ├── fulltext.pdf
│ ├── landingPage.html
│ └── scrapedMetadata.html
└── __metadata
  ├── hitList1.clean.html
  └── hitList1.html
12 directories, 35 files
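A layout like this can be checked after a run with a short sketch such as the one below, which lists the per-article directories that lack a fulltext.pdf. It is an illustration only: `ProjectCheck` and `missingFulltext` are hypothetical names, and skipping directories whose name starts with "__" is an assumption based on the __metadata directory shown above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ProjectCheck {

    /** Returns the names of article directories under project that have no fulltext.pdf. */
    public static List<String> missingFulltext(Path project) throws IOException {
        try (Stream<Path> children = Files.list(project)) {
            return children
                .filter(Files::isDirectory)
                // assumption: "__"-prefixed directories (e.g. __metadata) are not articles
                .filter(dir -> !dir.getFileName().toString().startsWith("__"))
                .filter(dir -> !Files.exists(dir.resolve("fulltext.pdf")))
                .map(dir -> dir.getFileName().toString())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the project directory used in the example run.
        Path project = Paths.get(args.length > 0 ? args[0] : "target/medrxiv/ebola");
        List<String> missing = missingFulltext(project);
        System.out.println(missing.size() + " directories without fulltext.pdf: " + missing);
    }
}
```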