Lakshmi Devi Priya edited this page Sep 21, 2020 · 40 revisions

getpapers

primary URL

https://github.com/ContentMine/getpapers

purpose

queries a repository with RESTful API and downloads content in bulk

documentation.

overview and installation: https://github.com/ContentMine/getpapers/blob/master/README.md

For example: https://github.com/petermr/tigr2ess/blob/master/getpapers/OVERVIEW.md

installation

Installation of getpapers involves different steps for each operating system.

Instructions followed: https://github.com/ContentMine/getpapers/blob/master/README.md

See also: https://github.com/petermr/tigr2ess/blob/master/installation/INSTALLATION.md

example of use

Simple: https://github.com/ContentMine/getpapers/blob/master/README.md

For a full example: https://github.com/petermr/tigr2ess/tree/master/getpapers

comments

  • getpapers uses a headless browser (Phantom.js) which still works but is no longer maintained. It is customised for EPMC, IEEE, Crossref and ?arXiv. It needs a RESTful API.
  • the query syntax differs between sites, and so do the escape characters (" or ')
  • default query format is EPMC

usefulness

Helps download full-text content in bulk in a short time.

Installation problems

Various errors can occur while installing getpapers. Follow the instructions; if a problem remains, check the issue section for an existing issue that matches it, or post a new one.

Alternate installation using Docker

If you have Docker installed, you may skip the preceding steps and run getpapers using the following Dockerfile

FROM node:slim

WORKDIR /usr/src/app

RUN npm install --global getpapers

This may be built into an image labeled as paper_getter using the following Docker command

docker build -t paper_getter .

A search may then be performed, dropping the results into a given folder (here labeled results), using the following Docker run command (here searching for the term c4 photosynthesis flaveria)

docker run -it \
 -v $(pwd)/results:/results \
 paper_getter \
 getpapers -p -x -o /results --query 'c4 photosynthesis flaveria'

This command will mount the folder results into the Docker container as it runs the image paper_getter and the command is written so as to place the results in this folder.

Usage problems

Users facing usage problems with getpapers can create an issue describing the problem or refer to an existing one.

On macOS, a shell profile needs to be created before installing nvm. See the section for Tester 8 (Zeyang Charles Li).

Windows

More examples

Some EPMC queries

(please add some queries involving DATE, OR, AND, NOT)
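As a starting point, here is a sketch of a few Europe PMC style queries using AND, OR, NOT and a date range. The search terms are only examples, and FIRST_PDATE is assumed here to be the Europe PMC field for the first publication date, so check the current Europe PMC search syntax before relying on it. The helper count_hits is a hypothetical name, not part of getpapers; it only uses the --query and --noexecute options, so nothing is downloaded.

```shell
#!/usr/bin/env bash
# Illustrative Europe PMC queries (the search terms are examples only).
# Single quotes keep the inner double quotes as part of the query.
Q_AND='"viral epidemics" AND masks'
Q_OR='zika OR dengue'
Q_NOT='"viral epidemics" NOT influenza'
# Assumption: FIRST_PDATE is the Europe PMC first-publication-date field.
Q_DATE='"viral epidemics" AND FIRST_PDATE:[2020-01-01 TO 2020-06-30]'

# count_hits QUERY: hypothetical helper that reports how many results
# match the query without downloading anything (--noexecute).
count_hits() {
  getpapers --query "$1" --noexecute
}

# Example: count_hits "$Q_DATE"
```

Keeping each query in a single-quoted variable avoids most of the escaping problems mentioned in the comments section, because the whole query reaches getpapers as one argument.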

tester experiences:

tester 1

Kareena Singh

operating system

Windows 10

INSTALLATION PROCESS

(1) Installation of nvm-windows

source of instructions

https://github.com/petermr/tigr2ess/blob/master/installation/windows/INSTALLATION.md

steps of installation

Go to the downloads page and download latest version of nvm-setup.zip.

Unzip the downloaded file and run the included installer.

installation

successfully installed and run

(2) Installation of node

source of instructions

https://github.com/petermr/tigr2ess/blob/master/installation/windows/INSTALLATION.md

steps of installation

Open your command prompt, and run the following commands one after the other.

  • nvm install 7
  • nvm use 7.10.1

installation of node

successful

installation problems

The following installation problem occurred when I entered the node installation command at the command line

Error: Access to the registry path is denied
  • reason: insufficient privileges to install (requires administrator rights on Windows)

  • solution

test of installation

successful

node --version
version 11.11.0

(3) Installation of getpapers

source of instructions

https://github.com/petermr/tigr2ess/blob/master/installation/windows/INSTALLATION.md

installation steps

  • Run the following command at command prompt: npm install --global getpapers

  • Now run the command getpapers at the command prompt, and you should see something as below:

installation problems

none reported

test of installation

You can test the installation by running the command getpapers --version. If you get the following, the installation is successful:

0.4.17

Tester 2

Lakshmi Devi Priya

Operating System

Windows 10

For macOS users, please refer to Tester 8's documentation.

Installation of node

source of instruction

Instructions from: https://github.com/ContentMine/getpapers/blob/master/README.md

installation of node

Successful

test of installation

Running the command below in the command prompt confirmed the installation.

C:\Users>node -v
v12.16.3

Installation of getpapers

source of instruction

Instructions from: https://github.com/ContentMine/getpapers/blob/master/README.md

installation of getpapers

Successful

installation problems

C:\Users>$ npm install --global getpapers

`$` is not recognized as an internal or external command,
operable program or batch file.
  • '$' is not part of the command (it is the UNIX prompt).

  • So run the command without it:

npm install --global getpapers
  • getpapers was installed successfully.

test of installation

Use

C:\Users>getpapers --help
  • The options available for getpapers are shown as below:
  Usage: getpapers [options]

  Options:

    -h, --help                output usage information
    -V, --version             output the version number
    -q, --query <query>       search query (required)
    -o, --outdir <path>       output directory (required - will be created if not found)
    --api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
    -x, --xml                 download fulltext XMLs if available
    -p, --pdf                 download fulltext PDFs if available
    -s, --supp                download supplementary files if available
    -t, --minedterms          download text-mined terms if available
    -l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)
    -a, --all                 search all papers, not just open access
    -n, --noexecute           report how many results match the query, but don't actually download anything
    -f, --logfile <filename>  save log to specified file in output directory as well as printing to terminal
    -k, --limit <int>         limit the number of hits and downloads
    --filter <filter object>  filter by key value pair, passed straight to the crossref api only
    -r, --restart             restart file downloads after failure
  • Successfully installed getpapers.

Use of getpapers

Followed example from:

https://github.com/petermr/tigr2ess/tree/master/getpapers

To search for a query on a specific topic

use the following syntax
getpapers -q <query> -n -k <int>

-q, --query : search query (required)

-n, --noexecute : only reports how many results match the query, but doesn't actually download anything

For example, for the query COVID-19

Use as

getpapers -q COVID-19 -n -k 100

The results will be shown as below:

info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 68721 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.4 reported by api

Output: found 68721 open access results. This many results cannot practically be downloaded, so the number of downloads should be limited.

-k, --limit : limits the number of hits and downloads

<int> is an integer giving the maximum number of files to download.

To download the files, use the following syntax
getpapers -q <query> -k <int> -o <Cproject> -x -p

-o, --outdir : output directory (required - will be created if not found). This directory is known as the Cproject. The option gives the path to the directory in which the downloaded files are stored.

-p, --pdf : downloads fulltext PDFs if available.

-x, --xml : downloads fulltext XMLs if available.

Thus, for the query COVID-19 the syntax

getpapers -q COVID-19 -k 100 -o covid -x -p

gives a result similar to the following:

info: Searching using eupmc API
info: Found 68721 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.4 reported by api
info: Limiting to 100 hits
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: limiting hits
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC7442033" was not Open Access (therefore no XML)
warn: Article with doi "10.1101/2020.08.04.20168112 did not have a PMCID (therefore no XML)
.
.
.
info: Got XML URLs for 57 out of 100 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (57/57) [5.8s elapsed, eta 0.0]
info: All downloads succeeded!
warn: Article with pmcid "PMC7442033" had no fulltext PDF url
.
.
.
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (58/58) [29.6s elapsed, eta 0.0]
info: All downloads succeeded!

The downloaded PMC folders are known as Ctrees. They contain xml and pdf files as their children.

.xml files in the resultant folder are both machine-readable and human-readable.

Here, only 57 .xml files and 58 .pdf files were downloaded. The number of .xml and .pdf files obtained depends on the query used.
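To check such numbers after a run, the sketch below counts the per-article folders (Ctrees) in an output directory and how many of them contain XML and PDF full text. It is written for a POSIX shell such as bash, and it assumes the downloaded files are named fulltext.xml and fulltext.pdf; count_fulltext is a hypothetical helper name, not part of getpapers.

```shell
#!/usr/bin/env bash
# count_fulltext CPROJECT: count the per-article folders (Ctrees) in a
# getpapers output directory and how many contain XML / PDF full text.
# Assumption: the files are named fulltext.xml and fulltext.pdf.
count_fulltext() {
  local cproject="$1" trees=0 xml=0 pdf=0 dir
  for dir in "$cproject"/*/; do
    [ -d "$dir" ] || continue   # skip the unexpanded pattern if no folders exist
    trees=$((trees + 1))
    if [ -f "${dir}fulltext.xml" ]; then xml=$((xml + 1)); fi
    if [ -f "${dir}fulltext.pdf" ]; then pdf=$((pdf + 1)); fi
  done
  echo "ctrees=$trees xml=$xml pdf=$pdf"
}

# Example: count_fulltext covid
```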

Tester 3

Pruthiv rajan

Operating System

Windows 10

Installation of node

source of instruction

Instructions from: https://github.com/ContentMine/getpapers/blob/master/README.md

installation of node

Successful

test of installation

Successful

C:\Users>node -v
v12.16.3

Installation of getpapers

source of instruction

Instructions from: https://github.com/ContentMine/getpapers/blob/master/README.md

installation of getpapers

Successful

installation problems

No problems.

test of installation

Use

C:\Users>getpapers --help
  • The command options available for getpapers are displayed.

  • Installed getpapers.

Use of getpapers

Followed example from:

https://github.com/petermr/tigr2ess/tree/master/getpapers

To search for a query on a specific topic

use the following syntax
getpapers -q <query> -n -k <int>

-q, --query : search query (required)

-n, --noexecute : only reports how many results match the query, but doesn't actually download anything

For example, for the query COVID-19

Use as

getpapers -q COVID-19 -n -k 100

The results will be shown as below:

https://drive.google.com/file/d/1DP0_xcjC5GMQ2CflM7TQUoyIO3J3MyCW/view?usp=drivesdk

Output: found 46887 open access results. This many results cannot practically be downloaded, so the number of downloads should be limited.

-k, --limit : limits the number of hits and downloads

<int> is an integer giving the maximum number of files to download.

To download the files, use the following syntax
getpapers -q <query> -k <int> -o <path> -x -p

-o, --outdir : output directory(required - will be created if not found).

This option gives the path to the directory in which the downloaded files are stored.

-p, --pdf : downloads fulltext PDFs if available.

-x, --xml : downloads fulltext XMLs if available.

Thus, for the query Human genome project the syntax

getpapers -q "human genome project" -k 100 -o covid -x -p

gives the result as follows.

The expected 100 .xml files were downloaded, but only 84 .pdf files.


Tester 4:

Name: Ambreen Hamadani

Operating System: Windows 10

INSTALLATION PROCESS

Installation of Node.Js

Preinstalled on the System

Installation of getpapers

Source of Instruction: ContentMine / getpapers

Steps in the Installation:

  1. Open Command Prompt
  2. Run the installation command given at https://github.com/ContentMine/getpapers

Installation: Successful

Test of the Installation:

  1. Type getpapers in Command Prompt
  2. Usage and options displayed

Successful installation

Usage of getpapers: Test 1

The tool was used to retrieve 100 papers on the topic 'masks', with the output directory specified as 'test1'.

Command used: getpapers --query 'masks ' --limit 100 --outdir test1

Results

  1. A new directory (test1) created within the home directory
  2. 100 folders (PMC###) created within 'test1', each containing a JSON file (eupmc_result)
  3. 1 text file (eupmc_fulltext_html_urls) containing the URLs of all downloaded documents
  4. 1 JSON file (eupmc_results) created

Command line output

  1. 0 error messages
  2. No warnings

Query results for getting papers on masks (limit 100)

Usage of getpapers: Test 2

The tool was used to retrieve 200 papers on the topic 'viral epidemics', with the output directory specified as 'test3'.

Command used: getpapers --query 'viral epidemics' --limit 200 --outdir test3

Results

  1. A new directory (test3) created within the home directory
  2. 200 folders (PMC###) created within 'test3', each containing a JSON file (eupmc_result)
  3. 1 text file (eupmc_fulltext_html_urls) containing the URLs of all downloaded documents
  4. 1 JSON file (eupmc_results) created

Command line output

  1. 0 error messages
  2. 2 warnings (warn: This version of getpapers wasn't built with this version of the EuPMC api in mind; warn: getpapers EuPMCVersion: 5.3.2 vs. 6.3 reported by api)

Query results for getting papers on viral epidemics (limit 200)
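The layout listed in these results can be checked with a short script. The sketch below is for a POSIX shell such as bash (an assumption, since this tester used the Windows Command Prompt), and check_cproject is a hypothetical helper name. It assumes the summary files are called eupmc_results.json and eupmc_fulltext_html_urls.txt and that each article folder holds an eupmc_result.json file.

```shell
#!/usr/bin/env bash
# check_cproject OUTDIR EXPECTED: verify that a getpapers output directory
# has the two summary files and EXPECTED per-article metadata files.
# Assumption: file names are as given in the comments above the function.
check_cproject() {
  local cproject="$1" expected="$2" found=0 dir
  if [ ! -f "$cproject/eupmc_results.json" ]; then
    echo "missing eupmc_results.json"; return 1
  fi
  if [ ! -f "$cproject/eupmc_fulltext_html_urls.txt" ]; then
    echo "missing eupmc_fulltext_html_urls.txt"; return 1
  fi
  for dir in "$cproject"/*/; do
    if [ -f "${dir}eupmc_result.json" ]; then found=$((found + 1)); fi
  done
  echo "found $found of $expected article folders"
  [ "$found" -eq "$expected" ]   # exit status reports success or failure
}

# Example: check_cproject test3 200
```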

Tester 5:

Name: Vaishali Arora

Operating System: Windows 10

INSTALLATION STEPS

  1. Installation of Node.js. Reference: https://github.com/petermr/tigr2ess/blob/master/installation/INSTALLATION.md

  2. Installation of getpapers. Reference: https://github.com/petermr/tigr2ess/blob/master/installation/INSTALLATION.md

Test of installation:

Successful

  1. Type getpapers in Command Prompt

Usage of getpapers

1. Downloaded 100 papers on the topic 'COVID-19' (PDF files)

Commands Used:

getpapers -q "COVID-19" -p -k 100 -o covid_19

Successfully downloaded 100 papers with 1 .json file and 1 .txt file

Command Line Output:

  1. 0 error messages
  2. 2 Warnings

RESULTS: https://drive.google.com/file/d/1rKgNGojNacMPLeViSPykpXgGsJg0zFUk/view?usp=sharing

2. Downloaded 100 .xml files on 'COVID deaths' with the directory cdeaths

Commands Used:

getpapers -q "COVID deaths" -o cdeaths -x -k 100

Downloaded with 1 .json file and 1 .txt file

Command Line Output

  1. 0 error messages
  2. 2 Warnings

Reference: https://github.com/petermr/tigr2ess/blob/master/getpapers/TUTORIAL.md

Tester 6:

Vanisha Arora

Operating system:

Windows 10

Installation of node:

Source of instructions: https://github.com/petermr/tigr2ess/blob/master/installation/INSTALLATION.md

Installation of getpapers:

Instructions from:

https://github.com/ContentMine/getpapers/blob/master/README.md

Installation of getpapers:

Successful

Test of installation:

Put the command getpapers --version in the command prompt.

Getting 0.4.17 confirms installation.

To search for a query on a specific topic:

getpapers -q "query" -n -k 50 (if 50 articles are to be downloaded)

For example, for the query viral epidemics, use

getpapers -q "viral epidemics" -n -k 50

-p, --pdf : for downloading .pdf files

-x, --xml : for downloading .xml files

Thus, for the query viral epidemics the syntax

getpapers -q "viral epidemics" -k 50 -o "viral epidemics" -x -p

downloaded 50 files (pdf and xml) on viral epidemics under the directory viral epidemics

Tester 7

NAME

SANA SAIFI

OPERATING SYSTEM

WINDOWS 10

INSTALLATION PROCESS

1. Installation of nvm-windows

SOURCE: https://github.com/petermr/tigr2ess/blob/master/installation/INSTALLATION.md

A. Scroll to the section Software Installation and click on the appropriate link, depending on your operating system.

B. Go to the download page (https://github.com/coreybutler/nvm-windows/releases) and download the latest version of nvm-setup.zip.

C. Run the file and install it on Windows.

2. Installation of getpapers

SOURCE: https://github.com/petermr/tigr2ess/blob/master/installation/windows/INSTALLATION.md

3. Test of Installation

A. Run the command getpapers in the command prompt.

B. Various usage options are displayed with their meanings.

4. Installation of getpapers

Successful

4 warnings 2 errors

GETPAPERS

  1. Why are we using it?

To search for a query on a specific topic and download a chosen number of research papers from an open-access source.

  2. How to use it?

TEST

To download 100 .pdf/.xml files on viral epidemics,

open the command prompt and

type the syntax getpapers -q "viral epidemics" -k 100 -o "viral epidemics" -x -p

Downloaded 77 files out of 100 from the open-access source under the directory Viral Epidemics.

Tester 8: Zeyang Charles Li

Operating system: macOS Mojave 10.14.1

Node.js installation

Followed instructions in https://github.com/blahah/installing-node-tools#macos

Successfully installed Xcode but nvm installation showed problems.

Typed in

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.30.1/install.sh | bash

curl is built into macOS, so this did not show any curl errors; if your Mac does not have curl, install it first, for example with Homebrew (apt-get is not available on macOS)

brew install curl

However, the command failed with a connection error for the URL in the previous line (Error 1 in shared folder).

I then copied and pasted the whole content of https://raw.githubusercontent.com/creationix/nvm/v0.30.1/install.sh into the terminal, without the curl command. This downloaded nvm, but the installer did not find a profile (Error 2).

So a profile is needed before installing nvm

touch ~/.bash_profile

This creates a profile for nvm; the content of https://raw.githubusercontent.com/creationix/nvm/v0.30.1/install.sh can then be copied and pasted again to rerun the nvm installation.

After following the instructions in the terminal, you can test nvm with

nvm --version

which should print the version of nvm installed on your computer (Final in shared folder)

You can then test whether Node.js is installed with

node -v

If you have other problems installing nvm on macOS, see this discussion page

https://github.com/nvm-sh/nvm/issues/576

Getpapers installation

npm install --global getpapers (remove the $ at the start of the line)

During installation, 9 warnings were shown and no errors occurred.

getpapers was then tested with a test query in EPMC search format

getpapers --query 'viral epidemic' -k 50 --outdir test

Folder of error screenshots

https://drive.google.com/drive/folders/1d3PJM-bpBco0kmeyTB-kG3FTP-XDpciL?usp=sharing

Past error: EPMC timed out when fetching papers; fixed by changing the internet proxy.

A reduced corpus of 580 articles was generated

Tester 9

Name: Anugrah SR

Operating System: Ubuntu 18.04 LTS

Installing Nvm and Node

Downloading and installing packages
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.0/install.sh | bash
Configuration

Added these lines to .zshrc or .bashrc

export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"

source .zshrc

nvm --version
0.33.0

Installing node

nvm install 7
nvm use 7
nvm alias default 7

Installing getpapers

npm install --global getpapers

getpapers --help prints the command-line help for getpapers

Testing getpapers
getpapers -q "human genome project" -k 100 -o covid -x -p

getpaper test

Alternate installation using Docker

installing docker

sudo apt-get update

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io

make a Dockerfile named paper_getter

FROM node:slim

WORKDIR /usr/src/app

RUN npm install --global getpapers

Building the Docker image

docker build -t paper_getter .

# outputs an error
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /home/ubuntu/Dockerfile: no such file or directory

Added the additional flag -f to give the name of the Dockerfile

sudo docker build -t paper_getter -f paper_getter .

Running dockerised getpapers

docker run -it \
 -v $(pwd)/results:/results \
 paper_getter \
 getpapers -p -x -o /results --query 'c4 photosynthesis flaveria'

TESTER 10

Name: Shweata N. Hegde

OS: Windows 7

SOURCE: https://github.com/petermr/tigr2ess/blob/master/installation/windows/INSTALLATION.md

a. DOWNLOADING Node

b. INSTALLATION of Node

  • Run the following command in command prompt
nvm install 7
nvm use 7.10.1

Installation successful without any problems.

c. INSTALLATION OF getpapers

  • Run the following command in the command prompt.
npm install --global getpapers

Installation successful (11 warnings but no errors).

d. USAGE OF getpapers

  1. getpapers was first used to find out how many open access papers were available on a specific topic.
getpapers -q <query> -n
  • -q: search query
  • -n: no-execute (reports the number of results without downloading anything)

E.g. run the following command in the command prompt

getpapers -q "viral epidemics" -n

It was found that 312120 open access results were available for our search. Since we cannot download all of them, we have to limit the downloads.

  2. getpapers was then used to download a limited number of XML and PDF files of open-access papers for a specified query.
getpapers -q <query> -k <int> -o <path> -x -p
  • -q: search query
  • -k: limits the number of hits and downloads <int>: integer
  • -o: output directory, will be created if not found
  • -p: downloads the full-text PDF, if available
  • -x: downloads the full-text XML, if available

In our case, we run the following syntax

getpapers -q "viral epidemics" -k 100 -o test -x -p

Downloaded 83 .XML files and 86 .pdf files under the directory 'test'.
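The two steps above (count first, then download a limited number) can be wrapped in a small shell function. This is only a sketch for a POSIX shell such as bash; on Windows it would need something like Git Bash, which is an assumption and not part of this tester's setup. fetch_papers is a hypothetical helper name, and it uses only the -q, -n, -k, -o, -x and -p options described above.

```shell
#!/usr/bin/env bash
# fetch_papers QUERY LIMIT OUTDIR: hypothetical two-step wrapper.
fetch_papers() {
  local query="$1" limit="$2" outdir="$3"
  # Step 1: report how many results match; nothing is downloaded (-n).
  getpapers -q "$query" -n || return 1
  # Step 2: download XML and PDF full text for at most LIMIT results.
  getpapers -q "$query" -k "$limit" -o "$outdir" -x -p
}

# Example: fetch_papers "viral epidemics" 100 test
```

Quoting "$query" keeps a multi-word query such as viral epidemics together as a single argument.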

Folder of 'warnings' screenshots

Got 17 warnings while downloading .XML files and 14 warnings while downloading pdf files. https://drive.google.com/drive/folders/17d0FJJak7zAZ_OvV6V2Q_iYyNh1v4DMh?usp=sharing

