
pygetpapers: test reports


Alpha test

Tester 1: Ambreen H

Running pygetpapers in commandline

  • cloned the repository to the local computer: git clone https://github.com/petermr/dictionary.git
  • changed the working directory using cd to: cd dictionary/pygetpapersdev
  • installed requirements using pip: pip install -r requirements.txt
  • used the function apipaperdownload() and commented out the scraper function (a sketch of this edit follows the list).
  • ran the entire script: python testpapers.py
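For reference, a minimal sketch of what the edited part of testpapers.py might have looked like for this run; only apipaperdownload() is named in the notes above, so the scraper function name here is an assumption:

# testpapers.py -- hypothetical sketch of the alpha-test edit described above
# (the scraper function name is an assumption; only apipaperdownload() is named in the notes)

# webscrapepaperdownload()   # scraper-based path, commented out for this test
apipaperdownload()           # API-based path; downloads the requested papers into the papers folder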

Results:

  • the folder papers was successfully created with the required papers.
  • each folder contained a pickle file and an XML file
  • 9 papers were downloaded out of the 20 requested
  • no errors

Comments:

  • a graphical user interface for pygetpapers could be considered
  • more documentation is required for an easier workflow
  • the whole tool could be turned into a Python library so that it can be imported into a script and used directly (a hypothetical sketch follows this list)
  • a provision for downloading just abstracts and/or full-length papers may improve the functionality
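To illustrate the library suggestion, a purely hypothetical sketch of what an importable interface might look like; none of these names exist in the current code, and the parameters simply mirror the command-line flags documented below:

# Hypothetical API -- not implemented; shown only to illustrate the suggestion above.
import pygetpapers

pygetpapers.download(
    query="invasive plant species",  # search term, as with -q on the command line
    limit=20,                        # number of papers, as with -k
    output="papers_out",             # output directory, as with -o
    makepdf=True,                    # also fetch PDFs, as with -m
)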

Tester 2: Shweata N. Hegde

OS: Windows 10

Date: 2021-03-02

Prerequisites

  • git clone https://github.com/petermr/dictionary.git
  • open the command line and cd to dictionary/pygetpapersdev
  • install the required modules using pip on the cmd: pip install -r requirements.txt

C:\Users\shweata\dictionary\pygetpapersdev>pip install -r requirements.txt
Requirement already satisfied: requests==2.20.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Requirement already satisfied: pandas_read_xml==0.0.9 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 2)) (0.0.9)
Requirement already satisfied: pandas==1.2.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 3)) (1.2.0)
Requirement already satisfied: lxml==4.6.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller==0.2.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 5)) (0.2.2)
Requirement already satisfied: xmltodict==0.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Requirement already satisfied: selenium==3.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 7)) (3.12.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2.8.1)
Requirement already satisfied: pyarrow in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (3.0.0)
Requirement already satisfied: distlib in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Requirement already satisfied: zipfile36 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: six>=1.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas==1.2.0->-r requirements.txt (line 3)) (1.15.0)

Running pygetpapers on the cmd

  • Run python main.py --help
Output
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
              [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
 -h, --help            show this help message and exit
 -q QUERY, --query QUERY
                       Add the query you want to search for. Enclose the query in quotes.
 -k LIMIT, --limit LIMIT
                       Add the number of papers you want. Default =100
 -o OUTPUT, --output OUTPUT
                       Add the output directory url. Default is the current working directory
 -v, --onlyquery       Only makes the query and stores the result.
 -p FROMPICKLE, --frompickle FROMPICKLE
                       Reads the picke and makes the xml files. Takes the path to the pickle as the input
 -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
 --api                 Get papers using the official EuropePMC api
 --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                       or review papers.
 --onlyresearcharticles
                       Get only research papers (Only works with --webscraping)
 --onlypreprints       Get only preprints (Only works with --webscraping)
 --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query: python main.py -q "invasive plant species" -k 20 -m -o ips_test
  • It took around 3 min.
  • The software created "ips_test" in the current directory
  • A folder called papers was created within "ips_test" where both the .XML and .pdf of 20 papers were downloaded. The corresponding pickle files were also created for each paper.
  • Apart from that,
    • a .csv file with the PMC id, HTML link, pdf link, title, and author info was created in papers
    • a pickle file with the information on all papers was also created (useful if the network breaks off; the download can resume from where it left off)
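If the network does break off, the help output above suggests the saved pickle can be fed back in with -p/--frompickle to rebuild the XML files without repeating the query, e.g. python main.py -p <path-to-pickle> (the path is a placeholder for the pickle written into the papers folder).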

Comments

  • It might be useful to add a timer to know how much time it takes to download each paper (a sketch is given below).
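A minimal sketch of how such a per-paper timer could look; download_one_paper() is a placeholder for whatever function pygetpapers actually uses to fetch a single paper:

import time

def download_one_paper(pmcid):
    # placeholder for the real per-paper download call inside pygetpapers
    ...

pmcid = "PMC0000000"  # example id, for illustration only
start = time.time()
download_one_paper(pmcid)
print(f"Downloaded {pmcid} in {time.time() - start:.1f} s")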

Tester 3: Radhu Ladani

OS: Windows 10

Running pygetpapers in commandline

  • cloned the repository to the device at a chosen path using git: git clone https://github.com/petermr/dictionary.git
  • opened the command line and changed the working directory using cd: cd dictionary/pygetpapersdev
  • installed the required modules using pip: pip install -r requirements.txt
  • ran pygetpapers on the command line: python main.py --help
Output
C:\Users\DELL\Radhu\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
               [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query: python main.py -q "Medicinal Activities" -k 50 -m -o output_test

  • This query created "output_test" in the current working directory.

  • A folder papers was created within "output_test" where both the .XML and .pdf of 50 papers were downloaded. The corresponding pickle files were also created for each paper.

  • A .csv file with the PMC id, HTML link, pdf link, title, and author info was created in papers (a snippet for inspecting it follows this list).

  • no errors
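A quick way to inspect that CSV with pandas (already installed via requirements.txt); the glob below is used because the exact CSV file name is not recorded above:

import pandas as pd
from pathlib import Path

papers_dir = Path("output_test") / "papers"
csv_file = next(papers_dir.glob("*.csv"))   # pick up whatever CSV pygetpapers wrote here
df = pd.read_csv(csv_file)
print(df.columns.tolist())  # expect PMC id, HTML link, pdf link, title and author columns
print(df.head())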

Comments

  • Some guidelines could be provided so that non-technical users can follow the workflow.

Tester 4: Kanishka Parashar

OS: Windows 10

Running pygetpapers in commandline

Date: 2021-03-03

  • cloned the repo to the local device using git: git clone https://github.com/petermr/dictionary.git
  • opened the command line and changed the working directory using cd: cd dictionary/pygetpapersdev
  • installed the requirements using pip: pip install -r requirements.txt
  • ran the help command on cmd: python main.py --help
  • ran the query: python main.py -q "invasive plant species" -k 100 -m -o plantinvasivepaper
  • a folder named plantinvasivepaper was created containing both the .XML and .pdf of 100 papers
  • no errors

Tester 5: Talha Hasan

OS: Windows 10

Running pygetpapers in commandline

  • cloned the repository to the device using the git command: git clone https://github.com/petermr/dictionary.git
  • changed the working directory using cd to: cd dictionary/pygetpapersdev
  • installed the required modules using pip: pip install -r requirements.txt
  • ran pygetpapers on the command line: python main.py --help
Output
C:\Users\talha hasan\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
               [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query used to download test papers on plant compounds: python main.py -q "plant compounds" -k 20 -m -o plantcomp

  • This query created "plantcomp" in the current working directory.

  • A folder named papers was automatically created inside "plantcomp", where both the .XML and .pdf of the 20 papers on plant compounds were downloaded, along with a pickle file for each.

  • no errors

Tester 6: Vasant Kumar

OS: Windows 10

Running pygetpapers in commandline

Date: 2021-03-03

  • Cloned the repository to the local device using git: git clone https://github.com/petermr/dictionary.git
  • Opened cmd and changed the directory using cd: cd dictionary/pygetpapersdev
  • Installed the required modules using pip3 install -r requirements.txt --user, because pip install -r requirements.txt didn't work.

Output

Requirement already satisfied: requests==2.20.0 in c:\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Collecting pandas_read_xml==0.0.9
  Using cached pandas_read_xml-0.0.9-py3-none-any.whl (6.2 kB)
Collecting pandas==1.2.0
  Using cached pandas-1.2.0-cp39-cp39-win_amd64.whl (9.3 MB)
Collecting lxml==4.6.2
  Using cached lxml-4.6.2-cp39-cp39-win_amd64.whl (3.5 MB)
Collecting chromedriver_autoinstaller==0.2.2
  Using cached chromedriver_autoinstaller-0.2.2-py3-none-any.whl (5.9 kB)
Requirement already satisfied: xmltodict==0.12.0 in c:\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Collecting selenium==3.12.0
  Using cached selenium-3.12.0-py2.py3-none-any.whl (946 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: idna<2.8,>=2.5 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: zipfile36 in c:\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: distlib in c:\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Collecting pyarrow
  Using cached pyarrow-3.0.0-cp39-cp39-win_amd64.whl (12.6 MB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: numpy>=1.16.5 in c:\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting six>=1.5
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: pyarrow, pytz, six, python-dateutil, pandas, pandas-read-xml, lxml, chromedriver-autoinstaller, selenium
WARNING: The script plasma_store.exe is installed in 'C:\Users\vasan\AppData\Roaming\Python\Python39\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script chromedriver-path.exe is installed in 'C:\Users\vasan\AppData\Roaming\Python\Python39\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed chromedriver-autoinstaller-0.2.2 lxml-4.6.2 pandas-1.2.0 pandas-read-xml-0.0.9 pyarrow-3.0.0 python-dateutil-2.8.1 pytz-2021.1 selenium-3.12.0 six-1.15.0
WARNING: You are using pip version 20.2.3; however, version 21.0.1 is available.
You should consider upgrading via the 'c:\python39\python.exe -m pip install --upgrade pip' command.
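Note: the PATH warnings above only concern console scripts placed in the user Scripts folder. If plain pip is what fails, another commonly used variant is to call pip through the interpreter, e.g. python -m pip install -r requirements.txt --user, which is also the form the log itself suggests for upgrading pip.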

  • Ran pygetpapers on the command line: python main.py --help

Output

usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping] [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)

  • Example query: python main.py -q "plant part" -k 20 -m -o plantpart
  • A folder named plantpart was created, containing 20 papers in both .XML and .pdf format
  • No errors

Comments

  • A basic step-by-step guide is required for beginners.