pygetpapers: test reports

ShweataNHegde edited this page Mar 4, 2021 · 27 revisions

Alpha test

Tester 1: Ambreen H

Running pygetpapers in commandline

  • cloned the repository to the local computer: git clone https://github.com/petermr/dictionary.git
  • changed the working directory using cd to: cd dictionary/pygetpapersdev
  • installed requirements using pip: pip install -r requirements.txt
  • used the function apipaperdownload() and commented out the scraper function.
  • ran the entire script: python testpapers.py

Results:

  • the folder papers was successfully created with the required papers.
  • each folder contained a pickle file and an XML
  • 9 papers were downloaded out of the 20 requested
  • no errors
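
The gap between the requested and downloaded counts can be checked after a run by counting the paper folders that hold an XML file. The following is a minimal sketch, assuming the layout reported above (one subfolder per paper inside papers, each with an XML file). The function name count_downloaded is hypothetical and is not part of pygetpapers.

```python
from pathlib import Path


def count_downloaded(papers_dir):
    """Count paper subfolders that contain at least one .xml file.

    Assumes one subfolder per paper inside the papers folder.
    Illustrative helper only; not part of pygetpapers.
    """
    root = Path(papers_dir)
    if not root.is_dir():
        return 0
    return sum(
        1
        for folder in root.iterdir()
        if folder.is_dir() and any(folder.glob("*.xml"))
    )


if __name__ == "__main__":
    requested = 20  # number of papers asked for in this test
    found = count_downloaded("papers")
    print(f"{found} of {requested} requested papers have an XML file")
```

Comparing this count with the requested limit makes a shortfall such as 9 out of 20 visible without opening each folder by hand.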

Comments:

  • a graphical user interface for pygetpapers could be considered
  • more documentation is required for an easier workflow
  • the tool could be turned into a Python library so that it can be imported easily into a script and used
  • a provision for downloading just abstracts and/or full-length papers may improve the functionality

Tester 2: Shweata N. Hegde

OS: Windows 10

Date: 2021-03-02

Prerequisites

  • git clone https://github.com/petermr/dictionary.git
  • open the command line and cd to dictionary/pygetpapersdev
  • install the required modules using pip on the cmd: pip install -r requirements.txt

C:\Users\shweata\dictionary\pygetpapersdev>pip install -r requirements.txt
Requirement already satisfied: requests==2.20.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Requirement already satisfied: pandas_read_xml==0.0.9 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 2)) (0.0.9)
Requirement already satisfied: pandas==1.2.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 3)) (1.2.0)
Requirement already satisfied: lxml==4.6.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller==0.2.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 5)) (0.2.2)
Requirement already satisfied: xmltodict==0.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Requirement already satisfied: selenium==3.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 7)) (3.12.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2.8.1)
Requirement already satisfied: pyarrow in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (3.0.0)
Requirement already satisfied: distlib in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Requirement already satisfied: zipfile36 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: six>=1.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas==1.2.0->-r requirements.txt (line 3)) (1.15.0)

Running pygetpapers on the cmd

  • Run python main.py --help
Output
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
              [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
 -h, --help            show this help message and exit
 -q QUERY, --query QUERY
                       Add the query you want to search for. Enclose the query in quotes.
 -k LIMIT, --limit LIMIT
                       Add the number of papers you want. Default =100
 -o OUTPUT, --output OUTPUT
                       Add the output directory url. Default is the current working directory
 -v, --onlyquery       Only makes the query and stores the result.
 -p FROMPICKLE, --frompickle FROMPICKLE
                       Reads the picke and makes the xml files. Takes the path to the pickle as the input
 -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
 --api                 Get papers using the official EuropePMC api
 --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                       or review papers.
 --onlyresearcharticles
                       Get only research papers (Only works with --webscraping)
 --onlypreprints       Get only preprints (Only works with --webscraping)
 --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query: python main.py -q "invasive plant species" -k 20 -m -o ips_test
  • It took around 3 min.
  • The software created "ips_test" in the current directory
  • A folder called papers was created within "ips_test" where both the .XML and .pdf of 20 papers were downloaded. The corresponding pickle files were also created for each paper.
  • Apart from that,
    • a .csv file with PMC id, HTML link, pdf link, title, and the author info was created in papers
    • a pickle file with the information on all papers (useful if the network connection drops, since the download can resume from where it left off)
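
The resume behaviour described for the pickle file can be illustrated with a generic checkpoint pattern. This is only a sketch: the actual structure of the pickle written by pygetpapers is not documented here, so the record layout (a dict keyed by paper id with a "downloaded" flag) and the three function names are assumptions made for illustration.

```python
import pickle
from pathlib import Path


def save_checkpoint(path, records):
    """Write the records dict to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(records, handle)


def load_checkpoint(path):
    """Load the records dict, or return an empty dict if no file exists."""
    if not Path(path).exists():
        return {}
    with open(path, "rb") as handle:
        return pickle.load(handle)


def pending_ids(records):
    """Return ids of papers not yet marked as downloaded, in stored order."""
    # Assumed record layout: {paper_id: {"downloaded": bool, ...}}
    return [pid for pid, rec in records.items() if not rec.get("downloaded")]


if __name__ == "__main__":
    # Placeholder ids, in memory only; nothing is written to disk here.
    demo = {"PMC0000001": {"downloaded": True}, "PMC0000002": {}}
    print(pending_ids(demo))
```

After an interrupted run, loading the checkpoint and iterating over pending_ids would skip the papers that were already fetched.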

Comments

  • It might be useful to add a timer to know how much time it takes to download each paper.
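
One way such a timer could be added is a small context manager around the download step. This is a sketch under stated assumptions: download_paper is a hypothetical stand-in for whatever pygetpapers does per paper, and the paper ids are placeholders.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log=print):
    """Report how long the wrapped block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"{label}: {elapsed:.2f} s")


def download_paper(paper_id):
    """Hypothetical stand-in for the real per-paper download step."""
    time.sleep(0.01)


if __name__ == "__main__":
    for paper_id in ["PMC0000001", "PMC0000002"]:  # placeholder ids
        with timed(f"download {paper_id}"):
            download_paper(paper_id)
```

Because the timing happens in a finally clause, the elapsed time is reported even when a download raises an error.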

Error:

C:\Users\shweata\dictionary\pygetpapersdev>python main.py -q "anthoceros agrestis" -k 50 -m -o "anthoceros"
*/Building the Query*/
*/Making the Request to get all hits*/
*/Got the Content*/
*/Building the Query*/
*/Making the Request to get all hits*/
*/Got the Content*/
Traceback (most recent call last):
  File "C:\Users\shweata\dictionary\pygetpapersdev\main.py", line 375, in <module>
    callpygetpapers.handlecli()
  File "C:\Users\shweata\dictionary\pygetpapersdev\main.py", line 358, in handlecli
    self.apipaperdownload(args.query, args.limit,
  File "C:\Users\shweata\dictionary\pygetpapersdev\main.py", line 298, in apipaperdownload
    query_result = self.europepmc(query, size)
  File "C:\Users\shweata\dictionary\pygetpapersdev\main.py", line 122, in europepmc
    for paper in output_dict["responseWrapper"]["resultList"]["result"]:
TypeError: 'NoneType' object is not subscriptable
  • The above query had only 44 hits (checked by searching EuropePMC manually), and the software failed to return results when I requested 50 papers (i.e., -k 50).
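
The traceback shows that one of the nested values in output_dict was None. A plausible cause, not confirmed here, is that a further page of results was requested after all 44 hits had been returned. The sketch below shows one defensive way to read the results. The key names are taken from the traceback; the function name extract_results is hypothetical and is not part of pygetpapers.

```python
def extract_results(output_dict):
    """Return the list of papers in a parsed response, or [] if there are none.

    Key names come from the traceback above. How the real response looks
    when no hits remain is an assumption.
    """
    wrapper = (output_dict or {}).get("responseWrapper") or {}
    result_list = wrapper.get("resultList") or {}
    results = result_list.get("result") or []
    # A parser such as xmltodict yields a single dict, not a list,
    # when only one child element is present.
    if isinstance(results, dict):
        results = [results]
    return results


if __name__ == "__main__":
    empty_page = {"responseWrapper": {"resultList": None}}
    print(extract_results(empty_page))
```

With a guard like this, the loop over papers would simply end on an empty page instead of raising a TypeError.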

Tester 3: Radhu Ladani

OS: Windows 10

Running pygetpapers in commandline

  • cloned the repository to the device at a specific path using the git command: git clone https://github.com/petermr/dictionary.git
  • opened the commandline and changed the working directory using cd to: cd dictionary/pygetpapersdev
  • installed the required modules using pip: pip install -r requirements.txt
  • ran pygetpapers on the commandline: python main.py --help
Output
C:\Users\DELL\Radhu\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
               [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query: python main.py -q "Medicinal Activities" -k 50 -m -o output_test

  • This query created "output_test" in the current directory, as specified by the -o option.

  • A folder called papers was created within "output_test" where both the .XML and .pdf of 50 papers were downloaded. The corresponding pickle files were also created for each paper.

  • A .csv file with PMC id, HTML link, pdf link, title, and the author info was created in papers

  • no errors

Comments

  • We could provide some guidelines so that non-technical users can understand how to use the tool.

Tester 4: Kanishka Parashar

OS: Windows 10

Running pygetpapers in commandline

3 March 2021

  • cloned the repo to the local device using the git command: git clone https://github.com/petermr/dictionary.git
  • opened the commandline and changed the directory using cd: cd dictionary/pygetpapersdev
  • installed the requirements using pip: pip install -r requirements.txt
  • ran the help command on cmd: python main.py --help
  • ran the query: python main.py -q "invasive plant species" -k 100 -m -o plantinvasivepaper
  • a folder named plantinvasivepaper was created, containing both the .XML and .pdf of 100 papers
  • no errors

Tester 5: Talha Hasan

OS: Windows 10

Running pygetpapers in commandline

  • cloned the repository to the device using the git command: git clone https://github.com/petermr/dictionary.git
  • changed the working directory using cd to: cd dictionary/pygetpapersdev
  • installed requirement module using pip: pip install -r requirements.txt
  • Running pygetpapers on commandline: python main.py --help
Output
C:\Users\talha hasan\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
               [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)
  • the example query used to download test papers on the topic plant compounds: python main.py -q "plant compounds" -k 20 -m -o plantcomp

  • This query created "plantcomp" in the current directory, as specified by the -o option.

  • a folder named papers was automatically created inside "plantcomp", where both the .XML and .pdf of 50 papers related to plant compounds were downloaded, along with their pickle files

  • no errors

Tester 6: Vasant Kumar

OS: Windows 10

To run pygetpapers in commandline

3 March 2021

  • Cloned the repository to the local device using the git command: git clone https://github.com/petermr/dictionary.git
  • Opened cmd and changed the directory using cd: cd dictionary/pygetpapersdev
  • Installed the required modules using pip: pip3 install -r requirements.txt --user, because pip install -r requirements.txt didn't work.

Output

Requirement already satisfied: requests==2.20.0 in c:\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Collecting pandas_read_xml==0.0.9
  Using cached pandas_read_xml-0.0.9-py3-none-any.whl (6.2 kB)
Collecting pandas==1.2.0
  Using cached pandas-1.2.0-cp39-cp39-win_amd64.whl (9.3 MB)
Collecting lxml==4.6.2
  Using cached lxml-4.6.2-cp39-cp39-win_amd64.whl (3.5 MB)
Collecting chromedriver_autoinstaller==0.2.2
  Using cached chromedriver_autoinstaller-0.2.2-py3-none-any.whl (5.9 kB)
Requirement already satisfied: xmltodict==0.12.0 in c:\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Collecting selenium==3.12.0
  Using cached selenium-3.12.0-py2.py3-none-any.whl (946 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: idna<2.8,>=2.5 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: zipfile36 in c:\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: distlib in c:\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Collecting pyarrow
  Using cached pyarrow-3.0.0-cp39-cp39-win_amd64.whl (12.6 MB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: numpy>=1.16.5 in c:\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting six>=1.5
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: pyarrow, pytz, six, python-dateutil, pandas, pandas-read-xml, lxml, chromedriver-autoinstaller, selenium
  WARNING: The script plasma_store.exe is installed in 'C:\Users\vasan\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script chromedriver-path.exe is installed in 'C:\Users\vasan\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed chromedriver-autoinstaller-0.2.2 lxml-4.6.2 pandas-1.2.0 pandas-read-xml-0.0.9 pyarrow-3.0.0 python-dateutil-2.8.1 pytz-2021.1 selenium-3.12.0 six-1.15.0
WARNING: You are using pip version 20.2.3; however, version 21.0.1 is available.
You should consider upgrading via the 'c:\python39\python.exe -m pip install --upgrade pip' command.

Output of python main.py --help

usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping] [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)

  • Example query: python main.py -q "plant part" -k 20 -m -o plantpart
  • A folder called "plantpart" was created where both the .XML and .pdf of 20 papers were downloaded.
  • No errors

Comments

  • A basic step-by-step guide is needed for beginners