pygetpapers: test reports

ShweataNHegde edited this page Mar 2, 2021 · 27 revisions

Alpha test

Tester 1: Ambreen H

Running pygetpapers on the command line

  • cloned the repository to the local computer: git clone https://github.com/petermr/dictionary.git
  • changed the working directory using cd to: cd dictionary/pygetpapersdev
  • installed requirements using pip: pip install -r requirements.txt
  • used the function apipaperdownload() and commented out the scraper function.
  • ran the entire script: python testpapers.py

Results:

  • the folder papers was successfully created with the required papers.
  • each paper folder contained a pickle file and an XML file
  • 9 papers were downloaded out of the 20 requested
  • no errors
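
The result above (9 of the 20 requested papers, each folder holding a pickle file and an XML) can be checked with a short script. The following is a minimal sketch and not part of pygetpapers: the function name summarize_papers, the layout of one subfolder per paper, and the .pickle/.pkl extensions are all assumptions.

```python
from pathlib import Path


def summarize_papers(papers_dir, requested):
    """Count paper folders that hold both an XML file and a pickle file.

    Assumes one subfolder per paper inside papers_dir (hypothetical layout).
    """
    root = Path(papers_dir)
    complete, incomplete = [], []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Compare lowercased suffixes so ".XML" and ".xml" both count.
        suffixes = {f.suffix.lower() for f in folder.iterdir() if f.is_file()}
        has_xml = ".xml" in suffixes
        # The pickle extensions are a guess; adjust to what the tool writes.
        has_pickle = bool(suffixes & {".pickle", ".pkl"})
        if has_xml and has_pickle:
            complete.append(folder.name)
        else:
            incomplete.append(folder.name)
    return {
        "requested": requested,
        "complete": len(complete),
        "incomplete": incomplete,
        "missing": max(requested - len(complete) - len(incomplete), 0),
    }


# Example call for the run described above:
# print(summarize_papers("papers", requested=20))
```

A report like this makes a shortfall such as 9 out of 20 visible at a glance, even when the run itself finishes without errors.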

Comments:

  • a graphical user interface could be considered for pygetpapers
  • more documentation is required for an easier workflow
  • consider packaging it as a Python library so that it can be imported into a script and used easily
  • an option to download only abstracts and/or full-length papers may improve the functionality

Tester 2: Shweata N. Hegde

OS: Windows 10

Date: 2021-03-02

Prerequisites

  • clone the repository: git clone https://github.com/petermr/dictionary.git
  • open the command line and cd to dictionary/pygetpapersdev
  • install the required modules using pip on the cmd: pip install -r requirements.txt

C:\Users\shweata\dictionary\pygetpapersdev>pip install -r requirements.txt
Requirement already satisfied: requests==2.20.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Requirement already satisfied: pandas_read_xml==0.0.9 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 2)) (0.0.9)
Requirement already satisfied: pandas==1.2.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 3)) (1.2.0)
Requirement already satisfied: lxml==4.6.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller==0.2.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 5)) (0.2.2)
Requirement already satisfied: xmltodict==0.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Requirement already satisfied: selenium==3.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 7)) (3.12.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2.8.1)
Requirement already satisfied: pyarrow in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (3.0.0)
Requirement already satisfied: distlib in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Requirement already satisfied: zipfile36 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: six>=1.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas==1.2.0->-r requirements.txt (line 3)) (1.15.0)

Running pygetpapers on the cmd

  • Run python main.py --help
Output
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
              [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
 -h, --help            show this help message and exit
 -q QUERY, --query QUERY
                       Add the query you want to search for. Enclose the query in quotes.
 -k LIMIT, --limit LIMIT
                       Add the number of papers you want. Default =100
 -o OUTPUT, --output OUTPUT
                       Add the output directory url. Default is the current working directory
 -v, --onlyquery       Only makes the query and stores the result.
 -p FROMPICKLE, --frompickle FROMPICKLE
                       Reads the picke and makes the xml files. Takes the path to the pickle as the input
 -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
 --api                 Get papers using the official EuropePMC api
 --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                       or review papers.
 --onlyresearcharticles
                       Get only research papers (Only works with --webscraping)
 --onlypreprints       Get only preprints (Only works with --webscraping)
 --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query: python main.py -q "invasive plant species" -k 20 -m -o ips_test
  • It took around 3 min.
  • The software created "ips_test" in the current directory.
  • A folder called papers was created within "ips_test", where both the .XML and .pdf files of 20 papers were downloaded. A corresponding pickle file was also created for each paper.
  • In addition, the following were created:
    • a .csv file in papers with the PMC id, HTML link, pdf link, title, and author info
    • a pickle file with the information on all papers (useful if the network connection drops, because the download can resume from where it left off)
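
The .csv file mentioned above can be inspected without knowing its exact column names. This is a minimal sketch using only the standard library; the function name describe_csv is hypothetical, and the file name in the commented call is a placeholder because the report does not state it.

```python
import csv


def describe_csv(csv_path):
    """Return the header row and the number of data rows in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        n_rows = sum(1 for _ in reader)
    return header, n_rows


# Placeholder path; replace with the actual .csv inside the papers folder:
# header, n_rows = describe_csv("papers/your_file.csv")
# print(header, n_rows)
```

For the run described above, the row count would be expected to match the 20 downloaded papers, assuming the file holds one row per paper.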

Comments

  • It might be useful to add a timer to show how long it takes to download each paper.
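
One possible way to add such a timer is sketched below. It does not reflect how pygetpapers is written: timed, download_all, and the download_one callback are hypothetical names standing in for whatever function fetches a single paper.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log):
    """Store the wall-clock seconds spent inside the with-block in log[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the download raises, so failures are timed too.
        log[label] = time.perf_counter() - start


def download_all(paper_ids, download_one):
    """Call download_one(paper_id) for each id and return per-paper timings."""
    timings = {}
    for paper_id in paper_ids:
        with timed(paper_id, timings):
            download_one(paper_id)
    return timings


# Usage with a stand-in download function (hypothetical):
# timings = download_all(["PMC1", "PMC2"], lambda pid: time.sleep(0.1))
# for pid, seconds in timings.items():
#     print(f"{pid}: {seconds:.2f} s")
```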

Tester 3: Radhu Ladani

OS: Windows 10

Running pygetpapers on the command line

  • cloned the repository to the device using git at a chosen path: git clone https://github.com/petermr/dictionary.git
  • opened the command line and changed the working directory using cd: cd dictionary/pygetpapersdev
  • installed the required modules using pip: pip install -r requirements.txt
  • ran pygetpapers on the command line: python main.py --help
Output
C:\Users\DELL\Radhu\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
               [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)
  • Example query: python main.py -q "Medicinal Activities" -k 50 -m -o output_test
  • This query created "output_test" in the current directory, as specified with the -o option.
  • A folder called papers was created within "output_test", where both the .XML and .pdf files of 50 papers were downloaded. A corresponding pickle file was also created for each paper.
  • A .csv file with the PMC id, HTML link, pdf link, title, and author info was created in papers.
  • no errors
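
In the usage line of the help output above, options grouped in square brackets and separated by "|" cannot be combined. Python's argparse renders mutually exclusive groups this way. The sketch below shows a parser with a similar interface; it is an illustration only, not the actual code in main.py, and the help strings, the "." default for the output directory, and the omission of -v and -p are assumptions.

```python
import argparse


def build_parser():
    """Build a parser resembling the interface shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Welcome to Pygetpapers. -h or --help for help",
    )
    parser.add_argument("-q", "--query", help="Query to search for, in quotes.")
    parser.add_argument("-k", "--limit", type=int, default=100,
                        help="Number of papers wanted.")
    # "." stands in for the current working directory (assumed default).
    parser.add_argument("-o", "--output", default=".", help="Output directory.")
    parser.add_argument("-m", "--makepdf", action="store_true",
                        help="Also fetch PDF files.")

    # Only one retrieval method may be given at a time.
    method = parser.add_mutually_exclusive_group()
    method.add_argument("--api", action="store_true")
    method.add_argument("--webscraping", action="store_true")

    # Only one article-type filter may be given at a time.
    kind = parser.add_mutually_exclusive_group()
    kind.add_argument("--onlyresearcharticles", action="store_true")
    kind.add_argument("--onlypreprints", action="store_true")
    kind.add_argument("--onlyreviews", action="store_true")
    return parser


# Parsing the example query from this report:
# args = build_parser().parse_args(
#     ["-q", "Medicinal Activities", "-k", "50", "-m", "-o", "output_test"])
```

With this sketch, passing both --api and --webscraping makes argparse exit with an error message. Whether main.py behaves the same way was not tested in this report.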

Comments

  • We could provide some guidelines so that people with little computer experience can follow the steps.