pygetpapers: test reports
- cloned the repository to the local computer:
git clone https://github.com/petermr/dictionary.git
- changed the working directory:
cd dictionary/pygetpapersdev
- installed requirements using pip:
pip install -r requirements.txt
- used the function apipaperdownload() and commented out the scraper function
- ran the entire script:
python testpapers.py
- the folder papers was created successfully and contained the downloaded papers
- each paper folder contained a pickle file and an XML file
- 9 papers were downloaded out of the 20 requested
- no errors
- a graphical user interface for pygetpapers could be considered
- more documentation is needed for an easier workflow
- the whole tool could be turned into a Python library so that it can be imported into a script and used easily
- an option to download only abstracts and/or only full-length papers may improve the functionality
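The "9 out of 20" figure above can be checked by counting the paper folders instead of inspecting them by hand. The following is a minimal sketch, assuming that papers holds one subfolder per paper and that a paper counts as downloaded when its subfolder contains an .xml file. The function name count_downloaded is hypothetical and is not part of pygetpapers.

```python
from pathlib import Path


def count_downloaded(papers_dir):
    """Count paper subfolders under papers_dir that contain at least one XML file."""
    root = Path(papers_dir)
    if not root.is_dir():
        return 0
    count = 0
    for folder in root.iterdir():
        # Assumption: one subfolder per paper, each holding an .xml file.
        if folder.is_dir() and any(p.suffix.lower() == ".xml" for p in folder.iterdir()):
            count += 1
    return count


if __name__ == "__main__":
    requested = 20  # number of papers requested in the test run
    found = count_downloaded("papers")
    print(f"{found} of {requested} requested papers have an XML file")
```

Run from the directory that contains papers, this prints how many of the requested papers actually arrived.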
- clone the repository:
git clone https://github.com/petermr/dictionary.git
- open the command line and change to the working directory:
cd dictionary/pygetpapersdev
- install the required modules using pip on the command line:
pip install -r requirements.txt
C:\Users\shweata\dictionary\pygetpapersdev>pip install -r requirements.txt
Requirement already satisfied: requests==2.20.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Requirement already satisfied: pandas_read_xml==0.0.9 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 2)) (0.0.9)
Requirement already satisfied: pandas==1.2.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 3)) (1.2.0)
Requirement already satisfied: lxml==4.6.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller==0.2.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 5)) (0.2.2)
Requirement already satisfied: xmltodict==0.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Requirement already satisfied: selenium==3.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 7)) (3.12.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2.8.1)
Requirement already satisfied: pyarrow in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (3.0.0)
Requirement already satisfied: distlib in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Requirement already satisfied: zipfile36 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: six>=1.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas==1.2.0->-r requirements.txt (line 3)) (1.15.0)
- Run
python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
[--onlyresearcharticles | --onlypreprints | --onlyreviews]
Welcome to Pygetpapers. -h or --help for help
optional arguments:
-h, --help show this help message and exit
-q QUERY, --query QUERY
Add the query you want to search for. Enclose the query in quotes.
-k LIMIT, --limit LIMIT
Add the number of papers you want. Default =100
-o OUTPUT, --output OUTPUT
Add the output directory url. Default is the current working directory
-v, --onlyquery Only makes the query and stores the result.
-p FROMPICKLE, --frompickle FROMPICKLE
Reads the picke and makes the xml files. Takes the path to the pickle as the input
-m, --makepdf Also makes pdf files for the papers. Works only with --api method.
--api Get papers using the official EuropePMC api
--webscraping Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
or review papers.
--onlyresearcharticles
Get only research papers (Only works with --webscraping)
--onlypreprints Get only preprints (Only works with --webscraping)
--onlyreviews Get only review papers (Only works with --webscraping)
- Example query
python main.py -q "invasive plant species" -k 20 -m -o ips_test
- It took around 3 min.
- The software created "ips_test" in the current directory
- A folder called papers was created within "ips_test", where both the .XML and .pdf files of 20 papers were downloaded. The corresponding pickle files were also created for each paper.
- Apart from that, the run also produced:
  - a .csv file with the PMC id, HTML link, pdf link, title, and author info, created in papers
  - a pickle file with the information on all papers (useful if the network connection breaks: the download can resume from where it left off)
- It might be useful to add a timer to show how long it takes to download each paper.
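Until such a timer exists inside pygetpapers, the run can be timed from outside. The sketch below wraps the test command in a small helper and reports the total time and a rough per-paper average. The helper name timed_run is hypothetical. It measures the whole run only; timing each paper individually would need changes in pygetpapers itself.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run command (a list of arguments) and return (exit_code, seconds_elapsed)."""
    start = time.perf_counter()
    completed = subprocess.run(command)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # Same query, limit and output folder as in the test above.
    limit = 20
    cmd = [sys.executable, "main.py", "-q", "invasive plant species",
           "-k", str(limit), "-m", "-o", "ips_test"]
    code, seconds = timed_run(cmd)
    print(f"exit code {code}, {seconds:.1f} s in total, "
          f"about {seconds / limit:.1f} s per requested paper")
```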
- cloned the repository to a chosen path on the device using git:
git clone https://github.com/petermr/dictionary.git
- opened the command line and changed the working directory:
cd dictionary/pygetpapersdev
- installed the required modules using pip:
pip install -r requirements.txt
- ran pygetpapers on the command line:
python main.py --help
C:\Users\DELL\Radhu\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
[--onlyresearcharticles | --onlypreprints | --onlyreviews]
Welcome to Pygetpapers. -h or --help for help
optional arguments:
-h, --help show this help message and exit
-q QUERY, --query QUERY
Add the query you want to search for. Enclose the query in quotes.
-k LIMIT, --limit LIMIT
Add the number of papers you want. Default =100
-o OUTPUT, --output OUTPUT
Add the output directory url. Default is the current working directory
-v, --onlyquery Only makes the query and stores the result.
-p FROMPICKLE, --frompickle FROMPICKLE
Reads the picke and makes the xml files. Takes the path to the pickle as the input
-m, --makepdf Also makes pdf files for the papers. Works only with --api method.
--api Get papers using the official EuropePMC api
--webscraping Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
or review papers.
--onlyresearcharticles
Get only research papers (Only works with --webscraping)
--onlypreprints Get only preprints (Only works with --webscraping)
--onlyreviews Get only review papers (Only works with --webscraping)
- Example query:
python main.py -q "Medicinal Activities" -k 50 -m -o output_test
- This query created "output_test", the output directory named with -o, in the current directory.
- A folder papers was created within "output_test", where both the .XML and .pdf files of 50 papers were downloaded. The corresponding pickle files were also created for each paper.
- A .csv file with the PMC id, HTML link, pdf link, title, and author info was created in papers.
- no errors
- Some guidelines could be added so that people who are not regular computer users can follow along.
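The .csv file mentioned above can be inspected without opening a spreadsheet. The following is a minimal sketch, assuming the file has a header row and sits directly inside output_test/papers. Its exact file name and column names are not stated in this report, so the code reads whatever header it finds. The function name summarize_csv is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csv(csv_path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)


if __name__ == "__main__":
    # Assumption: the .csv file was written directly into output_test/papers.
    for path in Path("output_test/papers").glob("*.csv"):
        columns, n_rows = summarize_csv(path)
        print(path.name, columns, n_rows)
```

Comparing the row count with the -k value gives a quick check that every requested paper is listed.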
- cloned the repo to the local device using git: git clone https://github.com/petermr/dictionary.git
- opened the command line and changed directory: cd dictionary/pygetpapersdev
- installed the requirements using pip: pip install -r requirements.txt
- ran the help command: python main.py --help
- ran the query: python main.py -q "invasive plant species" -k 100 -m -o plantinvasivepaper
- a folder plantinvasivepaper was created, containing 100 papers in both .XML and .pdf format
- no errors
- cloned the repository to the device using the git command:
git clone https://github.com/petermr/dictionary.git
- changed the working directory:
cd dictionary/pygetpapersdev
- installed the required modules using pip:
pip install -r requirements.txt
- ran pygetpapers on the command line:
python main.py --help
C:\Users\talha hasan\dictionary\pygetpapersdev>python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
[--onlyresearcharticles | --onlypreprints | --onlyreviews]
Welcome to Pygetpapers. -h or --help for help
optional arguments:
-h, --help show this help message and exit
-q QUERY, --query QUERY
Add the query you want to search for. Enclose the query in quotes.
-k LIMIT, --limit LIMIT
Add the number of papers you want. Default =100
-o OUTPUT, --output OUTPUT
Add the output directory url. Default is the current working directory
-v, --onlyquery Only makes the query and stores the result.
-p FROMPICKLE, --frompickle FROMPICKLE
Reads the picke and makes the xml files. Takes the path to the pickle as the input
-m, --makepdf Also makes pdf files for the papers. Works only with --api method.
--api Get papers using the official EuropePMC api
--webscraping Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
or review papers.
--onlyresearcharticles
Get only research papers (Only works with --webscraping)
--onlypreprints Get only preprints (Only works with --webscraping)
--onlyreviews Get only review papers (Only works with --webscraping)
- the example query used to download test papers on the topic of plant compounds:
python main.py -q "plant compounds" -k 20 -m -o plantcomp
- This query created "plantcomp", the output directory named with -o, in the current directory.
- a folder named papers was created automatically inside "plantcomp", where both the .XML and .pdf files of the requested papers related to plant compounds were downloaded, each along with its pickle file
- no errors
- Cloned the repository to the local device using git:
git clone https://github.com/petermr/dictionary.git
- Opened cmd and changed the directory:
cd dictionary/pygetpapersdev
- Installed the required modules using pip:
pip3 install -r requirements.txt --user
(used because pip install -r requirements.txt did not work)
Requirement already satisfied: requests==2.20.0 in c:\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Collecting pandas_read_xml==0.0.9
Using cached pandas_read_xml-0.0.9-py3-none-any.whl (6.2 kB)
Collecting pandas==1.2.0
Using cached pandas-1.2.0-cp39-cp39-win_amd64.whl (9.3 MB)
Collecting lxml==4.6.2
Using cached lxml-4.6.2-cp39-cp39-win_amd64.whl (3.5 MB)
Collecting chromedriver_autoinstaller==0.2.2
Using cached chromedriver_autoinstaller-0.2.2-py3-none-any.whl (5.9 kB)
Requirement already satisfied: xmltodict==0.12.0 in c:\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Collecting selenium==3.12.0
Using cached selenium-3.12.0-py2.py3-none-any.whl (946 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: idna<2.8,>=2.5 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: zipfile36 in c:\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: distlib in c:\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Collecting pyarrow
Using cached pyarrow-3.0.0-cp39-cp39-win_amd64.whl (12.6 MB)
Collecting pytz>=2017.3
Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: numpy>=1.16.5 in c:\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Collecting python-dateutil>=2.7.3
Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting six>=1.5
Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: pyarrow, pytz, six, python-dateutil, pandas, pandas-read-xml, lxml, chromedriver-autoinstaller, selenium
WARNING: The script plasma_store.exe is installed in 'C:\Users\vasan\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script chromedriver-path.exe is installed in 'C:\Users\vasan\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed chromedriver-autoinstaller-0.2.2 lxml-4.6.2 pandas-1.2.0 pandas-read-xml-0.0.9 pyarrow-3.0.0 python-dateutil-2.8.1 pytz-2021.1 selenium-3.12.0 six-1.15.0
WARNING: You are using pip version 20.2.3; however, version 21.0.1 is available.
You should consider upgrading via the 'c:\python39\python.exe -m pip install --upgrade pip' command.
- Ran pygetpapers on the command line: python main.py --help
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping] [--onlyresearcharticles | --onlypreprints | --onlyreviews]
Welcome to Pygetpapers. -h or --help for help
optional arguments:
-h, --help show this help message and exit
-q QUERY, --query QUERY
Add the query you want to search for. Enclose the query in quotes.
-k LIMIT, --limit LIMIT
Add the number of papers you want. Default =100
-o OUTPUT, --output OUTPUT
Add the output directory url. Default is the current working directory
-v, --onlyquery Only makes the query and stores the result.
-p FROMPICKLE, --frompickle FROMPICKLE
Reads the picke and makes the xml files. Takes the path to the pickle as the input
-m, --makepdf Also makes pdf files for the papers. Works only with --api method.
--api Get papers using the official EuropePMC api
--webscraping Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints or review papers.
--onlyresearcharticles
Get only research papers (Only works with --webscraping)
--onlypreprints Get only preprints (Only works with --webscraping)
--onlyreviews Get only review papers (Only works with --webscraping)
- Example query:
python main.py -q "plant part" -k 20 -m -o plantpart
- A folder plantpart was created, containing 20 papers in both .XML and .pdf format
- No errors
- A basic step-by-step guide is needed for beginners
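As a starting point for such a guide, the steps that every tester followed can be printed as a numbered list. This is only a sketch: the helper build_query_command is hypothetical, and the flags it uses (-q, -k, -o, -m) are taken from the help output shown in these reports. The script prints the commands but does not run them.

```python
def build_query_command(query, limit, output_dir, make_pdf=True):
    """Build the argument list for a pygetpapers test query."""
    command = ["python", "main.py", "-q", query, "-k", str(limit), "-o", output_dir]
    if make_pdf:
        command.append("-m")  # also create pdf files
    return command


def as_command_line(arguments):
    """Join arguments into one line, putting double quotes around any that contain spaces."""
    return " ".join(f'"{arg}"' if " " in arg else arg for arg in arguments)


if __name__ == "__main__":
    steps = [
        ["git", "clone", "https://github.com/petermr/dictionary.git"],
        ["cd", "dictionary/pygetpapersdev"],
        ["pip", "install", "-r", "requirements.txt"],
        ["python", "main.py", "--help"],
        build_query_command("plant part", 20, "plantpart"),
    ]
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {as_command_line(step)}")
```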