pygetpapers: test reports
## Test report 1

- cloned the repository to the local computer: `git clone https://github.com/petermr/dictionary.git`
- changed the working directory using `cd dictionary/pygetpapersdev`
- installed the requirements using pip: `pip install -r requirements.txt`
- used the function `apipaperdownload()` and commented out the scraper function (a hypothetical sketch of this edit follows the results below)
- ran the entire script: `python testpapers.py`
- the folder `papers` was successfully created with the required papers
- each folder contained a pickle file and an XML file
- 9 papers were downloaded out of the 20 requested
- no errors
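A minimal sketch of what the edit to `testpapers.py` described above might have looked like; the import location is an assumption, and the scraper function's real name is not given in this report, so a placeholder is used:

```python
# Hypothetical reconstruction of the testpapers.py edit described above.
# apipaperdownload() is named in this report; where it is defined is an
# assumption, and scraper_function() is only a placeholder name.
from main import apipaperdownload  # assumed to be defined in main.py

# scraper_function()  # the scraper call, commented out for this test run
apipaperdownload()    # download the papers via the EuropePMC API instead
```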
Suggestions:

- a graphical user interface for `pygetpapers` could be considered
- more documentation is required for an easier workflow
- the whole thing could be turned into a Python library so that it can be imported easily into a script and used directly (a hypothetical sketch follows this list)
- a provision for downloading just abstracts and/or full-length papers may improve the functionality
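Purely to illustrate the library suggestion, an importable interface might look something like this; every name here (the package import, the function, and its parameters) is hypothetical and not an actual pygetpapers API:

```python
# Hypothetical library-style interface for pygetpapers (not a real API).
import pygetpapers  # assumes the project were packaged and pip-installable

# The CLI options could map directly onto function parameters.
pygetpapers.download_papers(
    query="invasive plant species",  # same as -q on the CLI
    limit=20,                        # same as -k
    output="ips_test",               # same as -o
    makepdf=True,                    # same as -m
)
```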
## Test report 2

- clone the repository: `git clone https://github.com/petermr/dictionary.git`
- open the command line and `cd` to `dictionary/pygetpapersdev`
- install the required modules using pip: `pip install -r requirements.txt`
```
C:\Users\shweata\dictionary\pygetpapersdev>pip install -r requirements.txt
Requirement already satisfied: requests==2.20.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 1)) (2.20.0)
Requirement already satisfied: pandas_read_xml==0.0.9 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 2)) (0.0.9)
Requirement already satisfied: pandas==1.2.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 3)) (1.2.0)
Requirement already satisfied: lxml==4.6.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller==0.2.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 5)) (0.2.2)
Requirement already satisfied: xmltodict==0.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 6)) (0.12.0)
Requirement already satisfied: selenium==3.12.0 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from -r requirements.txt (line 7)) (3.12.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas==1.2.0->-r requirements.txt (line 3)) (2.8.1)
Requirement already satisfied: pyarrow in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (3.0.0)
Requirement already satisfied: distlib in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.3.1)
Requirement already satisfied: zipfile36 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml==0.0.9->-r requirements.txt (line 2)) (0.1.3)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: six>=1.5 in c:\users\shweata\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas==1.2.0->-r requirements.txt (line 3)) (1.15.0)
```
- run `python main.py --help`:
```
usage: main.py [-h] [-q QUERY] [-k LIMIT] [-o OUTPUT] [-v] [-p FROMPICKLE] [-m] [--api | --webscraping]
               [--onlyresearcharticles | --onlypreprints | --onlyreviews]

Welcome to Pygetpapers. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Add the query you want to search for. Enclose the query in quotes.
  -k LIMIT, --limit LIMIT
                        Add the number of papers you want. Default =100
  -o OUTPUT, --output OUTPUT
                        Add the output directory url. Default is the current working directory
  -v, --onlyquery       Only makes the query and stores the result.
  -p FROMPICKLE, --frompickle FROMPICKLE
                        Reads the picke and makes the xml files. Takes the path to the pickle as the input
  -m, --makepdf         Also makes pdf files for the papers. Works only with --api method.
  --api                 Get papers using the official EuropePMC api
  --webscraping         Get papers using the scraping EuropePMC. Also supports getting only research papers, preprints
                        or review papers.
  --onlyresearcharticles
                        Get only research papers (Only works with --webscraping)
  --onlypreprints       Get only preprints (Only works with --webscraping)
  --onlyreviews         Get only review papers (Only works with --webscraping)
```
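The scraping route and its filters were not exercised in these reports; based on the flags documented above, a scraping-only run restricted to review papers would presumably look like:

```
python main.py -q "invasive plant species" --webscraping --onlyreviews
```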
- Example query: `python main.py -q "invasive plant species" -k 20 -m -o ips_test`
- It took around 3 min.
- The software created "ips_test" in the current directory
- A folder called `papers` was created within "ips_test", where both the `.xml` and `.pdf` files of the 20 papers were downloaded. The corresponding pickle files were also created for each paper.
- Apart from that:
  - a `.csv` file with the PMC id, HTML link, PDF link, title, and author info was created in `papers`
  - a pickle file with the information on all papers (useful if the network breaks off: the download can resume from where it left off; see the example command after this list)
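Based on the `-p / --frompickle` option in the help text above, resuming from a saved pickle would presumably be run like this (the pickle path is a placeholder, since the report does not name the actual file):

```
python main.py -p <path-to-pickle>
```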
- It might be useful to add a timer to know how much time it takes to download each paper (a sketch of one way to do this follows).
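A minimal sketch of such a per-paper timer; `download_paper()` is a hypothetical stand-in for whatever the download loop actually calls:

```python
import time

def download_paper(paper_id):
    """Hypothetical stand-in for pygetpapers' per-paper download call."""
    time.sleep(0.1)  # simulate network work

def timed_download(paper_id):
    # Wrap the download call with a timer, as suggested above.
    start = time.perf_counter()
    download_paper(paper_id)
    elapsed = time.perf_counter() - start
    print(f"{paper_id}: downloaded in {elapsed:.1f} s")

timed_download("PMC0000000")  # placeholder id
```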
## Test report 3

- cloned the repository to the device at a specific path: `git clone https://github.com/petermr/dictionary.git`
- opened the command line and changed the working directory with `cd dictionary/pygetpapersdev`
- installed the required modules using pip: `pip install -r requirements.txt`
- ran pygetpapers on the command line: `python main.py --help`
```
C:\Users\DELL\Radhu\dictionary\pygetpapersdev>python main.py --help
```

The help output was identical to the listing in the report above.
- Example query: `python main.py -q "Medicinal Activities" -k 50 -m -o output_test`
- This query created "output_test" in the current working directory.
- A folder `papers` was created within "output_test", where both the `.xml` and `.pdf` files of the 50 papers were downloaded. The corresponding pickle files were also created for each paper.
- A `.csv` file with the PMC id, HTML link, PDF link, title, and author info was created in `papers` (a sketch of one way to inspect it follows this list).
- no errors
- Some guidelines could be added so that users who are not comfortable with the command line can follow the workflow.
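Since pandas is pinned in the project's requirements, the generated `.csv` could be inspected with a few lines like these; the filename is a placeholder, as the reports do not give the actual output name:

```python
import pandas as pd

# Load the metadata CSV that pygetpapers writes into the papers folder.
# The filename below is a placeholder, not the tool's actual output name.
df = pd.read_csv("output_test/papers/metadata.csv")

print(df.columns.tolist())  # PMC id, HTML link, PDF link, title, author info
print(df.head())            # first few downloaded papers
```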