added files
ayush4921 committed Apr 4, 2022
2 parents ae79bff + 6deb5ce commit 1307369
Showing 5 changed files with 115 additions and 25 deletions.
91 changes: 72 additions & 19 deletions joss/paper.md
@@ -35,6 +35,7 @@ In 2015, we reviewed tools for scraping websites and decided that none met our n

An important aspect is to provide a simple cross-platform approach for scientists who may find tools like `curl` too complex and want a one-line command that combines the search, download, and analysis into a single "please give me the results". We've tested this on many interns, who learn `pygetpapers` in minutes. It was also easy to wrap it in a `tkinter` GUI [@tkinter]. The architecture of the results is simple and natural, based on full-text files in the normal filesystem. The result of `pygetpapers` is interfaced using a “master” JSON file (e.g. eupmc_results.json), which allows the corpus to be reused and added to. This allows maximum flexibility of re-use, and some projects have large amounts of derived data in these directories.

<div class="figure">
```
pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query
```
@@ -51,9 +52,11 @@ INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltex
```

<h2 align="center">Fig.1 Example query of `pygetpapers`</h2>

</div>
The number of repositories is rapidly expanding, driven by the rise in preprint use (both per-subject and per-country), institutional repositories, and aggregation sites such as EuropePMC, HAL, SciELO, etc. Each of these uses its own dialect of query syntax and API access. A major aspect of `pygetpapers` is to make it easy to add new repositories, often by people who have little coding experience. `pygetpapers` is built on a modular system and repository-specific code can be swapped in as needed. By configuring repositories in a configuration file, users can easily add support for new repositories.

<div class="figure">
```
[europe_pmc]
query_url=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
@@ -72,6 +75,7 @@ features_not_supported = ["filter",]
```

<h2 align="center">Fig.2 Example configuration for a repository (europePMC)</h2>
</div>

Many **searches** are simple keywords or phrases. However, these often fail to include synonyms and related phrases, and authors spend time creating complex, error-prone Boolean queries. We have developed a dictionary-based approach to automate much of the creation of complex queries.
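The principle can be shown with a minimal sketch (the term list and helper function are illustrative only; the real `pygetpapers` dictionaries are structured files with synonyms and identifiers):

```python
# Illustrative sketch: expand a flat list of dictionary terms into a
# Boolean OR query. The term list and function name are hypothetical.
def terms_to_query(terms, field=None):
    quoted = " OR ".join(f'"{term}"' for term in terms)
    return f"{field}:({quoted})" if field else f"({quoted})"

terms = ["invasive plant", "alien species", "naturalised species"]
print(terms_to_query(terms, field="METHOD"))
# METHOD:("invasive plant" OR "alien species" OR "naturalised species")
```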

@@ -88,6 +92,8 @@ We do not know of other tools which have the same functionality. `curl` [@curl]

## Data

### raw data

The download may be repository-dependent but usually contains:
* download metadata (query, date, errors, etc.)
* journal/article metadata. We use JATS-NISO [@JATS] which is widely used by publishers and repository owners, especially in bioscience and medicine. There are over 200 tags.
@@ -98,14 +104,26 @@ The download may be repository-dependent but usually contains:
- PDF - usually includes the whole material but not machine-sectioned
  - HTML - often available on websites
* supplemental data. This is very variable, often PDF but also raw data files, sometimes zipped. It is not systematically arranged but `pygetpapers` allows for some heuristics.
* figures. This is not supported by some repositories and others may require custom code.

<div class="figure">

![Fig.3 Architecture of `pygetpapers`](../resources/archietecture.png)

<h2 align="center">Fig.3 Architecture of `pygetpapers`</h2>
</div>

This directory structure is designed so that analysis tools can add computed data for articles. For this reason we create a directory structure with a root (`CProject`) and a (`CTree`) subdirectory for each downloaded article or document. `pygetpapers` will routinely populate this with 1-5 files or subdirectories (see above). At present `pygetpapers` always creates a `*_result.json` file (possibly empty) and this can be used as a marker for identifying CTrees. This means that a `CProject` contains subdirectories which may be CTrees or not, distinguished by this marker.
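For example, a downstream tool can identify the CTrees in a CProject with a few lines of Python (a sketch assuming only the marker convention just described):

```python
from pathlib import Path

def find_ctrees(cproject_dir):
    """Return the child directories that are CTrees, identified by the
    presence of a *_result.json marker file (which may be empty)."""
    return [child for child in Path(cproject_dir).iterdir()
            if child.is_dir() and any(child.glob("*_result.json"))]

# e.g. find_ctrees("invasive_plant_species_test") -> [PMC8198815, ...]
```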

### derived data

Besides the downloaded data (already quite variable) users often wish to create new derived data and this directory structure is designed so that tools can add an arbitrary amount of new data, normally in sub-directory trees. For example we have sibling projects that add data to the `CTree`:
* `docanalysis` (text analysis including NLTK and spaCy/sciSpaCy) [URL]
* `pyamiimage` (image processing and analysis of figures) [URL]

<hr/>
<div class="figure">

```
C:.
│ eupmc_results.json
@@ -118,23 +136,46 @@
│ eupmc_result.json
│ fulltext.xml
```

and with examples of derived data

```
├───PMC8198815
│ eupmc_result.json
│ fulltext.xml
|. bag_of_words.txt
|. figure/
|. raw.jpg
|. skeleton.png
├───PMC8216501
eupmc_result.json
├───10.9999_123456 # CTree due to fooRxiv_result.json
fooRxiv_result.json
│ fulltext.xml
|. bag_of_words.txt
|. search/
|. results/
|. terpenes.csv
├───PMC8309040
│ eupmc_result.json
│ fulltext.xml
└───PMC8325914
eupmc_result.json
fulltext.xml
|. univ_bar_thesis_studies_on_lantana/ # CTree due to thesis_12345_results.json
|. thesis_12345_results.json
| fulltext.pdf
|. figures/
|. figure/
| Fig1/
|.
|____summary/ # not CTree as no child *_results.json
|. bag_of_words.txt
|. figures/
| <aggregated and filtered figures>
```

<h2 align="center">Fig.4 Typical download directory</h2>
<p>Several types of download have been combined in this CProject and some CTrees have derived data.</p>
</div>



## Code
@@ -145,20 +186,21 @@ Most repository APIs provide a cursor-based approach to querying:
1. A query is sent and the repository creates a list of M hits (pointers to documents), sets a cursor start, and returns this information to the `pygetpapers` client.
2. The client requests a chunk of size N <= M (normally 25-1000) and the repository replies with N pointers to documents.
3. The server response is pages of hits (metadata) as XML, normally <= 1000 hits per page (typically about 1 sec per page).
4. `pygetpapers` incrementally aggregates the XML metadata as a Python dict in memory.
5. If the cursor indicates a next page, `pygetpapers` submits a query for the next page; otherwise it terminates the data collection and processes the Python dict (the cursor loop is sketched after this list).
6. If the user has requested supplemental data (e.g. references, citations, fulltext) then `pygetpapers` iterates through the Python dict and uses the identifier, usually a DOI, to query and download the supplemental data separately.
7. When the search is finished, `pygetpapers` writes the aggregated metadata to the `CProject` (top-level project directory) as JSON, and creates `CTrees` (per-article directories) with individual metadata.
8. It also recovers from crashes and restarts if needed.
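The loop can be sketched against the EuropePMC REST API, which exposes the cursor as a `cursorMark` parameter (a simplified illustration of the protocol, not the actual `pygetpapers` code):

```python
import time
import requests

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def fetch_all_metadata(query, page_size=100, delay=1.0):
    """Aggregate metadata for all hits of a query, one page per request."""
    records, cursor = [], "*"                      # "*" requests the first page
    while True:
        response = requests.get(EPMC_SEARCH, params={
            "query": query, "format": "json",
            "pageSize": page_size, "cursorMark": cursor})
        response.raise_for_status()
        data = response.json()
        records.extend(data["resultList"]["result"])
        next_cursor = data.get("nextCursorMark")
        if not next_cursor or next_cursor == cursor:  # last page reached
            return records
        cursor = next_cursor
        time.sleep(delay)                           # be polite to the server
```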

The control module `pygetpapers.py` reads the command line and
* Selects the repository-specific downloader
* Creates a query from user input and/or terms from dictionaries
* Adds options and constraints
* Downloads according to the protocol above, including recording progress in a metadata file

# Generic downloading concerns

* Download speeds. Excessively rapid or voluminous downloads can overload servers and are sometimes hostile (resembling a DOS attack). We have discussed this with major sites (EPMC, biorXiv, Crossref, etc.) and therefore choose to download sequentially instead of sending parallel requests in `pygetpapers`.
* Authentication (identifying the downloader to the repository via a request header). `pygetpapers` supports anonymous, non-authenticated access but includes an identifying header where required (e.g. for Crossref); see the sketch after this list.
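In practice both concerns reduce to a few lines (a sketch of the general pattern; the exact `pygetpapers` defaults and header contents differ per repository, and the User-Agent string here is illustrative):

```python
import time
import requests

# Illustrative header: Crossref etiquette asks clients to identify
# themselves, e.g. with a mailto address.
HEADERS = {"User-Agent": "pygetpapers/1.x (mailto:maintainer@example.org)"}
DELAY_SECONDS = 1.0  # fixed pause between sequential requests

def polite_get(url, params=None):
    """One identified request at a time, never in parallel bursts."""
    response = requests.get(url, params=params, headers=HEADERS)
    response.raise_for_status()
    time.sleep(DELAY_SECONDS)   # throttle so the server is never hit in bursts
    return response
```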

# Design
@@ -172,6 +214,18 @@

`getpapers` was implemented in `NodeJS`, which allows multithreading and therefore potentially download rates of several XML documents per second on a fast line. Installing `NodeJS` was a problem on some systems (especially Windows) and it was not well suited for integration with scientific libraries (mainly coded in Java and Python). We therefore decided to rewrite it in Python, keeping only the command line and output structure, and have found very easy integration with other tools, including GUIs. `pygetpapers` can be run both as a command-line tool and as a module, which makes it versatile.

## core
The core mainly consists of:
* `pygetpapers.py` (query-builder and runner). This includes query abstractions such as dates and Boolean queries for terms
* `download_tools.py` (generic code for query/download (REST))

## repository interfaces
We have tried to minimise the amount of repository-specific code, choosing to use declarative configuration files. To add a new repository you will need to:
* create a configuration file (Fig. 2)
* subclass the repo from `repository_interface.py`
* add any repository-specific code to add features or disable others (a minimal class sketch follows this list)
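A new repository class can then be very small. In the sketch below, `FooRxiv` and its configuration key are hypothetical; the `RepositoryInterface` import and the method name follow the existing classes such as `Arxiv`, while the `DownloadTools` import path is an assumption based on the core module names above:

```python
from pygetpapers.download_tools import DownloadTools
from pygetpapers.repositoryinterface import RepositoryInterface

FOORXIV = "foorxiv"  # name of the [foorxiv] section in the configuration file

class FooRxiv(RepositoryInterface):
    """Hypothetical repository used to illustrate the plug-in pattern."""

    def __init__(self):
        # DownloadTools picks up query_url etc. from the configuration section
        self.download_tools = DownloadTools(FOORXIV)

    def apipaperdownload(self, query_namespace):
        # Translate the generic query_namespace into this repository's query
        # dialect, then delegate paging/downloading to self.download_tools.
        raise NotImplementedError("repository-specific query translation")
```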


# Interface with other tools

Downloading is naturally modular, rather slow, and we interface by writing all output to the filesystem. This means that a wide range of tools (Unix, Windows, Java, Python, etc.) can analyze and transform it. The target documents are usually static so downloads only need to be done once.
@@ -183,7 +237,6 @@ Among our own downstream tools are

# Acknowledgements

We thank Dr. Peter Murray-Rust for his support and help with the design of the manuscript.



12 changes: 10 additions & 2 deletions pygetpapers/repository/arxiv.py
@@ -50,7 +50,15 @@
from pygetpapers.repositoryinterface import RepositoryInterface

class Arxiv(RepositoryInterface):
"""Arxiv class which handles arxiv repository. It uses arxiv repository wrapper to make its query(check https://github.com/lukasschwab/arxiv.py)"""
"""arxiv.org repository
This uses a PyPI code `arxiv` to download metadata. It is not clear whether this is
created by the `arXiv` project or layered on top of the public API.
`arXiv` current practice for bulk data download (e.g. PDFs) is described in
https://arxiv.org/help/bulk_data. Please be considerate and also include a rate limit.
"""

def __init__(self):
self.download_tools = DownloadTools(ARXIV)
@@ -178,4 +186,4 @@ def apipaperdownload(self, query_namespace):
makecsv=query_namespace["makecsv"],
makexml=query_namespace["xml"],
makehtml=query_namespace["makehtml"],
)
)
18 changes: 17 additions & 1 deletion pygetpapers/repository/europe_pmc.py
@@ -66,7 +66,23 @@
HTML = "html"

class EuropePmc(RepositoryInterface):
""" """
""" Downloads metadata and optionally fulltext from https://europepmc.org"""

"""Can optionally download supplemental author data, the content of which is irregular and
not weell specified.
For articles with figures, the links to the figures on the EPMC site are included in the fulltext.xml
but the figures are NOT included. (We have are adding this functionality to our `docanalysis` and `pyamiimage`
codes.
In some cases a "zip" file is provided by EPMC which does contain figures in the paper and supplemntal author data;
this can be downloaded.
EPMC has a number of additional services including:
- references and citations denoted by 3-letter codes
pygetpapers can translate a standard date into EPMC format and include it in the query.
"""

def __init__(self):
self.download_tools = DownloadTools(EUROPEPMC)
7 changes: 6 additions & 1 deletion pygetpapers/repository/rxiv.py
@@ -28,7 +28,12 @@


class Rxiv(RepositoryInterface):
"""Rxiv class which handles Biorxiv and Medrxiv repository"""
"""Biorxiv and Medrxiv repositories
At present (2022-03) the API appears only to support date searches.
The `rxivist` system is layered on top and supports fuller queries
"""

def __init__(self,api="biorxiv"):
"""initiate Rxiv class"""
12 changes: 10 additions & 2 deletions pygetpapers/repository/rxivist.py
@@ -32,7 +32,15 @@


class Rxivist(RepositoryInterface):
"""Rxivist class which handles the rxivist wrapper"""
"""Rxivist wrapper for biorxiv and medrxiv
From the site (rxivist.org):
"Rxivist combines biology preprints from bioRxiv and medRxiv with data from Twitter
to help you find the papers being discussed in your field."
Appears to be metadata-only. To get full-text you may have to submit the IDs to biorxiv or medrxiv
or EPMC as this aggregates preprints.
"""

def __init__(self):
self.download_tools = DownloadTools(RXIVIST)
@@ -189,4 +197,4 @@ def noexecute(self, query_namespace):
result_dict = self.rxivist(query_namespace.query, size=10)
results = result_dict[NEW_RESULTS]
totalhits = results[TOTAL_HITS]
logging.info("Total number of hits for the query are %s", totalhits)
logging.info("Total number of hits for the query are %s", totalhits)
